Speech Perception, Production and Acquisition: Multidisciplinary approaches in Chinese languages [1st ed.] 9789811576058, 9789811576065

This book addresses important issues of speech processing and language learning in Chinese.


English Pages VIII, 279 [277] Year 2020


Table of contents :
Front Matter ....Pages i-viii
Introduction (Huei-Mei Liu, Feng-Ming Tsao, Ping Li)....Pages 1-7
Front Matter ....Pages 9-9
The Phonetic Realizations of the Mandarin Phoneme Inventory: The Canonical and the Variants (Janice Fon)....Pages 11-36
Acoustic-Based and Knowledge-Based Processing of Mandarin Tones by Native and Non-native Speakers (Chao-Yang Lee, Seth Wiener)....Pages 37-57
Individual Differences in Lexical Tone Learning (Erin M. Ingvalson, Patrick C. M. Wong)....Pages 59-75
Front Matter ....Pages 77-77
Native and Nonnative Processing of Acoustic and Phonological Information of Lexical Tones in Chinese: Behavioral and Neural Correlates (Keke Yu, Ruiming Wang, Ping Li)....Pages 79-99
Neurophysiological Studies of Mandarin Lexical Tone Acquisition in Early Childhood (Chia-Ying Lee, Ying-Ying Cheng)....Pages 101-116
Neural Processing of Tone Sandhi in Production and Perception: The Case of Mandarin Tone 3 Sandhi (Claire H. C. Chang, Wen-Jui Kuo)....Pages 117-135
Front Matter ....Pages 137-137
The Effect of Musical Experience and Congenital Amusia on Lexical Tone Perception, Production, and Learning: A Review (Jia Hoong Ong, Shen Hui Tan, Alice H. D. Chan, Francis C. K. Wong)....Pages 139-158
Multi-Modal Perception of Tone (Yue Wang, Joan A. Sereno, Allard Jongman)....Pages 159-173
Front Matter ....Pages 175-175
Lexical-Tonal Perception Development in Infancy (Feng-Ming Tsao, Huei-Mei Liu)....Pages 177-197
Early Word Recognition and Word Learning in Mandarin Learning Children (Leher Singh)....Pages 199-218
Speech Development in Mandarin-Speaking Children (Gang Peng, Fei Chen)....Pages 219-242
Behavioral and Neurophysiological Evidence of Speech Processing in Chinese-Speaking Individuals with Autism Spectrum Disorder: A Review and Future Directions (Yan H. Yu, Valerie L. Shafer)....Pages 243-279

Chinese Language Learning Sciences

Huei‐Mei Liu • Feng‐Ming Tsao • Ping Li (Editors)

Speech Perception, Production and Acquisition Multidisciplinary approaches in Chinese languages

Chinese Language Learning Sciences

Series Editors
Chin-Chuan Cheng, Department of Linguistics, University of Illinois, Urbana, IL, USA
Kuo-En Chang, Graduate Institute of Information and Computer Education, National Taiwan Normal University, Taipei, Taiwan
Yao-Ting Sung, Department of Educational Psychology and Counseling, National Taiwan Normal University, Taipei, Taiwan
Ping Li, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, Hong Kong

This book series investigates several critical issues embedded in fundamental, technical, and applied research in the field of Chinese as a second language (CSL) learning and teaching, including learning mechanisms in the brain and technology applications for teaching, learning, and assessment. The book series discusses these issues from the perspectives of science (an evidence-based approach) and technology. The studies in the book series use methods from the fields of linguistics (such as corpus linguistics and computational linguistics), psychological and behavioural sciences (such as experimental design and statistical analyses), information technology (such as information retrieval and natural language processing), and brain sciences (such as neuroimaging and neurolinguistics). The book series generally covers three main interdisciplinary themes: (1) fundamental investigation of Chinese as a first or second language acquisition, (2) development in Chinese language learning technology, and (3) applied research on Chinese language education. More specifically, the book series involves the following research topics:

– language transfer mechanism in Chinese as a second language
– factors of Chinese as a second language acquisition in childhood
– cultural influence on Chinese acquisition
– information technology, corpus
– teaching material design
– teaching strategies and teacher training
– learning models
– assessment methods

Please contact Melody Zhang (e-mail: [email protected]) for submitting book proposals for this series.

More information about this series at http://www.springer.com/series/13176

Huei‐Mei Liu • Feng‐Ming Tsao • Ping Li
Editors

Speech Perception, Production and Acquisition
Multidisciplinary approaches in Chinese languages

Editors Huei‐Mei Liu National Taiwan Normal University Taipei, Taiwan

Feng‐Ming Tsao National Taiwan University Taipei, Taiwan

Ping Li The Hong Kong Polytechnic University Hong Kong, Hong Kong

ISSN 2520-1719  ISSN 2520-1727 (electronic)
Chinese Language Learning Sciences
ISBN 978-981-15-7605-8  ISBN 978-981-15-7606-5 (eBook)
https://doi.org/10.1007/978-981-15-7606-5

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

Over the last several decades, a significant amount of research has gone into the study of speech learning and processing in Chinese, especially with regard to the processing of tones in Chinese (including Mandarin, Cantonese, Southern Min, and other major Chinese dialects). Against the backdrop of this research, we are pleased to present to our readers this volume as a synthesis and a roadmap to research in speech perception, production, and learning in the Chinese language context, with multidisciplinary research topics, approaches, and methodologies. The idea of this book originated from discussions in the context of the book series Chinese Language Learning Sciences, an ambitious Springer project initiated at the National Taiwan Normal University (NTNU). When the book series was first launched, one of the editors of this volume (PL) was on sabbatical leave visiting NTNU and had a chance to frequently discuss common interests with the other two editors (FT and HL) with regard to the topics of speech perception in children and adults, in native and non-native languages, and in typical and atypical language development. We felt that a book that could cover this ground, with special reference to lexical tone perception in Chinese, would significantly help researchers, in particular young scholars and junior faculty members at institutions in Asia and other parts of the world, to become familiar with the field and to carry out further exciting work on the basis of the extant literature. Thus, we invited leading scholars from linguistics, psychology, cognitive neuroscience, and communication disorders to contribute to this volume. We specifically highlighted four major domains of work, including basic cognitive processes, neural representations, domain-general transfer and cross-modal integration, and development of speech from infancy to adulthood.
Although the domains are not meant to be exhaustive of the large literature, we hope that this book represents some of the most exciting ongoing work and serves its purpose as both an overview of major research questions and a roadmap for future research. We are grateful to the book series editor Prof. Yao-Ting Sung at NTNU for his constant encouragement and support, which made this volume possible in the first place. We would also like to thank Lawrence Liu, Lay Peng Ang, and Melody Zhang at Springer for their support and patience during the editing of this book.

We thank the Institute for Research Excellence in Learning Sciences, NTNU, for their support. We also thank the reviewers, Yang Zhang (University of Minnesota), Linjun Zhang (Beijing Language and Culture University), and Christina Zhao (University of Washington), for providing suggestions that helped improve this book. Finally, the book would not have been possible without the hard work and strong commitment of the many contributing authors, who have gone through their manuscripts many times to ensure readability and accuracy. Needless to say, there may still be errors or areas for improvement, and we welcome readers from the community to provide critical comments and evaluation.

Taipei, Taiwan
State College, USA
Taipei, Taiwan

Huei‐Mei Liu
Ping Li
Feng‐Ming Tsao

Contents

1  Introduction . . . . . . . . . . 1
   Huei-Mei Liu, Feng-Ming Tsao, and Ping Li

Part I  Acoustics, Perception, and Production of Lexical Tones (in Adults)

2  The Phonetic Realizations of the Mandarin Phoneme Inventory: The Canonical and the Variants . . . . . . . . . . 11
   Janice Fon

3  Acoustic-Based and Knowledge-Based Processing of Mandarin Tones by Native and Non-native Speakers . . . . . . . . . . 37
   Chao-Yang Lee and Seth Wiener

4  Individual Differences in Lexical Tone Learning . . . . . . . . . . 59
   Erin M. Ingvalson and Patrick C. M. Wong

Part II  Neural Representations

5  Native and Nonnative Processing of Acoustic and Phonological Information of Lexical Tones in Chinese: Behavioral and Neural Correlates . . . . . . . . . . 79
   Keke Yu, Ruiming Wang, and Ping Li

6  Neurophysiological Studies of Mandarin Lexical Tone Acquisition in Early Childhood . . . . . . . . . . 101
   Chia-Ying Lee and Ying-Ying Cheng

7  Neural Processing of Tone Sandhi in Production and Perception: The Case of Mandarin Tone 3 Sandhi . . . . . . . . . . 117
   Claire H. C. Chang and Wen-Jui Kuo

Part III  Domain-General Transfer and Cross-Modal Integration

8  The Effect of Musical Experience and Congenital Amusia on Lexical Tone Perception, Production, and Learning: A Review . . . . . . . . . . 139
   Jia Hoong Ong, Shen Hui Tan, Alice H. D. Chan, and Francis C. K. Wong

9  Multi-Modal Perception of Tone . . . . . . . . . . 159
   Yue Wang, Joan A. Sereno, and Allard Jongman

Part IV  Development from Infancy Through Childhood

10  Lexical-Tonal Perception Development in Infancy . . . . . . . . . . 177
    Feng-Ming Tsao and Huei-Mei Liu

11  Early Word Recognition and Word Learning in Mandarin Learning Children . . . . . . . . . . 199
    Leher Singh

12  Speech Development in Mandarin-Speaking Children . . . . . . . . . . 219
    Gang Peng and Fei Chen

13  Behavioral and Neurophysiological Evidence of Speech Processing in Chinese-Speaking Individuals with Autism Spectrum Disorder: A Review and Future Directions . . . . . . . . . . 243
    Yan H. Yu and Valerie L. Shafer

Chapter 1

Introduction

Huei-Mei Liu, Feng-Ming Tsao, and Ping Li

This handbook brings together 12 chapters written by leading scholars in the field who provide perspectives and syntheses of important issues in speech processing and language learning in Chinese, with particular reference to lexical tones. In this book, the interdisciplinary nature of the field is reflected in the diverse expertise of the authors and the fields that they represent, including linguistics, psychology, cognitive neuroscience, education, and communication disorders. The research designs and methodologies used by researchers in this field are multidimensional, including but not limited to paradigms and methods from the humanities, social sciences, computational science, neuroscience, and genetics, as clearly illustrated by the many studies either conducted or reviewed by the authors of this volume. In this introductory chapter, we discuss general issues addressed in different chapters, present the organization of this book, and provide an overview of each chapter. We hope that readers will take our introduction as a starting point to read the individual chapters and then consider examining the specific research issues further.

This volume is organized into four major areas or approaches. First, the most basic topics in this field of study are the phonological features and perceptual representations of speech sounds in Chinese, and therefore the first part of this volume presents analyses of acoustics, perception, and production of lexical tones in native Chinese speakers and in adult learners of Chinese as a second language (CSL). Second, in addition to behavioral approaches, the rapid development and use of non-invasive neuroimaging tools to study the processing of lexical tones
not only help to reveal neurocognitive mechanisms but also provide convergent evidence that is consistent with results of behavioral approaches to support models of speech processing. Thus, the second part of this volume reviews studies on neural and cognitive representations of lexical tones in adults and children, using various neuroimaging methods, e.g., fMRI, EEG/ERP, and MEG. Third, speech sounds are acoustic signals of spoken words and are mainly processed through auditory systems; however, visual cues during speech sound production also contribute to the perception of speech sounds (cf. the McGurk effect of phonetic perception). Additionally, domain-general learning mechanisms facilitate phonetic learning. Cross-modal integration of speech perception and domain-general transfer are also evident with lexical tones. Given these effects in the literature, the third part of this volume demonstrates cross-modal integration and domain-general transfer of lexical tone processing, for example, with studies that examine the effects of visual cues of lexical tone production and music training effects on the perception of lexical tones. Finally, the scope of this volume would not be complete, if reviews on development of speech perception and production were not included. Studies of the production and perception in infants and children not only demonstrate developmental trends but also reveal underlying mechanisms that facilitate perceptual learning and production in native and non-native speakers of Mandarin Chinese. Thus, the final part of this volume covers speech perception and production development of Chinese languages from infancy through childhood, including studies exploring both monolingual and bilingual children, and also studies of typically developing children and children with communication disorders (e.g., autistic spectrum disorder). 
Each chapter in this volume reviews a subfield of studies with research methods, empirical data, and theoretical explanations to address essential issues of Chinese speech processing. There are multiple research questions that span the chapters of this book. For example, effects of learning a native language on the processing of lexical tones are extensive, as illustrated in lifespan development of tone perception and production from infants (Chaps. 10, 11), children (Chaps. 6, 12, 13), and adult native speakers of tonal and non-tonal languages in native and second language learners (Chaps. 2–5, 7–9). Besides the general issue of language-learning experience and its impact, several chapters focus on issues that are particularly of relevance to lexical tones through acoustic/prosodic features of the tones. Compared with phonological features of non-tonal languages, tone sandhi is a special feature of tonal languages, and several chapters discuss this effect for which the phonological context affects the production of lexical tones (Chaps. 2, 7). Additionally, effects of tone sandhi on neural representation (Chap. 7) and word perception development in children (Chap. 11) are reviewed and syntheses provided. The acoustic/prosodic features of lexical tones (i.e., fundamental frequency and pitch) are also clearly topics of research interest and their impact on both music and linguistic prosody are examined in several chapters of the volume: for example, some studies have investigated the transfer effects from music to language-learning in infants (Chap. 10), non-tonal language speaking adults with amusia or with music training (Chap. 8), and the interaction between lexical tone and linguistic prosody in children with ASD (autistic spectrum disorder) who produce inappropriate intonations (Chap. 13).


These examples indicate that the chapters in this volume showcase a variety of research topics, approaches, and methodologies. The scope of this book goes beyond traditional studies of language teaching and learning or developmental studies of first language acquisition. It highlights learning, perception, and production of speech, from preverbal infants learning their first language to adults learning Chinese as a second language. While we recognize that speech perception in the Chinese context is a large and rapidly growing field and therefore it is impossible to represent all the exciting research in this volume, we believe that our handbook has provided a balanced view of the most important issues currently under investigation. We hope that readers will appreciate the significance of the presented issues in this volume for both theory building and applications in science, technology, education, and practice. Below we provide a quick overview of the chapters and the organization of the volume.

1.1 Part I: Acoustics, Perception, and Production of Lexical Tones (in Adults)

Fon (Chap. 2) introduces the phonetic features of vowels, consonants, and lexical tones in Mandarin Chinese. In addition to depicting acoustic features of individual phonemes, Fon also presents major allophonic rules. Beyond the phonetic features common to Mandarin speakers across many geographic regions, several phonetic features of Mandarin interact with other languages, and this dynamic language interaction results in local variants of consonants, vowels, and lexical tones across different regions. Readers can take this chapter as a first step toward understanding the linguistic features relevant to studying the mechanisms of speech perception and production.

Lee and Wiener (Chap. 3) review both top-down and bottom-up processes that are essential to lexical tone and spoken word perception. Regarding bottom-up processing, acoustic features (e.g., F0) of lexical tones vary with phonetic context, and native Mandarin speakers perceive F0 contours of lexical tones by taking tonal coarticulation into account. Regarding top-down processing, Mandarin speakers track multiple sources of knowledge-based information, from syllable-tone co-occurrence probabilities to tonal neighborhood density, syllable-tone lexical frequency, and lexical tone categorization. In sum, both acoustic-based (i.e., bottom-up) and knowledge-based (i.e., top-down) processes are used to achieve speech perception and accurate spoken word recognition, for lexical tones as for phonetic segments.

There is increasing interest in studying non-native in addition to native speakers' processing of lexical tones, as seen in this book. Ingvalson and Wong (Chap. 4) review tone training studies in second-language learners. Their chapter provides evidence that sources of individual differences in tone learning could be neurophysiological, neuroanatomic, or even genetic.
They also present how individual differences in tone training effects could be optimized by adjusting acoustic variations of tone stimuli to match learners' aptitude before training. Studies discussed in this chapter show that exposure to lexical tones in a single-speaker condition during a tone training program is effective for low-aptitude learners, whereas a multiple-speaker condition is optimal for high-aptitude learners. These studies point to important new and promising directions, including research focused on second-language learners who vary in their aptitude and sensitivity to pitch, to musical pitches (e.g., people with congenital amusia), and those who have lost auditory sensitivity due to aging.
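The syllable-tone co-occurrence probabilities that Lee and Wiener describe as a knowledge-based cue can be made concrete with a toy computation. The sketch below is illustrative only and not drawn from the book: the mini-corpus of (syllable, tone) tokens is invented, and it simply estimates how often a syllable carries each tone in a sample of labeled tokens.

```python
# Toy illustration of syllable-tone co-occurrence probability.
# The corpus is hypothetical: (syllable, tone) tokens as a listener
# might accumulate them from experience.
from collections import Counter

corpus = [("ma", 1), ("ma", 1), ("ma", 3),
          ("shi", 4), ("shi", 4), ("shi", 2)]

pair_counts = Counter(corpus)                # counts of (syllable, tone) pairs
syll_counts = Counter(s for s, _ in corpus)  # counts of syllables alone

def p_tone_given_syllable(syllable, tone):
    """Relative frequency of `tone` among tokens of `syllable`."""
    total = syll_counts[syllable]
    return pair_counts[(syllable, tone)] / total if total else 0.0

print(p_tone_given_syllable("ma", 1))   # 2/3 of "ma" tokens carry Tone 1 here
print(p_tone_given_syllable("ma", 4))   # 0.0: pairing unattested in this corpus
```

A listener tracking such statistics could, on this view, anticipate the most probable tone for a syllable before its F0 contour fully unfolds.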

1.2 Part II: Neural Representations

Yu, Wang, and Li (Chap. 5) provide a review of the role of acoustic versus phonological information in lexical tones and summarize fMRI and ERP results that bear on the processing of tones by native Chinese speakers and second-language learners of Chinese. They also discuss the issue of brain lateralization: for example, native processing of acoustic information shows right-hemispheric lateralization, whereas native processing of phonological information shows left-hemispheric lateralization. By contrast, non-native processing patterns are inconsistent at first. Additionally, it is hypothesized that the hemispheric lateralization of non-native processing of lexical tones shifts as second-language (L2) learners improve their L2 proficiency. The time course of processing is still an open issue: for example, whether acoustic information of lexical tones is processed earlier than phonological information in native speakers' processing of tones. Further, the functional connectivity between brain regions for specific acoustic cues (pitch height and pitch contour) in both native Mandarin speakers and non-native L2 learners requires further investigation.

Lee and Cheng (Chap. 6) present neurophysiological studies that use event-related potentials (ERPs) as measures of perceptual discrimination to depict the developmental trends of lexical tone perception from infancy through middle childhood. The presence of the mismatch negativity (MMN), a pre-attentive ERP component, indicates children's ability to discriminate between lexical tones. In early infancy, MMNs are observed when infants distinguish acoustically distinct tone contrasts, whereas a positive mismatch response (P-MMR) is observed in the discrimination of acoustically similar contrasts.
In addition to typically developing children, this chapter also reviews ERP studies of lexical tone perception in children experiencing language-learning difficulty, such as late-talking preschoolers or school-aged children with reading impairment, and the authors propose that several ERP components could serve as neural markers to identify children at risk for later language-development impairment.

Chang and Kuo (Chap. 7) focus on the neural representations of perceiving and producing tone sandhi in tone-language-speaking adults. As discussed earlier, the rules of tone sandhi are context-dependent phonological rules for producing lexical tones in multi-syllabic words. The majority of tone sandhi studies in the past have explored the rule of Mandarin Tone 3 sandhi in disyllables, i.e., Tone 3 on the first syllable is produced as Tone 2 (i.e., 33 → 23) when it is followed by a second syllable that also carries Tone 3. The authors show that neuroimaging data on the production of Tone 3 sandhi implicate the right posterior inferior frontal gyrus (pIFG), and its interaction with other areas in the right hemisphere, in the overt production of tone sandhi. Additionally, the right IFG is also involved in the perception of lexical tones. Several issues need further investigation, including how the brain represents the effects of linguistic context (e.g., learning experience with a tone language) on the processing of Tone 3 sandhi.
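The disyllabic Tone 3 sandhi rule (33 → 23) is simple enough to state as a procedure. The following sketch is a minimal illustration, not taken from the chapter: it applies the rule against the underlying tone string and deliberately ignores the prosodic and syntactic structure that governs sandhi application in longer utterances.

```python
def apply_t3_sandhi(tones):
    """Apply the disyllabic Mandarin Tone 3 sandhi rule to a tone sequence.

    `tones` is a list of tone numbers (1-4). A Tone 3 immediately followed
    by another underlying Tone 3 surfaces as Tone 2. Checking the underlying
    (not the already-changed) following tone keeps this toy pass deterministic.
    """
    out = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            out[i] = 2
    return out

print(apply_t3_sandhi([3, 3]))     # [2, 3]   e.g., "ni3 hao3" -> "ni2 hao3"
print(apply_t3_sandhi([3, 3, 3]))  # [2, 2, 3] under this naive underlying-tone pass
print(apply_t3_sandhi([1, 3, 4]))  # [1, 3, 4] no change
```

Note that for three or more consecutive Tone 3 syllables, real speakers may produce either 223 or 323 depending on prosodic grouping, which is exactly the kind of context dependence the naive pass above cannot capture.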

1.3 Part III: Domain-General Transfer and Cross-Modal Integration

Ong, Tan, Chan, and Wong (Chap. 8) highlight the transfer from music training to lexical tone perception and review behavioral and neuroimaging studies that compare three groups of participants: listeners with extensive musical training, listeners with musical disorders such as amusia, and naïve listeners. Their review shows clear domain-general transfer: formal musical training benefits non-tonal language speakers both in perceiving and in producing lexical tones. Tone-language-speaking amusics tend to show impairment in musical pitch processing as well as in tone perception. Explanations of music-to-lexical-tone transfer could rely on shared neural mechanisms for pitch processing in both lexical tone and music, as well as on domain-general cognitive improvements (e.g., auditory memory). Future directions in music-to-lexical-tone transfer include setting criteria to clearly define musicians and patients with amusia in normative and neurocognitive studies.

Wang, Sereno, and Jongman (Chap. 9) survey the cross-modal integration effects of visual cues on lexical tone production and perception. When speakers produce lexical tones, head, jaw, eyebrow, and lip movements are aligned with the spatial and temporal movement trajectories of specific tones. Perceptual findings show that facial and hand gestures improve tone intelligibility when they correspond to the tones being produced, and these benefits can be augmented by linguistic experience. In addition, greater visual benefits are found for contour tones (e.g., Tone 3 in Mandarin) than for flat tones (e.g., Tone 1 in Mandarin). Such findings suggest language-specific mechanisms in the cross-modal integration of tone production and perception. Future studies should explore the specific visual cues that can benefit the production of individual tones and evaluate how visual tonal cues facilitate perceptual categorization of lexical tones.
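The contour-versus-flat distinction mentioned above can be pictured with stylized F0 tracks. The sketch below is illustrative and not from the book: it interpolates the conventional Chao tone-letter values for the four Mandarin tones (T1 = 55 high level, T2 = 35 rising, T3 = 214 low dipping, T4 = 51 falling) into schematic F0 curves; the base frequency and semitone spacing are arbitrary assumptions, and real F0 varies with speaker, context, and coarticulation.

```python
# Schematic (stylized) F0 contours for the four Mandarin tones, built by
# interpolating Chao tone-letter pitch targets on a 1-5 scale and mapping
# the scale to Hz on a semitone grid centered at step 3.
CHAO = {1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1]}

def schematic_f0(tone, base_hz=180.0, semitones_per_step=1.5, n=50):
    """Return `n` F0 samples (Hz) tracing the stylized contour of `tone`."""
    targets = CHAO[tone]
    track = []
    for i in range(n):
        pos = i * (len(targets) - 1) / (n - 1)   # position along the target list
        j = min(int(pos), len(targets) - 2)
        step = targets[j] + (pos - j) * (targets[j + 1] - targets[j])
        # Map pitch step (1-5) to Hz: step 3 sits at base_hz.
        track.append(base_hz * 2.0 ** ((step - 3) * semitones_per_step / 12.0))
    return track

t1, t2, t4 = schematic_f0(1), schematic_f0(2), schematic_f0(4)
print(t1[0] == t1[-1])   # True: Tone 1 is level
print(t2[0] < t2[-1])    # True: Tone 2 rises
print(t4[0] > t4[-1])    # True: Tone 4 falls
```

Plotting these tracks makes it visually obvious why Tone 3, with its dip and rebound, offers more dynamic movement for visual cues to align with than the flat Tone 1.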

1.4 Part IV: Development from Infancy Through Childhood

Tsao and Liu (Chap. 10) provide an overview of studies on tone perception development in infants learning a tonal language (e.g., Mandarin and Cantonese) or a non-tonal language (e.g., English and Dutch). Infants learning a non-tonal language are able to discriminate tone contrasts of foreign languages at 4–6 months of age, but they lose this discrimination ability by 9–12 months. Non-tonal-language learners regain the sensitivity to distinguish tone contrasts at 18 months of age. Tone-language-learning infants improve their ability to discriminate native contrasts around the first birthday. Developmental factors in the perceptual development of lexical tones include listening to a tone language, the acoustic salience of lexical tones, statistical learning, music-to-tone transfer, and referential word learning. The authors point to the significance of assessing the effect of short-term music exposure on the perception of lexical tones, which could be one important future direction for the field of cross-domain tone learning in infancy.

Expanding the study of speech perception development from lexical tone perception to word perception, Singh (Chap. 11) reviews studies on how lexical tones influence fundamental processes in the development of the mental lexicon, i.e., word segmentation, word recognition, and word learning, in both bilingual and monolingual learners of Mandarin. Studies reveal developmental differences between the course of tone acquisition and that of phonetic segments in early lexical processes. To fully account for early lexical development in childhood, theoretical models need to take into consideration not only the influences of consonants and vowels in lexical processes, but also the influence of lexical tones on how children represent the phonological details of novel words from infancy through early childhood.
Further studies on lexical development in both monolingual and bilingual children are needed to provide empirical evidence to evaluate these models.

Peng and Chen (Chap. 12) summarize the phonological development of consonants, vowels, and lexical tones in Mandarin-speaking children from infancy through middle childhood. In addition to depicting the developmental time course, the authors discuss the various factors, including phonological markedness, the role of different phonetic units within a given language, child-directed speech, and articulation complexity, that phonological theories invoke to account for orders of development. They point to several issues that could be addressed in future studies, including the need to evaluate the effects of factors contributing to developmental order on individual phonological acquisition, as well as the importance of assessing how links between perception and production facilitate phonological development in Mandarin-speaking children.

In addition to these chapters that focus on typical language development, in the final chapter, Yu and Shafer (Chap. 13) review behavioral and neurological evidence of prosody and lexical tone processing in Chinese-speaking adults and children with ASD. Chinese-speaking individuals with ASD produce atypical speech prosody and form less accurate perceptions of lexical tones. They also exhibit atypical auditory and language processes, e.g., enhanced pitch processing and impaired linguistic prosody processing. Yu and Shafer further report preliminary neural data on bilingual English–Mandarin-speaking children with ASD. This chapter shows that future directions on the lexical tone development of Chinese-speaking children with ASD should explore developmental trajectories with regard to various aspects of linguistic prosody, evaluate music-to-lexical-tone transfer effects, and design language intervention programs based on neuroplasticity and on theoretical models from the study of theory of mind, a major cognitive account that attributes communication impairments in ASD to a reduced ability to reason about other people's beliefs, intentions, and emotions.

Part I

Acoustics, Perception, and Production of Lexical Tones (in Adults)

Chapter 2

The Phonetic Realizations of the Mandarin Phoneme Inventory: The Canonical and the Variants

Janice Fon

Abstract This chapter provides an overview of the phonetic realizations of the Mandarin phoneme inventory. There are two major sections. The first is a general description of the phoneme inventory, which includes five vowels and 19 consonants, along with four lexical tones. In addition to acoustic properties of individual phonemes, major allophonic rules are also discussed. The second section covers some variations on consonants, vowels, and tones in three major Mandarin varieties of Taiwan, Singapore, and China. Some variations are fairly region-specific, while others are more commonly found across various dialects. The former includes the retroflexed vowel suffix in the Mainland variety, the qualitative difference in realizing the neutral tone between the Taiwan and the Mainland variety, and the so-called fifth tone in Singapore Mandarin. The latter includes the deretroflexion and hypercorrection of sibilants, both of which can be found in Taiwan and Singapore Mandarin, and the syllable-final nasal mergers, which can be found in all three major dialects. Interestingly, these cross-dialectal variations also show large within-region variabilities. Both the canonical and the variant realizations of the phonological system are discussed in light of child language acquisition.

Mandarin is by far the most widely spoken native language in the world. As of 2013, it had an estimated L1 population of close to 900 million (Lewis, Simons, & Fennig, 2016), accounting for more than 12% of the world population at the time (cf. Population Reference Bureau, 2013). It is the sole official language of Taiwan and China, and is one of the four official languages of Singapore. This chapter contains two main sections. The first is devoted to an overview of the phonetic realizations of Mandarin phonology, and the second focuses on some variations in the three major Mandarin varieties of Taiwan, Singapore, and China. All three main aspects of phonology, i.e., vowels, consonants, and tones, are discussed.

J. Fon (B) Graduate Institute of Linguistics, National Taiwan University, Taipei, Taiwan e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2020 H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_2


Fig. 2.1 Three different proposals for the Mandarin vowel system

2.1 The Phoneme Inventory

2.1.1 Vowels

Depending on the theoretical framework, Mandarin is said to have between four and six single vowels. The four-vowel system is based on feature geometry and was proposed by Wu (1994) (Fig. 2.1a), while the six-vowel system is based on phonemic evidence and was proposed by C.-C. Cheng (1973) (Fig. 2.1c). Most other scholars have argued for a five-vowel system, with an inventory of three high vowels, one mid-vowel, and one low vowel (Chao, 1968; R. L. Cheng, 1966; Duanmu, 2007; Y.-H. Lin, 1989; Wan & Jaeger, 2003), as shown in Fig. 2.1b. As all three proposals include largely overlapping sets of vowels, this chapter follows the majority view and reviews only the five-vowel system in detail. The three high vowels in Mandarin can be nicely demonstrated by a minimal triplet, as shown in (1).¹ These three phonemes behave in a rather similar fashion: each becomes a homorganic glide when it precedes a non-high vowel, as shown in (2). [i] and [u] can also form the second part of a diphthong and act as an off-glide, as shown in (3).

¹ Hanyu Pinyin is adopted for Mandarin romanization throughout this chapter.


Fig. 2.2 Waveforms and spectrograms of the three high vowels, yí /i/ ‘aunt’, yú /y/ ‘fish’, and wú /u/ ‘nil’. The first three formants are labeled accordingly

Figure 2.2 shows the waveforms and spectrograms of the three high vowels, yí /i/ ‘aunt’, yú /y/ ‘fish’, and wú /u/ ‘nil’, produced by a female speaker. Notice that the acoustic cues correspond nicely with the articulatory descriptions of the vowels. All three vowels are high, and therefore their F1s are uniformly low. /i/ and /y/ are both front vowels, while /u/ is a back vowel, and therefore the former two have higher F2s than the latter. /y/ and /u/ are both rounded vowels and therefore have relatively low F3s (and F2s) compared to the unrounded /i/. The high vowel /i/ is worth further mention. Besides being realized as a high front vowel and a glide, it also has a third allophone [ɨ] that occurs after the dental and retroflex sibilants [ts tsʰ s tʂ tʂʰ ʂ ʐ] (see below) (C.-C. Cheng, 1973). (4) contains a near-minimal triplet of [i] and [ɨ],² and their waveforms and spectrograms are shown in Fig. 2.3. Compared to their counterpart [i] in tì [ti] ‘ground’, the F2s in zì [tsɨ] ‘Chinese character’ and zhì [tʂɨ] ‘mole (face)’ are much lower, indicating a backed tongue position.

² The phone has been problematic in Mandarin phonology and has invited much discussion and debate. Classical views usually argued for two phones instead of one. For example, Norman (1988) proposed two apical vowels, [ɿ] and [ʅ]; Lee-Kim (2014) argued for two syllabic approximants, a dental [ɹ̩] and a retroflex [ɻ̩]; and Duanmu (2007) suggested two syllabic fricatives, [z̩] and [ʐ̩]. The former of each pair occurs after dental sibilants, while the latter occurs after retroflexes. However, since the formants in Fig. 2.3 show little frication, implying that the phone is at least not always realized as a syllabic fricative, and since the major difference between its two renditions (zì [tsɨ] ‘Chinese character’ vs. zhì [tʂɨ] ‘mole (face)’) mainly lies in F3, not F2, implying little change in tongue backness, this chapter adopts the unifying symbol [ɨ] instead, following C.-C. Cheng (1973), and views the phonetic variation as a product of coarticulation. There is a slight departure from C.-C. Cheng’s (1973) original proposal, however, as this chapter argues for a nonphonemic status instead (see Fig. 2.1).


Fig. 2.3 Waveforms and spectrograms comparing the two allophonic variants of /i/ in tì [ti] ‘ground’, zì [tsɨ] ‘Chinese character’, and zhì [tʂɨ] ‘mole (face)’. The first three formants are labeled in white

Mandarin has only one mid-vowel, but its frontness and rounding show environmentally conditioned allophonic changes while its height stays the same. Most previous works agree that there are at least four variants, [ə], [e], [o], and [ɤ] (C.-C. Cheng, 1973; R. L. Cheng, 1966; Y.-H. Lin, 1989). [ə] occurs in CVN syllables and is considered the default (R. L. Cheng, 1966; Y.-H. Lin, 1989; Wan & Jaeger, 2003; Wu, 1994). [e] occurs in CjV and CɥV syllables, while [o] occurs in CwV syllables. Finally, [ɤ] occurs in CV syllables. (5) shows a set of examples.
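The conditioning environments for the mid-vowel can be summarized as a small lookup over syllable templates. The following sketch is illustrative only: the template notation and function name are not from the chapter, and real syllabification is of course richer than four template strings.

```python
# Sketch: allophone selection for the Mandarin mid-vowel /ə/, following the
# environments described above. Syllable shapes are given as template strings;
# this notation and the function name are illustrative, not from the chapter.

def mid_vowel_allophone(shape: str) -> str:
    """Return the surface allophone of /ə/ for a syllable template.

    Templates: 'CVN' (closed by a nasal), 'CjV'/'CɥV' (front glide),
    'CwV' (back glide), 'CV' (bare open syllable).
    """
    if shape == "CVN":            # default, before a nasal coda
        return "ə"
    if shape in ("CjV", "CɥV"):   # fronted after a front glide
        return "e"
    if shape == "CwV":            # rounded after /w/
        return "o"
    if shape == "CV":             # backed, unrounded in bare open syllables
        return "ɤ"
    raise ValueError(f"unknown syllable template: {shape}")

# Examples from (5): lèng [ləŋ], liè [lje], luò [lwo], lè [lɤ]
for shape in ("CVN", "CjV", "CwV", "CV"):
    print(shape, "→", mid_vowel_allophone(shape))
```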

Figure 2.4 shows the waveforms and spectrograms of the four allophonic variants of the mid-vowel /ə/ in lèng [ləŋ] ‘absent-minded’, liè [lje] ‘to crack’, luò [lwo] ‘to fall’, and lè [lɤ] ‘happy’.

Fig. 2.4 Waveforms and spectrograms of the four allophonic variants of the mid-vowel /ə/ in lèng [ləŋ] ‘absent-minded’, liè [lje] ‘to crack’, luò [lwo] ‘to fall’, and lè [lɤ] ‘happy’. The first two formants of the main vowels are labeled in black or white accordingly

Notice that both [e] and [ɤ] are mid-high vowels and therefore have similar F1s. [e] has a higher F2 than [ɤ] because it is a front vowel. [o] and [ɤ] have similar F1s and F2s because they are both mid-high back vowels, but the former has a slightly lower F2 due to rounding. Compared with [e] and [o], [ə] has a higher F1, showing that it is relatively low in height, and its F2 value is intermediate between those of [e] and [o], showing that it is central. Similarly, Mandarin has only one low vowel, whose frontness and height are context-dependent and show allophonic changes, while its unrounded quality stays the same. Most studies agree that there are at least three allophonic variants, [a], [ɑ], and [ɛ] (C.-C. Cheng, 1973; R. L. Cheng, 1966; Y.-H. Lin, 1989; Wu, 1994). [a] is the default and occurs in open and C(w)Vn syllables (R. L. Cheng, 1966; Y.-H. Lin, 1989; Wan & Jaeger, 2003). [ɑ] occurs in CVu and C(G)Vŋ syllables, and [ɛ] occurs in CjVn and CɥVn syllables. (6) is a set of examples.

(1.sg: first person singular pronoun)

Figure 2.5 shows the waveforms and spectrograms of the three allophonic variants of the low vowel /a/ in án [an] ‘1.sg’, áng [ɑŋ] ‘to raise high’, and yán [jɛn] ‘salt’. Notice that both [a] and [ɛ] are front vowels and therefore have higher F2s than the back vowel [ɑ]. [ɛ] also has a lower F1 than [a] and [ɑ], indicating its greater vowel height. There is an additional retroflex vowel /ɚ/ in Mandarin, which can only occur by itself in a bare V syllable (Duanmu, 2007). (7) shows a minimal pair of /ɚ/ versus /ə/. Figure 2.6 shows a spectrographic comparison between the two vowels. Notice that because of the rhotic quality of the vowel /ɚ/, its third formant is drastically lowered compared to that of the mid-vowel /ə/.

Fig. 2.5 Waveforms and spectrograms showing the three allophonic variants of the low vowel /a/, án [an] ‘1.sg’, áng [ɑŋ] ‘to raise high’, and yán [jɛn] ‘salt’. The first two formants of the main vowels are labeled in black or white accordingly


Fig. 2.6 Waveforms and spectrograms of the retroflex vowel ér /ɚ/ ‘son’ and the mid-vowel é /ə/ ‘goose’. The first three formants of the main vowels are labeled accordingly

Table 2.1 Mandarin consonant chart

            Labial    Dental    Retroflex   Velar
Stop        p  pʰ     t  tʰ                 k  kʰ
Affricate             ts tsʰ    tʂ tʂʰ
Fricative   f         s         ʂ  ʐ        x
Nasal       m         n                     ŋ
Lateral               l

2.1.2 Consonants

Mandarin has 19 consonants, including 6 stops, 4 affricates, 5 fricatives, 3 nasals, and 1 lateral (Chao, 1968; R. L. Cheng, 1966; Duanmu, 2007) (Table 2.1).³ The stops occupy three places of articulation, labial, dental, and velar, each with an aspirated and an unaspirated series. (8) shows a minimal sextuple. The waveforms and spectrograms of the sextuple are shown in Fig. 2.7.

³ Some researchers have contended that the Mandarin consonant inventory also includes an additional set of three alveolo-palatal sibilants /tɕ tɕʰ ɕ/ [e.g., Luo (1993)]. However, since these three are in complementary distribution with the velar series /k kʰ x/, the dental series /ts tsʰ s/, and the retroflexes /tʂ tʂʰ ʂ/, this chapter views them as allophones of other phonemes and discusses them in a later section.


Fig. 2.7 Waveforms and spectrograms of the six stops, bǔ /pu/ ‘to mend’, pǔ /pʰu/ ‘sheet music’, dǔ /tu/ ‘to gamble’, tǔ /tʰu/ ‘soil’, gǔ /ku/ ‘ancient’, and kǔ /kʰu/ ‘bitter’. The downward arrows above the spectrograms indicate the stop bursts, and the horizontal line segments beneath the spectrograms indicate the aspiration noise of the aspirated stops

There are two sets of affricates in Mandarin, the dental and the retroflex. Like their stop counterparts, both include an aspirated and an unaspirated series. (9) shows a minimal quadruple. The waveforms and spectrograms of the quadruple are shown in Fig. 2.8. Notice that the lowest major energy concentration in the frication portion of the affricates is lower for the retroflexes than for the dentals.

Fricatives form the largest manner class in the Mandarin inventory. In total, there are three sibilants and two nonsibilants. Of the three sibilants, one is a voiceless dental /s/, one a voiceless retroflex /ʂ/, and one a voiced retroflex /ʐ/. (10) shows a minimal triplet. The waveforms and spectrograms of the triplet are shown in Fig. 2.9. Like the affricates, the retroflexes have a lower major energy concentration than the dental.


Fig. 2.8 Waveforms and spectrograms of the four affricates, zàn /tsan/ ‘to praise’, càn /tsʰan/ ‘bright’, zhàn /tʂan/ ‘to stand’, and chàn /tʂʰan/ ‘to shiver’. The rightward arrows indicate the lowest major energy concentration for the noise burst of the affricates. The first three formants of the main vowels are labeled in white accordingly

Fig. 2.9 Waveforms and spectrograms of the three Mandarin sibilant fricatives, sǎn /san/ ‘umbrella’, shǎn /ʂan/ ‘to flash’, and rǎn /ʐan/ ‘to dye’. The rightward arrows indicate the lowest major energy concentration for the fricative noise. The first three formants of the main vowels are labeled in white accordingly

The two nonsibilant fricatives are labial /f/ and velar /x/. (11) shows a minimal pair. The waveforms and the spectrograms of the pair are shown in Fig. 2.10.

Mandarin has three nasals, /m n ŋ/, and one liquid /l/. /m n l/ can occur in the onset position, as indicated in (12). The waveforms and the spectrograms of the triplet are shown in Fig. 2.11. /n ŋ/ can occur in the coda position, as shown in (13).


Fig. 2.10 Waveforms and spectrograms of the two nonsibilant fricatives, fú /fu/ ‘blessing’ and hú /xu/ ‘lake’. The up–down arrows indicate the range of frequency noise for the two fricatives

Fig. 2.11 Waveforms and spectrograms of the three onset sonorants, mí /mi/ ‘mystery’, ní /ni/ ‘mud’, and lí /li/ ‘pear’. The horizontal line segments beneath the spectrograms indicate the sonorant sections

As illustrated in Fig. 2.12, the two nasals differ not only in nasal formants, but also in the F2 and F3 offsets of the preceding vowels. /n/ has both a falling F2 and a falling F3, while /ŋ/ has a rising F2 and a falling F3.

Fig. 2.12 Waveforms and spectrograms of the two nasal codas, pín /pʰin/ ‘poverty’ and píng /pʰiŋ/ ‘level’. The first three formants of the main vowels are labeled in white. The F2 and F3 offset tracings of the main vowels are also shown in white. The horizontal line segments beneath the spectrograms indicate the nasal sections

There is also a special set of alveolo-palatal sibilants [tɕ tɕʰ ɕ], which are in complementary distribution with the velars /k kʰ x/, the dentals /ts tsʰ s/, and the retroflexes /tʂ tʂʰ ʂ/. The alveolo-palatal set only occurs before /i y/, while the other three only occur elsewhere, as shown in (14). Diachronically, the alveolo-palatal set stems from two historical sources, the velars and the dentals. Synchronically, some scholars have identified it with the velar set [e.g., Chao (1968), R. L. Cheng (1966)], while others have identified it with the dental set [e.g., Hartman (1944)]. In reality, the evidence could go either way (Duanmu, 2007; Y.-H. Lin, 1989). Figure 2.13 shows the waveforms and the spectrograms of an alveolo-palatal triplet.

Fig. 2.13 Waveforms and spectrograms of the three alveolo-palatals, jǔ [tɕy] ‘to lift’, qǔ [tɕʰy] ‘to take’, and xǔ [ɕy] ‘to promise’. The downward arrows above the spectrograms indicate the stop bursts, and the horizontal line segments beneath the spectrograms indicate frication noise


2.1.3 Tones

There are four citation tones in Mandarin, traditionally labeled Tone 1 through Tone 4. Tone 1 is a high-level tone, Tone 2 a rising tone, Tone 3 a dipping (i.e., falling–rising) tone, and Tone 4 a falling tone (Chao, 1968). (15) shows an example of a minimal quartet, using the fairly common transcription convention of suffixing the tone number to the IPA transcription of the syllable (Duanmu, 2007). Figure 2.14 shows an example of the F0 tracks of the four tones. Although tones are realized as continuous pitch contours, different sections of the contours seem to carry different weight for different tones. Lee and Wiener (Chap. 3 of this volume) and Tsai and Liu (Chap. 10 of this volume) give a good overview of how these contours are perceived by novice (infants and nontone language users) and expert listeners (Mandarin adults).

Fig. 2.14 F0 tracks of the four tones in Mandarin on the syllable ying /iŋ/
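The four contours are conventionally described with Chao’s five-point pitch scale: Tone 1 = 55, Tone 2 = 35, Tone 3 = 214, Tone 4 = 51. As a toy illustration of how these schematic digit strings map onto the level/rising/dipping/falling labels, consider the following sketch; the tone-letter values are standard, but the classifier itself is purely illustrative.

```python
def classify_contour(points: list[int]) -> str:
    """Classify a schematic pitch contour on Chao's 1 (low) to 5 (high) scale.

    A toy heuristic for the four Mandarin citation tones only; real tone
    perception weighs contour sections differently (see text).
    """
    if min(points) < points[0] and points[-1] > min(points):
        # falls below the starting pitch and rises again, e.g., 2-1-4
        if points[0] > min(points):
            return "dipping"
    if all(p == points[0] for p in points):
        return "level"
    return "rising" if points[-1] > points[0] else "falling"

# Chao tone letters for the four citation tones
TONES = {1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1]}
for tone, contour in TONES.items():
    print(f"Tone {tone}: {contour} → {classify_contour(contour)}")
```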

When a word contains two consecutive Tone 3s, the first Tone 3 changes into a Tone 2, making the word homophonous with a Tone 2 + Tone 3 sequence. This is known as the Tone 3 sandhi rule. For example, mǎ-liǎn /ma3-lian3/ ‘oblong face (lit. “horse face”)’ becomes homophonous with má-liǎn /ma2-lian3/ ‘pockmarked face’ because the former undergoes the Tone 3 sandhi rule (16). Although the rule is classically posited as obligatory (Chao, 1968), its actual realization is far more complex and variable, as the rule has been shown to interact intricately with both linguistic factors such as word frequency (Yuan & Chen, 2014) and nonlinguistic factors such as speech rate (H.-B. Lin, 1982).
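The basic, context-free form of the rule can be sketched as a left-to-right rewrite over a tone sequence. This deliberately ignores the prosodic, frequency, and rate effects just mentioned, and the function name is illustrative:

```python
def tone3_sandhi(tones: list[int]) -> list[int]:
    """Apply the basic Mandarin Tone 3 sandhi rule: a Tone 3 immediately
    before another Tone 3 surfaces as Tone 2.

    Left-to-right scan over surface tones. Longer Tone 3 strings are far
    more variable in real speech (see text), so treat this as a sketch of
    the canonical rule only.
    """
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out

# mǎ-liǎn /ma3-lian3/ surfaces as Tone 2 + Tone 3,
# homophonous with má-liǎn /ma2-lian3/ 'pockmarked face'
print(tone3_sandhi([3, 3]))
```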

Not all syllables bear one of the four tones mentioned above; only stressed syllables do. Unstressed syllables carry the neutral tone (i.e., Tone 0) instead. (17) shows an example of a minimal pair: by destressing the second syllable, the meaning changes from ‘wife and son’ to ‘wife’ only. Traditionally, the neutral tone is considered a tonal category without an intrinsic tonal value of its own, its realization depending largely on the preceding tone (Chao, 1968). However, Chen and Xu (2006) claimed that the neutral tone has a mid-level tonal target but is qualitatively different from non-neutral tones in that the articulatory strength for reaching the target is relatively weak and inefficient, so the speed at which the target is approached is relatively slow. As a consequence, neutral tones are more susceptible to the coarticulatory forces of surrounding tones. In either case, neutral tones are realized with shorter duration and weaker amplitude, as shown in Fig. 2.15.

Fig. 2.15 Waveforms and F0 contours for the minimal pair qī-zǐ [tɕʰi1-tsɨ3] ‘wife and son’ and qī-zi [tɕʰi1-tsɨ0] ‘wife’


2.1.4 Syllable Structure

Mandarin syllables have a maximal form of CGVX plus tone (Duanmu, 2007). Except for the nucleus vowel slot, all the others are optional. VX constitutes the rime, and the X slot can be filled either by the off-glide of a diphthong or by a nasal. The status of G is more controversial. Traditionally, it is analyzed as part of the final, which includes a medial slot reserved for G, plus the rime (R. L. Cheng, 1966) (Fig. 2.16a). More recently, however, Duanmu (2007) claimed that G should be grouped with C to form the onset, as it adds a secondary articulation to C when both are present (Fig. 2.16b).

Fig. 2.16 Two different views of the Mandarin syllable structure. Please see text for explanation

Figure 2.17 shows a spectrographic comparison between Mandarin suì [sʷei] ‘to shatter’ and English sway [sweɪ]. Notice that the major energy concentration for [s] in suì is much lower than that for [s] in sway, as a result of greater lip protrusion in the former. In addition, there is a clearer (and longer) voiced portion for [w] in sway than in suì. These crosslinguistic differences imply that Mandarin glides are qualitatively different from English glides when they occur in consonant clusters and might be more adequately characterized as a secondary articulation on the leading consonant than as an independent phone.

Fig. 2.17 Waveforms and spectrograms for Mandarin suì [sʷei] ‘to shatter’ (left) and English sway [sweɪ] (right). The rightward arrows indicate the lowest major energy concentration for the fricative noise of [s]. The horizontal line segments beneath the spectrograms indicate the section for the glide [w]. Please see text for explanation

As there are 18 consonants possible for the onset C slot (i.e., all consonants except /ŋ/, see Table 2.1), 3 glides possible for the G slot, 5 vowels possible for the V slot (excluding the retroflex vowel /ɚ/, see Fig. 2.1b), and 4 sounds (/i, u, n, ŋ/) possible for the coda X slot, and everything except the V slot is optional, there could theoretically be 19 (C slot) × 4 (G slot) × 5 (V slot) × 5 (X slot) = 1900 syllables possible in Mandarin, disregarding tones. In reality, however, only slightly more than 400 syllables exist in the inventory (Duanmu, 2007). Even when tones are included, the number only rises to about 1300, far fewer than the number of monosyllables in English. Although this inevitably results in a large number of homophonous monosyllabic morphemes, homophony is not a prevalent issue in daily communication, as monosyllabic words constitute only about 27% of the lexicon, and the majority (≈ 70%) is in fact bisyllabic (He & Li, 1987). If contextual information is further taken into consideration, miscommunication due to homophony is virtually nonexistent.

In summary, Mandarin is much like many East Asian languages, with a moderate segment inventory and a relatively simple syllable makeup, both of which seem to superficially imply a smooth and easy acquisition process. However, the language is phonologically marked in several respects. First, it is a tone language that contains both register and contour tones, along with complicated tone sandhi realization rules (Duanmu, 2007; H.-B. Lin, 1982; Yuan & Chen, 2014). As most of the better-studied Indo-European languages are non-tonal (Maddieson, 2013), Mandarin can provide a better understanding of the role of tone and how it interacts with other syllabic components in the process of acquisition. Secondly, being a tone language, Mandarin shows special constraints on its implementation of prosody, as the acoustic correlates of prosody and tone overlap to a large extent (Peng et al., 2005). How the interaction of the two affects the acquisition process compared to non-tonal languages is an interesting issue both in theory and in application. Finally, although there is nothing exceptional about the size of the Mandarin consonant inventory, it contains a special set of retroflex sibilants (i.e., /tʂ tʂʰ ʂ ʐ/) that is relatively rare among the languages of the world.
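As a quick check on the syllable arithmetic above, the theoretical CGVX ceiling can be computed directly. The slot sizes follow the text; each optional slot contributes one extra “empty” choice. This is only an illustrative sketch of the counting logic:

```python
# Theoretical ceiling on Mandarin CGVX syllable shapes, per the slot
# counts given in the text. Optional slots (C, G, X) each add 1 for the
# 'empty' realization; only the nucleus V is obligatory.
onsets = 18   # all consonants except /ŋ/ (Table 2.1)
glides = 3    # /j w ɥ/
vowels = 5    # five-vowel system, excluding /ɚ/ (Fig. 2.1b)
codas = 4    # /i u n ŋ/

total = (onsets + 1) * (glides + 1) * vowels * (codas + 1)
print(total)  # 1900 theoretically possible shapes, disregarding tones

# In reality only slightly more than 400 syllables occur, and even with
# tones included the count rises only to about 1300 (Duanmu, 2007).
```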
According to the UCLA Phonological Segment Inventory Database (Maddieson, 1984), of the 451 languages investigated, only 30 incorporate at least one retroflex sibilant in their inventory, accounting for about 7%. If one considers only languages that, like Mandarin, have at least four retroflex sibilants, the number drops drastically to 6, a mere 1%. Thus, studying Mandarin could also provide new insights into how crosslinguistically common and uncommon sounds are acquired in the first years of life.

2.2 Variations in Phoneme Realization

As Mandarin is spoken by a vast population, variation in its phonological implementation is understandably inevitable. Since such variation likely adds further complications to the already formidable task facing young learners, a description of the phonological system would be incomplete without it. As Singapore gained its independence from Malaysia in 1965, and Taiwan has been politically separated from China for nearly 70 years, it is not surprising that many of the noted variations in Mandarin phonology are found among the three standard Mandarin varieties. However, more recent studies also focus on within-variety variation in Taiwan due to differential degrees of language contact with Southern Min, a local substrate with which about 70% of the Taiwan population has at least some familiarity (Huang, 1993). In this section, all three types of variation, in vowels, consonants, and tones, are delineated.

2.2.1 Vowels

One aspect of vowels that shows drastic dialectal variation is the retroflexed vowels of the Mainland variety. The retroflex vowel /ɚ/ can occur as a suffix signaling diminutiveness or serving certain stylistic purposes. It assumes the coda position and often completely replaces the original coda of the host syllable, if any (Duanmu, 2007). (18) shows an example of a minimal pair. While the non-retroflexed zhè can mean both ‘this’ and ‘here’, the retroflexed zhèr is used only to indicate ‘here’. When a syllable is retroflexed, the vowel assumes a rhotic quality and its F3 is substantially lowered, as shown in Fig. 2.18. Although retroflexed vowels are quite common in the Mainland variety, they are virtually nonexistent in Singapore (Chew, 2002) and Taiwan. In our Taiwan Mandarin spontaneous speech corpus (Fon, 2004), which currently includes more than 50 h of transcribed monologues, not a single instance of a retroflexed vowel is found.⁴ In other words, one can probably safely say that this is a vocalic trait exclusive to the Mainland variety.

⁴ The speakers in the Taiwan Mandarin corpus included both young (ages 20–35) and old speakers (ages 50–65) who spent all or most of their childhood and teenage years in the designated sampling locations.


Fig. 2.18 Waveforms and spectrograms for zhè /tʂə/ ‘this; here’ and zhèr /tʂəɚ/ ‘here’. The first three formants are labeled accordingly

2.2.2 Consonants

Dialectal variations are even more prevalent in consonants. In this section, two types of consonant variation are discussed. One is the realization of dental and retroflex sibilants, and the other concerns the syllable-final nasal mergers.

2.2.2.1 Dental and Retroflex Sibilants

Sibilant realization is extremely variable in Taiwan (Chuang, 2009; Chuang, Chiang, & Fon, 2012; Chuang, Wang, & Fon, 2015) and Singapore Mandarin (Ng, 1985) due to frequent contact with local language substrates. Two main processes have been documented. One realizes the retroflex sibilants as dentals and is appropriately termed the deretroflexion rule; the other is its reverse, realizing the dental sibilants as retroflexes. Since the latter is viewed as a way for speakers to counteract the prevalence of deretroflexion, it is usually termed the hypercorrection rule and is more commonly observed in formal styles (Chuang, 2009; Chung, 2006; Ng, 1985). As shown in (19), the two sets of rules between voiceless dentals and voiceless retroflexes are fairly straightforward, as they are the exact opposite of each other. However, the realization of the voiced retroflex /ʐ/ is more variable. As it lacks a phonological dental counterpart (cf. Table 2.1), previous studies have shown that its most common deretroflexed realization is [l], with [z] only the second most common (Chan, 1984; Chuang et al., 2015).
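The two rules in (19) amount to a mapping between the retroflex and dental series, with /ʐ/ as the asymmetric case. A minimal sketch of that mapping (dictionary and function names are illustrative, not from the chapter):

```python
# Sketch of the sibilant rules in (19). Deretroflexion sends retroflexes
# to dentals; hypercorrection is its reverse for the voiceless pairs.
# /ʐ/ has no dental counterpart, so its most common deretroflexed output
# is [l] ([z] is second most common); it has no hypercorrection source.
DERETROFLEXION = {"tʂ": "ts", "tʂʰ": "tsʰ", "ʂ": "s", "ʐ": "l"}
HYPERCORRECTION = {dental: retro for retro, dental in DERETROFLEXION.items()
                   if dental != "l"}

def deretroflex(onset: str) -> str:
    """Apply deretroflexion to a single onset; non-sibilants pass through."""
    return DERETROFLEXION.get(onset, onset)

def hypercorrect(onset: str) -> str:
    """Apply hypercorrection to a single onset; others pass through."""
    return HYPERCORRECTION.get(onset, onset)

print(deretroflex("ʂ"), hypercorrect("s"), deretroflex("ʐ"))  # s ʂ l
```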


Gender and genre are effective factors in determining the application of the rules (Chuang et al., 2012; Ng, 1985). Deretroflexion is more likely among males than females, in spontaneous than read speech, and in connected speech than word lists. The trend is observed in both Taiwan and Singapore Mandarin (Fig. 2.19). More dialectal variation, on the other hand, is found for hypercorrection. Although Singapore Mandarin still shows both a gender and a genre effect, with male speakers more likely to apply hypercorrection than females and spontaneous speech the least likely genre to facilitate the rule, similar trends are not observed in Taiwan Mandarin, whose hypercorrection rates are uniformly low (Fig. 2.19). In fact, both rules are far more prevalent in Singapore Mandarin than in Taiwan Mandarin. Deretroflexion in Singapore Mandarin is nearly complete in male spontaneous speech, and its hypercorrection reaches almost 60% in male tongue twisters. In contrast, deretroflexion in Taiwan Mandarin is at best less than 40% in male spontaneous speech, and all hypercorrection rates, regardless of gender and genre, hover around a mere 5%.

This difference may be due to differential attitudes towards the rules in the two varieties. Ng (1985) claimed that Singapore Mandarin speakers generally deem retroflex realization undesirable, as they often associate it with foreigner speech and snobbishness; most Singaporean speakers feel more at home with a variety that reveals local identity. On the other hand, most speakers from Taiwan regard deretroflexion as a stigma of being nonstandard (Chan, 1984; Chuang et al., 2012; Kubler, 1985a, 1985b) and would thus avoid it, at least consciously, when the situation arises. Since hypercorrection is meant to counteract deretroflexion, it is possible that Singaporeans find it more necessary in formal contexts than Taiwan speakers do, as the former show a much higher deretroflexion rate than the latter.
Fig. 2.19 Realization rates of deretroflexion /ʂ/ → [s] and hypercorrection /s/ → [ʂ] in a Taiwan (Chuang et al., 2012) and b Singapore Mandarin (Ng, 1985). TT: tongue twisters; MP: minimal pairs; WL: word lists; CS: connected speech; and SP: spontaneous speech

At least in the Taiwan variety, sibilant realization is also affected by prominence (Chuang, 2009; Chuang & Fon, 2010). As shown in Fig. 2.20, sibilants in more prominent positions are realized with a larger degree of dentalization. This is true for sibilants of both dental and retroflex place, and for both original and derived sibilants (i.e., those that underwent deretroflexion/hypercorrection). This implies that retroflex sibilants in a prosodically prominent position can be acoustically similar to dental sibilants in a prosodically weaker position, and perceptual ambiguity might arise when context is unavailable. The figure also shows a gender difference in the maintenance of the two sibilant places. Female speakers in general maintain a more distinct phoneme space for the dental and retroflex series, and similar levels of distinction hold for both original and derived sibilants. Male speakers, on the other hand, not only show less distinction between original dentals and retroflexes and between derived dentals and retroflexes, but their derived sibilants are also not as close to the intended targets as those of their female counterparts. This indicates that although male Taiwan Mandarin speakers may not show as much overt mixing between retroflex and dental sibilants at the phonemic level as their Singaporean counterparts (cf. Fig. 2.19), their phonetic realization of the sibilants shows that the two series are becoming more like each other.


Fig. 2.20 Effect of prominence on the realization of voiceless sibilants. The center of gravity (COG), which is negatively correlated with the degree of retroflexion, was taken from the middle 10 ms of the fricative spectrum for all six voiceless sibilants /tʂ tʂʰ ʂ ts tsʰ s/ in a spontaneous speech corpus (Chuang, 2009; Chuang & Fon, 2010). Prominence was defined using the stress tier in the pan-Mandarin ToBI system (Peng et al., 2005). S2 is the default level, S3 shows extra prominence, and S1 indicates unstressed positions. Retroflex: retroflex sibilants that are underlyingly retroflex, /tʂ tʂʰ ʂ/ → [tʂ tʂʰ ʂ]; dental: dental sibilants that are underlyingly dental, /ts tsʰ s/ → [ts tsʰ s]; retroflexed dental: dental sibilants that underwent hypercorrection, /ts tsʰ s/ → [tʂ tʂʰ ʂ]; deretroflexed retroflex: retroflex sibilants that underwent deretroflexion, /tʂ tʂʰ ʂ/ → [ts tsʰ s]. There were no unstressed male tokens of the retroflexed dental type in the data collected

2.2.2.2 Syllable-Final Nasal Mergers

The two Mandarin coda nasals /n/ and /ŋ/ can occur after all five vowels. However, they are found to merge with each other after /i/ and /ə/ (C.-Y. Chen, 1991; Fon, Hung, Huang, & Hsu, 2011; Yang, 2010). As shown in (20), merging after /i/ is bidirectional, with both /in/ → [iŋ] and /iŋ/ → [in] observed, while merging after /ə/ is unidirectional, with only /əŋ/ → [ən] found.
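The three merger directions in (20) can be stated as rime rewrites; which of them a speaker actually applies depends on variety and speaker group. A minimal sketch (data structure and function names are illustrative):

```python
# The three attested syllable-final nasal mergers from (20), stated as
# (source rime, merged rime) pairs. /in/ and /iŋ/ merge bidirectionally;
# /əŋ/ → [ən] is unidirectional, and no /ən/ → [əŋ] merger is attested.
MERGERS = [("in", "iŋ"), ("iŋ", "in"), ("əŋ", "ən")]

def possible_realizations(rime: str) -> set[str]:
    """Return the rime itself plus any merged surface variants.

    Which mergers a given speaker applies varies by variety (see text);
    this just enumerates the attested possibilities.
    """
    out = {rime}
    out.update(target for source, target in MERGERS if source == rime)
    return out

print(sorted(possible_realizations("in")))   # ['in', 'iŋ']
print(sorted(possible_realizations("ən")))   # ['ən'] – no reverse merger
```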


The three nasal mergers are not equally favored in the three major Mandarin varieties. As indicated in Fig. 2.21a, Singapore Mandarin predominantly uses /in/ → [iŋ] and /əŋ/ → [ən], and few instances of /iŋ/ → [in] are found (C.-Y. Chen, 1991). Mainland Mandarin, on the other hand, mainly adopts /in/ → [iŋ] and /iŋ/ → [in], while /əŋ/ → [ən] is rarely observed (Yang, 2010). Studies disagree on how the nasal mergers work in Taiwan Mandarin. C.-Y. Chen (1991) claimed that it uses /in/ → [iŋ] and /əŋ/ → [ən], but not /iŋ/ → [in], while Yang (2010) argued that it uses /əŋ/ → [ən] and /iŋ/ → [in], but not /in/ → [iŋ]. Although this controversy seems intriguing, Fon et al. (2011) showed that the discrepancy is likely due to dialectal variation within the variety. As indicated in Fig. 2.21b, northern Taiwan speakers predominantly use /in/ → [iŋ] and /əŋ/ → [ən], much as C.-Y. Chen (1991) found, while southern Taiwan speakers can potentially use all three, which at least partially corroborates Yang’s (2010) findings.

Fig. 2.21 Nasal mergers found in a Taiwan, Singapore, and China (C.-Y. Chen, 1991; Yang, 2010), and in b northern and southern Taiwan (Fon et al., 2011)

Like the realization of voiceless sibilants, the application of the nasal mergers can be context-dependent, at least in Taiwan Mandarin (Fon et al., 2011). As shown in Fig. 2.22, although context exerts little effect on /in/ → [iŋ] among northerners and southern females, the effect is more robust for southern males, who tend to apply the merger more in a sentential context. On the other hand, /əŋ/ → [ən] shows the opposite trend: the context effect is more robust for northerners and southern females than for southern males. As Taiwan Mandarin speakers generally regard /in/ → [iŋ] as more prestigious than /əŋ/ → [ən] and /iŋ/ → [in], this implies that the application of the nasal mergers is likely not only context-dependent, but also prestige- and therefore speaker-group-dependent, at least in the Taiwan variety.

Fig. 2.22 Effect of context on the application of nasal mergers in Taiwan Mandarin. Solid bars represent target syllables in sentence-final positions, and dashed bars represent syllables in isolation (Fon et al., 2011)

2.2.3 Tones

Dialectal variations are also found in the realization of tones. This section focuses on two aspects of tonal variation: the realization of neutral tones, and that of the ‘fifth tone’ in Singapore Mandarin.

J. Fon

2.2.3.1 Realization of Neutral Tones

As mentioned above, Chao (1968) claimed that the actual pitch height of a neutral tone is largely dependent on the tonal value of the preceding syllable. It is realized the highest in pitch when preceded by a Tone 3, and the lowest when preceded by a Tone 4. Neutral tones following Tone 1 and Tone 2 are realized with mid pitch. According to Chao (1968), neutral tones are more susceptible to coarticulatory forces due to their lack of intrinsic tonal values and thus readily adopt the tonal values of their neighboring tones. However, Chao’s (1968) account might not be viable in some varieties of Mandarin, as neutral tone realization seems to show dialect-dependent variation (Fig. 2.23). Although much variance is indeed found across different tonal contexts in the Mainland variety, neutral tones in Taiwan Mandarin are realized rather uniformly as mid-falling, with little effect from the preceding syllable being observed. This implies that the neutral tone in Taiwan Mandarin is qualitatively different from its Mainland counterpart, and may have an intrinsic tonal value of its own, namely a mid-fall. Even if Chen and Xu’s (2006) account of neutral tones is adopted, a defining difference between the neutral tones of the two dialects still exists. As mentioned earlier, Chen and Xu (2006) claimed that the fundamental difference between a neutral and a regular tone lies not in the existence of an intrinsic tonal value, but rather in the articulatory strength exerted in reaching the tonal target. This may be an accurate description of neutral tones in Mainland Mandarin, but the consistent mid-fall realization of neutral tones in the Taiwan variety shows that there is no fundamental difference in realization strength between a neutral and a regular tone in that dialect.
In other words, the neutral tone in Taiwan Mandarin might be more aptly termed a ‘fifth’ tone, as it shares all the characteristics of a regular tone except for being intrinsically short. On the other hand, the neutral tone in the Mainland variety still maintains a qualitative distinction from the regular tones and thus still fits the current categorization of being ‘neutral’.

Fig. 2.23 Realization of neutral tones after a minimal quartet ting-zhe in Taiwan and Mainland Mandarin. Tone 1: tīng-zhe /tiŋ1-tʂə0/ ‘listen asp (listening)’; Tone 2: tíng-zhe /tiŋ2-tʂə0/ ‘rest asp (resting)’; Tone 3: tǐng-zhe /tiŋ3-tʂə0/ ‘support asp (supporting)’; Tone 4: tìng-zhe /tiŋ4-tʂə0/ ‘to allow (one to act arbitrarily) asp (allowing)’ (asp: aspect marker.)


Fig. 2.24 Realization of the ‘fifth tone’ in Singapore Mandarin with regard to its tonal category in Modern Mandarin and its original entering tone endings (i.e., -p, -t, -k) in Middle Chinese. Data from C.-Y. Chen (1983)

2.2.3.2 ‘Fifth Tone’ in Singapore Mandarin

Singapore speakers also tend to have an additional ‘fifth tone’ in their Mandarin, but the source of this fifth tone is completely different from the one suggested above for Taiwan Mandarin. According to C.-Y. Chen (1983), the fifth tone in Singapore Mandarin is characterized by a falling contour, resembling Mandarin Tone 4, but is often accompanied by a final glottal stop. When the final stop is present, the tone is short and intense; otherwise, the tone is barely distinguishable from a regular Mandarin Tone 4. The fifth tone only occurs in words that etymologically had an entering tone in Middle Chinese.5 Since the majority of Singaporean Chinese speak at least one southern Chinese language natively and acquire Mandarin only after they enter the school system, C.-Y. Chen (1983) argues that the fifth tone might have stemmed from negative transfer from the southern Chinese languages in Singapore, which still preserve entering tones in their phonology. Figure 2.24 shows that the actual realization of the fifth tone is very much dependent on tonal category and etymology. Tone 1 shows the highest realization rate, at over 80%, followed by Tone 2. For Tone 3 syllables, only those that had a -p ending in Middle Chinese show a high realization rate of the fifth tone. For syllables that previously ended in -t or -k in Middle Chinese, the realization rate is fairly low. The variations mentioned above are by no means exhaustive. Rather, they merely serve as an illustration of the wide variety of Mandarin that Mandarin-speaking children might encounter during their acquisition process. Some

5 An entering tone is a tone that occurs on syllables ending with /p t k ʔ/.


of the variations discussed are regionally distinct, as in the cases of the retroflex vowel suffix in the Mainland variety, the different ways of realizing neutral tones in Taiwan and Mainland Mandarin, and the fifth tone in Singapore Mandarin. From an acquisition point of view, this might not be as problematic, as long as Mandarin-learning children from different regions are considered and studied separately, and the data procured are analyzed within their respective regional contexts. However, for variations that occur within a variety, such as the realization of voiceless sibilants in Taiwan and Singapore Mandarin, and the syllable-final nasal mergers in Taiwan Mandarin, the situation might be much more complex. In order to master the system, children not only have to acquire the variable phonological rules, but also the appropriate occasions for applying them. Take the deretroflexion rule, for example: children acquiring Mandarin in Taiwan and Singapore, in addition to becoming proficient in producing the /tʂ tʂʰ ʂ ʐ/ set, also need to be familiar with the differential effect of various genres in order to determine whether the deretroflexed set should be used instead. Children acquiring Taiwan Mandarin need to carry this one step further and take note of the prosodic structure in which the voiceless sibilants occur, so as to accurately determine the amount of deretroflexion to be applied in a native fashion. This is no doubt a fairly daunting task, and when placed in the context of first language acquisition, it raises several important issues. Although previous studies have shown that retroflexes are acquired relatively early in Mainland Mandarin (Li & Munson, 2016; Zhu & Dodd, 2000),6 the complication of involving deretroflexion in the realization of the retroflex set in Taiwan and Singapore might substantially delay its acquisition for Mandarin-learning children in these two locations.
These young learners not only need to master the articulation of the retroflex set, as their counterparts in China do, but also need to become proficient in applying the deretroflexion rule in contexts deemed adequate by adult natives at their respective sites in order to be considered successful in acquiring the full usage of the retroflex set. How retroflexes are acquired in Taiwan and Singapore in particular, and how variations across different varieties affect the acquisition process in general, are relevant and significant questions for research on Mandarin acquisition and are thus worth further investigation. Acquisition is a dynamic process that reflects not only the mentality of the learner (i.e., the child), but also the unique linguistic and social makeup of the learned (i.e., the language). This is especially true in the case of Mandarin, as its wide geographical span and diverse speaker backgrounds inevitably entail large variability on both sides, which interact intricately with each other, fermenting into distinctive regional flavors. This chapter thus provides not only a broad introduction to various aspects of Mandarin phonology, but also covers both the commonalities and differences across the three major varieties of Mandarin (i.e., Taiwan, Singapore, and China). Even though this might have created a picture that is fuzzier than ideal, it is

6 Zhu and Dodd (2000) reported that 75% of children between ages 2;1 and 2;6 could produce /tʂ tʂʰ ʂ/ accurately at least once, and that 75% of children reach this criterion between ages 3;1 and 3;6 for /ʐ/. Li and Munson (2016) reported about 75% accuracy for /ʂ/ at around 3;6.


closer to reality, and thus one hopes that it sets a more realistic perspective for understanding the results in the later chapters.

References

Chan, H.-C. (1984). The phonetic development of Mandarin / / in Taiwan: A sociolinguistic study. (MA), Fu Jen Catholic University, Taipei.
Chao, Y. R. (1968). A grammar of spoken Chinese. Berkeley: University of California Press.
Chen, C.-Y. (1983). A fifth tone in the Mandarin spoken in Singapore. Journal of Chinese Linguistics, 11(1), 92–119.
Chen, C.-Y. (1991). The nasal endings and retroflexed initials in Peking Mandarin: Instability and the trend of changes. Journal of Chinese Linguistics, 19(2), 139–171.
Chen, Y., & Xu, Y. (2006). Production of weak elements in speech: Evidence from F0 patterns of neutral tone in Standard Chinese. Phonetica, 63, 47–75.
Cheng, C.-C. (1973). A synchronic phonology of Mandarin Chinese. The Hague: De Gruyter Mouton.
Cheng, R. L. (1966). Mandarin phonological structure. Journal of Linguistics, 2(2), 135–158.
Chew, C. H. (2002). Xinjiapo Huayu Cihui yu Yufa [Lexicon and syntax in Singapore Mandarin]. Singapore: Lingzi Media.
Chuang, Y.-Y. (2009). An acoustic study on voiceless retroflex and dental sibilants in Taiwan Mandarin spontaneous speech. (M.A. thesis), National Taiwan University, Taipei.
Chuang, Y.-Y., Chiang, Y.-J., & Fon, J. (2012). The effects of context and Min dialect on the realizations of / / variants in Taiwan Mandarin. Paper presented at the 2nd Workshop on Sound Change, Kloster Seeon, Germany.
Chuang, Y.-Y., & Fon, J. (2010). The effect of prosodic prominence on the realizations of voiceless dental and retroflex sibilants in Taiwan Mandarin spontaneous speech. Paper presented at the 5th International Conference on Speech Prosody.
Chuang, Y.-Y., Wang, S.-F., & Fon, J. (2015). Cross-linguistic interaction between two voiced fricatives in Mandarin–Min simultaneous bilinguals. Paper presented at the International Congress of Phonetic Sciences, Glasgow, Scotland, U.K.
Chung, K. S. (2006). Hypercorrection in Taiwan Mandarin. Journal of Asian Pacific Communication, 16(2), 197–214.
Duanmu, S. (2007). The phonology of standard Chinese (2nd ed.). Oxford: Oxford University Press.
Fon, J. (2004). A preliminary construction of Taiwan Southern Min spontaneous speech corpus (NSC-92–2411-H-003–050-).
Fon, J., Hung, J.-M., Huang, Y.-H., & Hsu, H.-J. (2011). Dialectal variations on syllable-final nasal mergers in Taiwan Mandarin. Language and Linguistics, 12(2), 273–311.
Hartman, L. M. (1944). The segmental phonemes of the Peiping dialect. Language, 20(1), 28–42.
He, K., & Li, D. (1987). Xiandai Hanyu San Qian Changyong Ci Biao [Three thousand most commonly used words in modern Chinese]. Beijing: Beijing Shifan Daxue Chubanshe.
Huang, S. (1993). Yuyan, Shehui yu Zuqun Yishi: Taiwan yuyan shehuixue de yanjiu [Language, society, and ethnic identity: Studies in language sociology in Taiwan]. Taipei: Crane Publishing.
Kubler, C. C. (1985a). The development of Mandarin in Taiwan: A case study of language contact. Taipei: Student Book.
Kubler, C. C. (1985b). The influence of Southern Min on the Mandarin of Taiwan. Anthropological Linguistics, 27(2), 156–176.
Lee-Kim, S.-I. (2014). Revisiting Mandarin ‘apical vowels’: An articulatory and acoustic study. Journal of the International Phonetic Association, 44(3), 261–282.
Li, F., & Munson, B. (2016). The development of voiceless sibilant fricatives in Putonghua-speaking children. Journal of Speech and Hearing Research, 59(4), 699–712.


Lin, H.-B. (1982). Comparison of the differences in tone sandhi among slow speech, normal speech and fast speech in Mandarin Chinese. (M.A. thesis), The Ohio State University, Columbus, OH, USA.
Lin, Y.-H. (1989). Autosegmental treatment of segmental processes in Chinese phonology. (Ph.D. dissertation), University of Texas at Austin, Austin.
Luo, C.-C. (1993). Kuoyuxue [Studies of Mandarin Chinese]. Taipei: Wu-Nan.
Maddieson, I. (1984). The UCLA phonological segment inventory database. Retrieved 4 June 2015, from https://web.phonetik.uni-frankfurt.de/upsid.html.
Maddieson, I. (2013). Tone. In M. S. Dryer & M. Haspelmath (Eds.), The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
Ng, B. C. (1985). A study of the variable /sh/ in Singapore Mandarin. In D. Bradley (Ed.), Language policy, language planning and sociolinguistics in South-East Asia (pp. 31–37). Canberra, Australia: Pacific Linguistics.
Norman, J. (1988). Chinese. Cambridge: Cambridge University Press.
Peng, S.-H., Chan, M. K. M., Tseng, C.-Y., Huang, T., Lee, O. J., & Beckman, M. E. (2005). Towards a pan-Mandarin system for prosodic transcription. In S.-A. Jun (Ed.), Prosodic typology: The phonology of intonation and phrasing (pp. 230–270). New York: Oxford University Press.
Population Reference Bureau. (2013). 2013 World Population Data Sheet. Population Reference Bureau.
Wan, I.-P., & Jaeger, J. (2003). The phonological representation of Taiwan Mandarin vowels: A psycholinguistic study. Journal of East Asian Linguistics, 12(3), 205–257.
Wu, Y. (1994). Mandarin segmental phonology. (Ph.D. dissertation), University of Toronto, Toronto.
Yang, J. H.-T. (2010). Phonetic evidence for the nasal coda shift in Mandarin. Taiwan Journal of Linguistics, 8(1), 29–56.
Yuan, J., & Chen, Y. (2014). 3rd tone sandhi in Standard Chinese: A corpus approach. Journal of Chinese Linguistics, 42(1), 218–237.
Zhu, H., & Dodd, B. (2000). The phonological acquisition of Putonghua (Modern Standard Chinese). Journal of Child Language, 27(1), 3–42.

Chapter 3

Acoustic-Based and Knowledge-Based Processing of Mandarin Tones by Native and Non-native Speakers

Chao-Yang Lee and Seth Wiener

Abstract A fundamental issue in spoken language comprehension is how listeners process the acoustic signal to retrieve intended linguistic representations. This issue is discussed by reviewing selected studies on acoustic-based and knowledge-based processing of lexical tone in speech perception and spoken word recognition. Research on acoustic-based processing suggests that native listeners are able to use phonetic knowledge to compensate for compromised F0 information, whereas non-native listeners rely primarily on syllable-internal, canonical F0 information for tone identification. Research on knowledge-based processing shows that native listeners effectively track information such as a syllable–tone’s lexical status, the probability of syllable–tone co-occurrences, morpheme and word frequency, and the density of homophonous syllable–tone neighborhoods. Non-native listeners also show evidence of knowledge-based tone processing, although the difference between native and non-native listeners remains to be explored.

3.1 Introduction

Speech perception refers to the process in which listeners extract information from the acoustic signal and map that information onto some form of linguistic representation. Early work in speech perception focused on the discovery of acoustic correlates for consonant and vowel distinctions (Stevens & Hanson, 2010). Since the ultimate goal of speech perception is to map sound onto meaning, efforts have also been made to explicate the nature of lexical representation and processing (Luce & McLennan, 2005). The relative ease of identifying speech sounds and spoken words in daily life often obscures the complexity of the sound-to-meaning mapping process. For example, the same sound or word spoken by different talkers can be acoustically

C.-Y. Lee (B)
Division of Communication Sciences and Disorders, Ohio University, Athens, USA
e-mail: [email protected]
S. Wiener
Department of Modern Languages, Carnegie Mellon University, Pittsburgh, USA

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_3


different (Johnson, 2005). On the other hand, acoustically identical sounds or words can also be interpreted differently depending on the phonetic context (Ladefoged & Broadbent, 1957). These observations indicate that speech perception and spoken word recognition are shaped not only by the listener’s auditory ability, but also by the listener’s phonetic and linguistic knowledge. The overarching goal of research in speech perception and spoken word recognition, therefore, is to identify factors contributing to the mapping from the acoustic signal onto phonological and lexical representations. To that end, researchers have examined the nature of the acoustic signal, listener characteristics, and the sources of knowledge that contribute to the mapping process. Whereas previous work has largely investigated these questions with respect to segments (e.g., Johnson & Mullennix, 1997), less work has explored how listeners process variability at the suprasegmental level. In this chapter, we address this issue with respect to the processing of lexical tones in speech perception and spoken word recognition by native and non-native listeners. In lexical tone languages, tones are functionally analogous to consonant and vowel phonemes. Lexical tones differ from segmental phonemes, however, in that they involve a distinct set of acoustic correlates in speech perception and consequently are processed over a different time course during spoken word recognition. These differences suggest that conclusions drawn from segmental processing do not necessarily apply to lexical tone processing in speech perception and spoken word recognition. Moreover, since tone languages constitute the majority of known languages in the world (Laver, 1994), examining lexical tone processing can advance our knowledge of cross-linguistic aspects of speech perception and spoken word recognition. Investigating native and non-native speech perception also has important theoretical and practical implications.
Theoretically, since non-native listeners possess imperfect knowledge of the target language, performance by non-native listeners can reveal important insights about the relative contributions of auditory processing and linguistic knowledge to speech perception and spoken word recognition (Cutler, 2012). Practically, it is commonly reported that lexical tones are especially challenging for non-native speakers to acquire (e.g., Wang, Spence, Jongman, & Sereno, 1999; see also Ingvalson & Wong, Chap. 4 of this volume). Identifying factors relevant to the processing of lexical tone in speech perception and spoken word recognition thus has the potential to inform pedagogical approaches to tone language instruction. These considerations motivate the discussions in this chapter. In Sect. 3.2 we review basic facts about lexical tones, including their linguistic function and their acoustic and perceptual characteristics. In Sects. 3.3 and 3.4, we discuss how native and non-native listeners use two broad means of processing to overcome various sources of acoustic variability in speech perception. In the experimental literature these are often discussed as ‘bottom-up’ and ‘top-down’ processing (e.g., Marslen-Wilson & Welsh, 1978); here, however, we organize our review in terms of acoustic-based (Sect. 3.3) and knowledge-based (Sect. 3.4) processing of tone.


Fig. 3.1 F0 contours of the four Mandarin tones

3.2 Lexical Tones: Function, Acoustics, and Perception

In lexical tone languages, tones are functionally analogous to segments. That is, lexical tones can distinguish words just as segmental structure does. Ample research has established that fundamental frequency (F0) is the primary acoustic correlate of lexical tone (Howie, 1976). In Mandarin Chinese, for example, monosyllabic words can be distinguished by F0 variations over a syllable. As an example, the syllable ma with Tone 1 (high-level F0) means ‘mother’; ma with Tone 2 (mid-rising F0) means ‘hemp’; ma with Tone 3 (low-dipping F0) means ‘horse’; and ma with Tone 4 (high-falling F0) means ‘scorn.’ In addition to F0, duration and amplitude contour serve as secondary cues to tone perception (Blicher, Diehl, & Cohen, 1990; Whalen & Xu, 1992). Nonetheless, F0 height and direction are the main acoustic cues used during tonal categorization and discrimination. The weight assigned to F0 height and direction is dependent upon a listener’s language experience (Gandour, 1983). Figure 3.1 shows an example of the four tones, illustrating F0 change and duration differences.
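The four tone shapes described above are conventionally written as Chao tone numerals on a five-level scale (Tone 1 = 55, Tone 2 = 35, Tone 3 = 214, Tone 4 = 51). As a rough illustration of how such abstract levels relate to F0 in Hz, the sketch below maps Chao levels into a speaker's F0 range on a log scale. The tone numerals are the standard textbook values, but the speaker range (100–200 Hz) and the log-linear mapping are illustrative assumptions, not claims from this chapter.

```python
# Canonical Mandarin tone shapes as Chao tone numerals on a 1-5 scale.
TONE_NUMERALS = {1: [5, 5], 2: [3, 5], 3: [2, 1, 4], 4: [5, 1]}

def chao_to_hz(level, f_min, f_max):
    """Map a Chao level (1 = bottom, 5 = top) into [f_min, f_max] on a log scale."""
    return f_min * (f_max / f_min) ** ((level - 1) / 4)

def tone_contour(tone, f_min=100.0, f_max=200.0, n=5):
    """Sample an n-point F0 contour by interpolating the tone's Chao levels."""
    pts = TONE_NUMERALS[tone]
    contour = []
    for i in range(n):
        pos = i * (len(pts) - 1) / (n - 1)   # fractional index into pts
        lo = int(pos)
        hi = min(lo + 1, len(pts) - 1)
        level = pts[lo] + (pos - lo) * (pts[hi] - pts[lo])
        contour.append(round(chao_to_hz(level, f_min, f_max), 1))
    return contour

print(tone_contour(1))  # [200.0, 200.0, 200.0, 200.0, 200.0] -- high level
print(tone_contour(4))  # [200.0, 168.2, 141.4, 118.9, 100.0] -- high falling
```

Real productions, of course, vary with speaker, context, and the secondary cues (duration, amplitude contour) discussed above; the sketch only makes the scale-to-Hz arithmetic concrete.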

3.3 Acoustic-Based Tone Processing

Previous research on lexical tones has primarily identified the acoustic correlates for specific tonal contrasts. However, relatively little is known about the effects of acoustic variability or adverse conditions on tone perception. As noted in the introduction, the primary puzzle in speech perception and spoken word recognition is how listeners accomplish perceptual constancy in the face of acoustic variability. In


particular, speech communication usually takes place in listening conditions that are less than optimal. Examining the effects of acoustic variability arising from adverse conditions can therefore shed light on the nature of speech perception (Guediche, Blumstein, Fiez, & Holt, 2013; Mattys, Davis, Bradlow, & Scott, 2012). Similarly, investigating the impact of acoustic variability on non-native speech perception further elucidates how linguistic knowledge affects speech perception (Lecumberri, Cooke, & Cutler, 2010). The specific question asked in this chapter is whether acoustic variability affects native and non-native tone perception in the same way or differently. Below we discuss selected studies on Mandarin tone perception to examine how native and non-native listeners with different levels of proficiency deal with various sources of acoustic variability in lexical tone perception.

3.3.1 Fragmented F0 Input

As noted earlier, F0 is the primary acoustic correlate of lexical tones. That means eliminating or reducing the amount of F0 information is likely to compromise lexical tone identification. However, there is evidence showing that native listeners are quite good at identifying tones from stimuli devoid of F0 information. Whalen and Xu (1992) manipulated Mandarin syllables such that F0 was removed but amplitude contour and duration were retained. Listeners were able to identify the majority of the tones, suggesting the use of duration and amplitude contours for tone perception. Liu and Samuel (2004) similarly showed that tone identification remained robust when F0 was neutralized with signal processing or whispered speech. Studies using the gating paradigm (Grosjean, 1980), where a stimulus is truncated systematically to manipulate the amount of acoustic information available to listeners, also showed that isolated Mandarin tones could be identified with less than half of a syllable (Lee, 2000). These findings demonstrate that native listeners are capable of using secondary acoustic cues to identify tones when F0 information is not available or substantially reduced. A series of studies employing the silent-center syllable paradigm (Strange, Jenkins, & Johnson, 1983) further demonstrated that listeners are able to use fragmented F0 information to retrieve lexical tones. Gottfried and Suiter (1997) constructed four types of Mandarin consonant–vowel syllables with varying amounts of F0 information. Native and non-native listeners were asked to identify the tones of intact syllables, center-only syllables (with the first six and final eight glottal pulses removed), silent-center syllables (with all but the first six and final eight glottal pulses removed), and onset-only syllables (with all but the first six glottal pulses removed). Figure 3.2 shows waveforms of the four types of stimuli used.
Native listeners were highly accurate in identifying tones in all but the onset-only syllables. For example, despite the absence of the majority of the tonal contour, silent-center tones were identified as accurately as intact and center-only tones. Non-native listeners, on the other hand, were not able to identify the silent-center tones as accurately. These results indicate that native listeners were able to integrate


Fig. 3.2 From left: intact, center-only, silent-center, and onset-only stimuli

tonal information from the initial and final portions of the silent-center syllable to reconstruct the intended tones. Non-native listeners, however, did not take advantage of the dynamic tonal information in the remaining fragments of the syllable. Lee, Tao, and Bond (2008) and Lee, Tao, and Bond (2010a) extended Gottfried and Suiter (1997) by using the same types of stimuli, but with a larger number of native listeners and non-native listeners with Mandarin experience ranging from one to three years of classroom instruction. Lee and colleagues also used reaction time as an additional response measure since it is usually considered a more sensitive measure of processing differences. The accuracy results replicated Gottfried and Suiter (1997); native listeners identified silent-center tones as accurately as the intact and center-only syllables. The reaction time results, however, revealed subtle differences between the modified syllables and the intact syllables, indicating a processing cost associated with the limited F0 information available from the fragmented syllables. The ability of native listeners to recover missing tonal information is consistent with recent behavioral and neuroimaging evidence on Chinese sentence processing. Xu, Zhang, Shu, Wang, and Li (2013) showed that pitch-flattened sentences are as intelligible as normal sentences, indicating native listeners can use contextual information to access lexical meaning in sentences even if F0 information is altered substantially. Taken together, these findings suggest native listeners are quite capable of reconstructing lexical tones from reduced or altered F0 information at various levels of language comprehension.
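The stimulus manipulations reviewed above (gating, silent-center, center-only) are straightforward to sketch. In the sketch below, edit points are expressed as raw sample counts rather than the glottal-pulse counts Gottfried and Suiter (1997) actually used, so it approximates the stimulus construction rather than reimplementing it faithfully:

```python
def silent_center(signal, onset_n, offset_n):
    """Keep the first onset_n and last offset_n samples; silence the middle."""
    middle = [0.0] * (len(signal) - onset_n - offset_n)
    return list(signal[:onset_n]) + middle + list(signal[-offset_n:])

def center_only(signal, onset_n, offset_n):
    """The complementary edit: silence the edges, keep the center."""
    kept = list(signal[onset_n:len(signal) - offset_n])
    return [0.0] * onset_n + kept + [0.0] * offset_n

def gate(signal, proportion):
    """Gating paradigm (Grosjean, 1980): keep only an initial fraction."""
    return list(signal[:max(1, int(len(signal) * proportion))])

# Toy "syllable": 100 samples standing in for a tone-bearing rime.
syllable = [float(i) for i in range(100)]
sc = silent_center(syllable, onset_n=10, offset_n=15)
g = gate(syllable, 0.5)  # half-syllable gate, cf. Lee (2000)
```

Because identifying tones from silent-center stimuli requires integrating the preserved onset and offset F0, comparing accuracy across the four stimulus types isolates listeners' ability to exploit dynamic F0 information.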

3.3.2 Contextual Variation

The aforementioned studies on the perception of fragmented tones also investigated the role of context in tone perception. The effect of context on the acoustics and perception of lexical tones is well documented. In particular, the canonical F0 contour of a tone can be substantially altered by preceding and following tones due to tonal coarticulation (Xu, 1997). Nonetheless, native listeners are able to use their knowledge of the consequences of tonal coarticulation to compensate for contextual variations (Xu, 1994). That is, when asked to identify tones in context, native listeners


do not simply rely on the canonical F0 contour of the target tone to judge tone identity. Rather, they interpret surface F0 contours with consideration of the acoustic consequence of tonal coarticulation. Do non-native listeners also use contextual variations to facilitate tone perception from fragmented syllables? Gottfried and Suiter (1997) presented the fragmented tones with and without a following syllable zi (‘word’), which had a high-falling F0 contour. The results showed that tone perception accuracy by native listeners was substantially higher when the fragmented tone stimuli were presented in context, but non-native tone identification performance remained the same irrespective of the presence or absence of context. Confusion patterns also differed between the native and non-native listeners. For example, native listeners misidentified onset-only Tone 4 as Tone 1 in isolation, presumably because the high F0 onset (but not the low F0 offset) of Tone 4 was present in the stimuli, which resembles the high onset of Tone 1 (see Fig. 3.1). However, the Tone 4-Tone 1 confusion disappeared when the context was present, presumably because the low offset of Tone 4 carried over to the following syllable, resulting in a lowered onset in the following Tone 1. In other words, native listeners managed to infer from the lowered onset of Tone 1 that the preceding tone had a low offset, which is consistent with Tone 4. Non-native listeners, on the other hand, did not show such a change in confusion pattern as a result of context. Lee and colleagues (2008, 2010a) evaluated the contribution of context to fragmented tone identification by recording target tones in two carrier phrases. As a result, the offset F0 of the carrier tone and the onset F0 of the stimulus tone resulted in either a match or mismatch. 
The stimuli were presented in the original carrier phrases (matching contextual F0), excised from the carrier phrases (no contextual F0), or excised and cross-spliced with another carrier phrase (mismatching contextual F0). For the native listeners, there was no overall effect of splicing, indicating comparable accuracy and speed of processing across the three contexts. However, in the cross-spliced condition, syllables that were originally produced with a matching carrier tone were identified faster and more accurately, demonstrating native listeners’ sensitivity to contextual tonal variations. Non-native listeners, on the other hand, did not show such sensitivity to the original tonal context. That is, non-native tone identification was not modulated by contextual tonal variations as native tone identification was. Taken together, findings from the studies discussed so far suggest that native listeners are sensitive to dynamic tonal information from within a syllable (as indicated by the high accuracy in identifying silent-center tones) and from across syllables (as indicated by improved performance in context, and by sensitivity to F0 mismatch between adjacent tones). Non-native listeners, on the other hand, appear to concentrate on syllable-internal, canonical F0 information (as indicated by accurate performance on center-only syllables but not silent-center syllables, and by the lack of sensitivity to contextual tonal variation). In summary, these findings support the idea that native and non-native listeners use different strategies in dealing with acoustic variability in tone perception. However, it remains to be seen whether this observation generalizes to the processing of other sources of acoustic variability. Next, we turn to two common challenges in speech perception: speaker variability and noise.
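Of these two challenges, noise is typically operationalized in the laboratory by mixing speech with a masker at a fixed signal-to-noise ratio (SNR), as in the noise studies reviewed in the next section. The sketch below shows the standard mixing arithmetic: scale the noise so that the speech-to-noise power ratio hits a target SNR in dB. The sinusoidal "speech" and uniform-noise masker are toy stand-ins; the actual stimuli and noise types in the studies reviewed here differed.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 20*log10(rms(speech)/rms(scaled noise)) == snr_db."""
    scale = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + scale * n for s, n in zip(speech, noise)]

random.seed(0)
speech = [math.sin(2 * math.pi * i / 32) for i in range(1024)]  # toy "speech"
noise = [random.uniform(-1.0, 1.0) for _ in range(1024)]        # toy masker
mixed_0db = mix_at_snr(speech, noise, snr_db=0)    # equal speech/noise power
mixed_m15 = mix_at_snr(speech, noise, snr_db=-15)  # noise well above speech
```

At 0 dB the scaled noise has the same RMS power as the speech; at −15 dB, the hardest condition in the noise studies discussed below, the noise power is roughly 32 times that of the speech.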


3.3.3 Speaker Variability

Listening to different speakers is the norm in speech communication. As noted in the introduction, the same sound or word spoken by different speakers can be quite different acoustically, yet listeners rarely have trouble understanding or adapting to an unfamiliar speaker as long as they speak the same language. How speaker variability is handled is particularly relevant to lexical tone perception. Since lexical tone perception relies primarily on F0, and since F0 range varies across speakers, listeners most likely have to interpret F0 in the acoustic signal relative to a speaker’s specific range (Lee, 2009). Does speaker variability affect tone perception more than it does segmental perception? Are non-native listeners affected by speaker variability to a larger degree than native listeners? Building on earlier studies using fragmented tone stimuli, Lee, Tao, and Bond (2009) examined the effects of speaker variability and context on Mandarin tone identification from intact, silent-center, center-only, and onset-only syllables. The stimuli were presented in isolation or with a precursor carrier phrase. Native and non-native listeners were put under time pressure to identify the tones of the syllables. The literature on segmental processing indicates that adapting to different speakers demands cognitive resources, and this demand usually results in less accurate and more time-consuming responses to multiple-speaker stimuli than to single-speaker stimuli (Creelman, 1957). This observation is supported by Lee and colleagues (2009). As shown in Fig. 3.3, both native and non-native listeners had lower identification accuracy in multi-speaker presentation than in single-speaker presentation. However, there was no evidence that non-native listeners were affected by speaker variability to a greater extent than native listeners were.
That is, unlike fragmented F0 and absence of tonal context, speaker variability does not appear to pose a disproportionate challenge to non-native listeners. Lee, Tao, and Bond (2010b) further investigated the effect of speaker variability on native and non-native tone perception by using stimulus sets blocked by speaker and mixed across speakers. Previous studies showed that a mixed-speaker set is more challenging than a blocked-speaker set for tone identification (Wong & Diehl, 2003; Zhou, Zhang, Lee, & Xu, 2008). Lee and colleagues presented monosyllabic Mandarin words produced by three male and three female speakers in two presentation formats (blocked by speaker and mixed across speakers) with five levels of signal-to-noise ratios (quiet, 0, −5, −10, and −15 dB) to native listeners and non-native listeners with Mandarin experience ranging from one to four years. Figure 3.4 shows the accuracy results. For both native and non-native listeners, responses to stimuli blocked by speaker were faster and more accurate than responses to stimuli mixed across speakers. Native listeners outperformed non-native listeners, but the additional demand of processing mixed-speaker stimuli did not compromise non-native performance disproportionately. A possible explanation for the lack of difference between native and non-native tone perception in the speaker variability effect is the presence of tonal context in the stimuli. In particular, the stimuli in Lee et al. (2010b) were embedded in a

Fig. 3.3 Accuracy of identification of single- and multiple-speaker tones by native listeners (Lee et al., 2009) and non-native listeners (Lee, Tao, & Bond, 2010a). Error bar indicates standard error. [Figure omitted: accuracy (%) by modification (intact, center-only, silent-center, onset-only), single vs. multiple speakers, native and non-native panels]

carrier phrase Qing3shuo1__ (‘Please say__’) to make sure listeners could actually hear the target tones against heavy background noise. Although the carrier phrase was relatively short, listeners could have obtained sufficient information about the speakers, which could have neutralized the processing difference between native and non-native listeners. To evaluate the potential impact of context, Lee, Tao, and Bond (2013) attempted to replicate Lee et al. (2010b) by presenting the same set of stimuli in isolation instead of with the carrier phrase. Lee et al. (2013) also analyzed the data by baseline tone identification proficiency (obtained from accuracy in the blocked presentation without noise) in addition to years of Mandarin instruction with the consideration that the number of years of Mandarin instruction does not necessarily reflect the actual proficiency. Both analyses yielded the same conclusion, and Fig. 3.5 shows the results of the analysis by baseline proficiency. There was no evidence

Fig. 3.4 Accuracy of Mandarin tone identification as a function of speaker variability (blocked/mixed presentation), noise level (quiet to −15 dB SNR), and listener background (native and 1–4 years of instruction) in Lee et al. (2010b), reproduced with permission from Elsevier. Error bar indicates standard error. [Figure omitted: percent correct (%) by signal-to-noise ratio (dB); blocked and mixed presentation panels]

that the mixed-speaker presentation affected non-native listeners disproportionately, suggesting that speaker variability did not pose a special challenge to non-native listeners.

3.3.4 Noise in Tone Perception

Lee and colleagues (2010b, 2013) also examined the effect of noise on tone perception. Noise is arguably the most common adverse condition in speech perception. There is evidence that the perception of segmental phonemes in isolated syllables is usually not compromised disproportionately for non-native listeners, leading to the proposal that the source of disproportionate non-native difficulty with noise is not at a relatively low level of processing such as identifying consonants and vowels in isolated syllables (Bradlow & Alexander, 2007; Cutler, Weber, Smits, & Cooper,

Fig. 3.5 Accuracy of Mandarin tone identification as a function of speaker variability (blocked/mixed presentation), noise level (quiet to −15 dB SNR), and listener background (native and level of proficiency) in Lee et al. (2013), reproduced with permission from Taylor & Francis Ltd. Error bar indicates standard error. [Figure omitted: percent correct (%) by signal-to-noise ratio (dB); blocked and mixed presentation panels; listener groups: low, mid, high proficiency, and native]

2004). In contrast, non-native difficulties with speech perception in noise usually arise only when listeners are asked to process longer stretches of speech, which require more complex linguistic processing. In other words, disproportionate non-native difficulties with noise usually do not surface when the stimuli are relatively simple. Rather, non-native difficulties with noise accumulate across all levels of spoken language comprehension. Does the conclusion drawn from the segmental literature regarding native and non-native speech perception generalize to tone perception in noise? Surprisingly, little evidence is available to address this question. As noted, Lee et al. (2010b) used multi-speaker tone stimuli embedded in speech-shaped noise with a precursor carrier phrase. The stimuli were presented to native listeners and non-native listeners with Mandarin instruction varying from one to four years. As Fig. 3.4 shows, noise compromised tone perception performance in all listener groups. There was, however, no evidence that noise compromised performance of listeners with less Mandarin experience disproportionately. Since the stimuli were simple syllables


that do not require complex linguistic processing, this result appears to be consistent with the proposal noted earlier that non-native difficulties with noise do not arise from processing simple consonants, vowels, and tones (Bradlow & Alexander, 2007; Cutler et al., 2004). However, a follow-up analysis showed that when listeners were divided according to baseline performance instead of duration of Mandarin exposure, a significant noise level by baseline performance interaction emerged, suggesting a disproportionate noise effect depending on Mandarin proficiency. As discussed earlier, Lee et al. (2013) replicated their earlier findings (2010b) with the same set of stimuli but without the carrier phrase. Their data were also analyzed in terms of baseline performance in addition to years of Mandarin instruction. As Fig. 3.5 shows, noise did affect some listener groups disproportionately. However, it was the listeners with higher proficiency that were affected disproportionately by noise. This result is rather counterintuitive because listeners with lower proficiency were expected to be affected more by adverse conditions because of their less robust knowledge of the target language. Lee et al. (2013) speculated that the less proficient listeners could not identify tones well even in the easy baseline conditions; as a result, there was less room for their performance to decline than for the more proficient listeners.
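The noise manipulation in these studies scales masking noise relative to speech power to reach a target SNR. A minimal sketch of that relationship (the function and variable names are ours, not taken from the studies):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) == snr_db, then mix.
    speech and noise are 1-D sample arrays of equal length."""
    p_speech = np.mean(speech ** 2)   # average speech power
    p_noise = np.mean(noise ** 2)     # average noise power before scaling
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    scaled = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled
```

At −15 dB, the hardest condition in Figs. 3.4 and 3.5, the noise carries roughly 32 times the power of the speech, which is why accuracy collapses toward chance at that level.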

3.4 Knowledge-Based Tone Processing

Prior experience with a language affects how speech is processed. During language acquisition, listeners amass knowledge about a language’s phonological and morphological structure, as well as the statistical regularities of spoken sounds and words (Dahan, Magnuson, & Tanenhaus, 2001; Vitevitch & Luce, 1999). Listeners use this knowledge to assess the likelihood of a match between the acoustic signal and a stored representation. Although models of spoken word recognition still debate how and at what stage this knowledge influences word recognition (e.g., Cutler, 2012; Norris, McQueen, & Cutler, 2000; Samuel, 2001), there is agreement that listeners are sensitive to a variety of information about the input and draw on this information at some stage of word recognition. A well-known example of knowledge-based processing of speech is the Ganong effect (Ganong, 1980), which demonstrates that listeners tend to identify ambiguous speech sounds in a manner that results in a word. For instance, an ambiguous sound between the two velar plosives /g/ and /k/ tends to be identified as /g/ when English listeners hear it before –ift as in ‘gift.’ In contrast, that same ambiguous sound is identified as /k/ when listeners hear it before –iss as in ‘kiss.’ Can listeners use knowledge-based processing to overcome tonal variability and poor acoustic-based processing? In many ways, Mandarin serves as an ideal language to test this research question given the language’s constrained syllable phonology, tendency for tone contours to align with a syllable (Xu, 1999), and the direct mapping of syllable–tone combinations to morphemes, words, and written characters (DeFrancis, 1986; Duanmu, 2007; Myers, 2010). For example, the first-person


pronoun ‘I/me’ is the syllable wo produced with the dipping third tone. This syllable– tone combination serves as an individual morpheme; this morpheme can stand alone as a word, which in turn can be written with the character 我. While most modern Mandarin words by type are multisyllabic/multimorphemic, spoken corpora like SUBTLEX-CH (Cai & Brysbaert, 2010) indicate that the majority of speech tokens uphold this 1:1:1:1 mapping from syllable–tone to morpheme to word to written character. This suggests that listeners could potentially draw on different sources of knowledge to overcome tonal variability. We first outline these potential sources of information and then identify what role they may play in knowledge-based processing of tone.
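The lexical bias behind the Ganong effect can be pictured as a shift in the category boundary along a phonetic continuum when one endpoint forms a word. The following toy model is purely illustrative; the parameterization is our own, not a model from the literature:

```python
import math

def p_g_response(step, lexical_bias=0.0, boundary=5.0, slope=1.5):
    """P(/g/ response) at a point on a 9-step /g/-/k/ continuum.
    A positive lexical_bias shifts the boundary toward /k/, yielding
    more /g/ responses, as when the /g/ endpoint is a word ('gift')."""
    return 1.0 / (1.0 + math.exp(slope * (step - (boundary + lexical_bias))))

# At the ambiguous midpoint, a word-favoring bias tips identification toward /g/:
print(p_g_response(5))                    # 0.5: no lexical pull
print(p_g_response(5, lexical_bias=1.0))  # ~0.82: 'gift' context
```

The same machinery captures Fox and Unkefer's tonal version of the effect, discussed below: a non-word endpoint amounts to a bias term pushing the boundary away from the word endpoint.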

3.4.1 Sources of Linguistic Knowledge

Mandarin makes use of roughly 400 unique (C)V(C) syllables: a syllabary roughly one sixth the size of the English syllabary (Duanmu, 2009; see Fon, Chap. 2 of this volume). The syllable serves as the critical or ‘proximate’ unit in Mandarin perception and production (Chen, Chen, & Dell, 2002; O’Séaghdha, Chen, & Chen, 2010). Given the limited number of syllable types and their privileged role in speech, native speakers may track syllables’ distributional properties in a way that is beneficial for word recognition. One way a listener may do that is by tracking the frequency at which a syllable token occurs in speech. Like other speech sounds, syllables occur with certain statistical regularities. The syllable shi, for example, occurs so frequently that the Chinese linguist Chao Yuen Ren famously wrote a poem consisting of 92 characters, all of which share the syllable shi but differ in tone: Shi1 shi4 shi2 shi1 shi3 ‘The story of Mr. Shi eating lions.’ Native speakers, therefore, may be aware that a syllable like shi occurs frequently in speech while a syllable like nuan occurs infrequently in speech. At the syllable–tone level, a listener may track whether a particular syllable co-occurs with all four tones. An often-overlooked feature of Mandarin is that the four lexical tones are not evenly distributed across all syllables. Due to the historic evolution of tone, some syllables appear with all four tones, whereas some syllables only appear with one tone. As examples, the syllable gei only appears with tone 3 as gei3. The syllable cong appears with only two of the four tones as cong1 and cong2. The syllable ban appears with three of the four tones as ban1, ban3, and ban4. The syllable shi appears with all four tones as shi1, shi2, shi3, and shi4.
Calculations based on spoken and written corpora (Cai & Brysbaert, 2010; McEnery & Xiao, 2008; Wang, 1986) as well as a modern usage dictionary (CC-CEDICT, 2013) reveal that of the roughly 400 syllables in use, less than a third appear with all four lexical tones. Due to such gaps, modern Mandarin makes use of only around 1400 unique syllable–tone combinations (Duanmu, 2007). Listeners may not need to process tonal information for words in which the segmental information is sufficient for word identification (e.g., gei), whereas listeners may need to carefully process tone when the syllable co-occurs with all four tones (e.g., shi).


Table 3.1 Number of unique morphemes given a Mandarin syllable and tone

Syllable   Tone 1   Tone 2   Tone 3   Tone 4
you        10       19       11       14
li         0        32       16       61
qiong      1        11       0        0
cou        0        0        0        3

Native listeners may be aware of these co-occurrences and tolerate different degrees of tonal variability given the syllable–tone nonword gaps present in the lexicon. At the morpheme level, Mandarin possesses a relatively high degree of homophony. Wen (1980) reports that 11.6% of Mandarin words have homophones, compared to only 3.1% of English words. Similarly, Duanmu (2009) estimates Mandarin’s homophone density at 9.0 and that of English at roughly 2.4. On average, a particular syllable–tone combination corresponds to 11 semantically and orthographically unrelated morphemes or characters (Perfetti & Tan, 1998), though this can fluctuate across different syllables. For instance, the syllable–tone combination gei3 only corresponds to one morpheme, whereas the syllable–tone combination yi4 corresponds to approximately 90 homophonous morphemes (CC-CEDICT, 2013). As a result, tone varies greatly in its informativeness for morpheme identification. In many cases, without additional disambiguating speech or context, listeners may still be faced with a dense tonal neighborhood of homophonous morphemes (Packard, 1999, 2000). To illustrate the asymmetric role of tone in morpheme identification, Table 3.1 shows four syllables and their number of unique morphemes per tone according to a modern usage dictionary (CC-CEDICT, 2013). The syllables you and li both occur frequently in speech and both appear with relatively dense tonal neighborhoods. For the syllable you, each tone combination corresponds to at least 10 morphemes. For the syllable li, even though there is a syllable–tone gap, the remaining combinations all correspond to a relatively large number of morphemes, with li4 corresponding to over 60 homophonous morphemes. Tone appears to be less informative for these high frequency syllables associated with dense tonal homophone neighborhoods simply because additional context may be required for accurate morpheme identification. 
These two high token frequency syllables are juxtaposed with the two low token frequency syllables qiong and cou. The syllable qiong almost exclusively appears with the second tone as qiong2—qiong1 only corresponds to one extremely rare morpheme. Qiong3 and qiong4 are syllable–tone nonword gaps. The syllable cou only occurs with the fourth tone, thereby rendering tone redundant since the syllable alone is sufficient to access these morphemes. Given the relatively small number of syllable–tone combinations, and the varying size of these tonal neighborhoods, a listener may track the probability of a certain syllable co-occurring with each tone. Using the four syllables in Table 3.1 as examples, and assuming each morpheme occurs at the same rate (i.e., not accounting for any lexical frequency effects), a listener has a 100% probability of hearing the


syllable cou with the fourth tone, an extremely high probability (roughly 92%) of hearing qiong with the second tone, an over 50% probability of hearing li with the fourth tone, and roughly equal probabilities of hearing you with each tone. Thus, in addition to knowledge of whether a syllable co-occurs with all four tones, listeners may draw on knowledge regarding the relative size of a syllable’s tonal neighborhood. That is, a listener may not only be aware that qiong3 and qiong4 are both syllable–tone nonword gaps but also that qiong2 is much more probable than qiong1 in speech given the size of each tonal neighborhood. At the word/character level, syllable–tone combinations vary in their frequency of occurrence resulting in a Zipfian distribution (Zipf, 1935). Crucially, lexical frequency has implications for the activation of specific members within a tonal neighborhood. For instance, even though the syllable–tone combination you3 only corresponds to 11 morphemes—you2 and you4 both correspond to more morphemes—one of those you3 morphemes is the verb ‘to have.’ This verb 有 is the eleventh most frequently spoken word within the 33.5 million-word corpus SUBTLEX-CH (Cai & Brysbaert, 2010). Therefore, despite the syllable you appearing with all four tones and each tonal neighborhood resulting in a relatively similar number of homophonous morphemes, the lexical frequency of ‘to have’ causes listeners to experience more you3 exemplars than you1, you2 or you4 exemplars. As a result, native listeners may expect and recognize the verb ‘to have’ more often and faster than other you3 words, and faster than other you1, you2 and you4 words (e.g., Janssen, Bi, & Caramazza, 2008; Wiener & Turnbull, 2016; Zhou & Marslen-Wilson, 1994, 1995). In sum, native Mandarin listeners may track a variety of information about the spoken language. 
Listeners may track syllable information such as how frequently a syllable token occurs, whether it co-occurs with all four tones, and what the probabilities of these co-occurrences are. Listeners may also track the density of homophonous syllable–tone neighborhoods and the degree to which tone informs morpheme or word identification. Additionally, listeners may track the lexical frequency of particular words within a tonal neighborhood and across the Mandarin lexicon. We next discuss whether this statistical knowledge could be used during speech processing, particularly as a means of overcoming variability in the speech signal.
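The co-occurrence probabilities sketched above fall straight out of the morpheme counts in Table 3.1. A minimal illustration, assuming (as in the text) equal per-morpheme rates and no lexical frequency weighting:

```python
# Morpheme counts per tone, taken from Table 3.1.
TABLE_3_1 = {
    "you":   {1: 10, 2: 19, 3: 11, 4: 14},
    "li":    {1: 0,  2: 32, 3: 16, 4: 61},
    "qiong": {1: 1,  2: 11, 3: 0,  4: 0},
    "cou":   {1: 0,  2: 0,  3: 0,  4: 3},
}

def tone_probabilities(syllable):
    """P(tone | syllable), treating every morpheme as equally likely."""
    counts = TABLE_3_1[syllable]
    total = sum(counts.values())
    return {tone: n / total for tone, n in counts.items()}

print(tone_probabilities("cou")[4])              # 1.0: tone is fully redundant
print(round(tone_probabilities("qiong")[2], 2))  # 0.92
print(round(tone_probabilities("li")[4], 2))     # 0.56: just over half
```

A fuller model would weight each morpheme by its lexical frequency, which, as noted above for you3 ‘to have’, can shift these probabilities substantially.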

3.4.2 Evidence for Knowledge-Based Tone Processing

Fox and Unkefer (1985) first tested whether native Mandarin listeners draw on their knowledge of syllable–tone co-occurrences during tone categorization. Participants listened to syllable–tone combinations created from a nine-step tone continuum and were forced to categorize the stimulus as either Tone 1 or Tone 2. In the baseline condition, participants heard a word at each end of the continuum, such as fei1 ‘to fly’ and fei2 ‘fat.’ Results demonstrated categorical perception of tones similar to categorical perception of phonemes (e.g., Francis, Ciocca, & Ng, 2003; Harnad, 1987). These results were then compared to conditions in which one end of the continuum


contained a non-word, such as hei1 ‘black’ and hei2 (non-word) or shei1 (non-word) and shei2 ‘who.’ Native listeners’ tonal category boundaries shifted in the direction of the non-word endpoints for both word/non-word and non-word/word continua when compared to the word/word baseline. When the study was run on native English speakers with no knowledge of Mandarin, no such tonal category boundary shift was observed. These results indicated that native listeners are aware of syllable–tone co-occurrences and non-word gaps and draw on this knowledge when they process ambiguous speech sounds. Moreover, non-native speakers lack this lexical knowledge and therefore do not draw on it during non-native tonal categorization.

Fox and Unkefer’s study demonstrated that knowledge of syllable–tone combinations affects tone categorization, but to what degree does phonological, morphological, and lexical information affect spoken word recognition? In an eye tracking study, Wiener and Ito (2015) demonstrated that listeners are sensitive to syllable token frequencies, syllable–tone co-occurrence probabilities, and the relative size of tonal neighborhoods. Moreover, the authors showed that this statistical information affects the earliest stages of spoken word recognition. Wiener and Ito recorded participants’ eye movements while four frequency-controlled characters were displayed on a monitor. Participants were instructed to click on the character that matched the perceived spoken syllable–tone combination. Participants were auditorily presented with target words consisting of either a high or low token frequency syllable carrying either the most or least probable tone for that particular syllable. For instance, the high token frequency syllable you is most likely to appear in speech with the third tone as you3, largely due to the high frequency word ‘to have’ (i.e., adjusting for lexical frequency, given the syllable you, listeners have well over a 50% probability of hearing you3).
In contrast, you1 is the least likely you combination to occur because there are fewer you1 morphemes and no you1 word occurs at a high frequency (i.e., listeners have a less than chance probability of hearing you1). In addition to the on-screen target (e.g., you3), the three other on-screen words included the tonal competitor, which shared the same syllable but the opposite tonal probability (e.g., you1), and two other distractors that carried different, non-target syllables. Response time results from Wiener and Ito demonstrated that native Mandarin listeners mouse clicked fastest on low token frequency syllables carrying the most probable tone and slowest on low token frequency syllables carrying the least probable tone. Participants’ eye fixations revealed a similar pattern; participants looked fastest to characters corresponding to a low token frequency syllable with a more probable tone and slowest for characters corresponding to a low token frequency syllable with a less probable tone. Slower fixations and mouse-click responses to low token frequency syllables with less probable tones reflected the competition from the more probable tonal competitor. For high token frequency syllables like you, no effect of tonal probability was observed in eye fixations. Mouse-click response times showed a trend in which high probability tones were identified faster than low probability tones, but not at a significantly different speed. The authors’ results suggest that Mandarin listeners track and use syllable token frequencies, as well as syllable-specific tonal probabilities, though these probabilities are primarily used when listening to low token frequency syllables (i.e., syllables that occur less often


in speech and tend to carry fewer tonal homophones). Wiener and Ito argued that listeners form syllable-specific tonal hypotheses (cf. Ye & Connine, 1999) as soon as the speech signal begins to unfold, in part, because tone is more informative for word identification on these infrequent syllables. In a follow-up gating study, Wiener and Ito (2016) used the same stimuli to test how much of the acoustic signal is required for listeners to begin forming a syllable-specific tonal hypothesis. In the first gate, participants heard only the syllable onset. In each successive gate, participants heard 40 ms increments of the vowel. After each stimulus was heard, participants were forced to respond with the perceived syllable–tone combination. Results indicated that minimal acoustic cues from the onset and 40 ms of the vowel triggered knowledge-based processing. Listeners more accurately identified high token frequency syllables (and their tones) than low token frequency syllables (and their tones). An analysis of listeners’ correct-syllable–incorrect-tone responses revealed an effect of tonal probability for low token frequency syllables; participants reported the most probable tone for the perceived low token frequency syllable, even when that tone was acoustically dissimilar to the truncated stimuli. When the perceived syllable was a high token frequency syllable, participants did not demonstrate the same knowledge-based processing of tone; participants reported a tone that was acoustically similar to the truncated stimuli. This probability effect was short-lived. After hearing the onset and 120 ms, participants began reporting more acoustically similar tones irrespective of the perceived syllable. To summarize, a small yet growing body of work has shown that native Mandarin listeners track and use a variety of information regarding syllable token frequencies, syllable–tone co-occurrence probabilities, tonal neighborhood densities, and syllable–tone lexical frequencies.
This knowledge may be used to overcome tonal uncertainty, improve tonal categorization, and improve spoken word recognition. Knowledge-based processing of tone appears to be most useful in the identification of a relatively unique set of lexical candidates given infrequent syllable tokens and/or fewer tonal homophones. Can non-native listeners make use of a similar knowledge-based tone processing mechanism? This question was explored in a series of artificial language learning studies (Wiener, 2015). Monolingual English speakers and L2 learners with over a year of classroom Mandarin experience learned an artificial tonal language in which visual nonce symbols were associated with Mandarin-like monosyllables and tones. The stimuli were designed to mimic Mandarin’s rich statistical regularities including high and low syllable token frequencies and varied syllable–tone co-occurrence probabilities. After four consecutive days of training on the artificial language, monolingual participants demonstrated evidence of knowledge-based processing of tone in their mouse-clicks (Wiener, Ito, & Speer, 2016). The L2 participants—like the native Mandarin speakers tested in Wiener and Ito (2015)—looked at a higher rate to symbols corresponding to low token frequency syllables with more probable tones than to symbols corresponding to low token frequency syllables with less probable tones. Interestingly, Wiener, Ito, and Speer (2018) found that when learners of the artificial language were trained on multi-speaker input, they relied more on their knowledge of syllable–tone co-occurrence probabilities and less on the incoming acoustic


signal. As a result, learners exposed to single-speaker speech or low speaker variability recovered from incorrect probability-based predictions of tone more rapidly than participants exposed to multi-speaker speech or high speaker variability. These results suggest that L2 learners may, in part, rely on knowledge-based tone processing as a means of overcoming speaker variability and initial tonal perceptual uncertainty (Wiener & Lee, 2020). More recently, Wiener, Lee, and Tao (2019) expanded Wiener et al.’s (2016, 2018) findings by investigating knowledge-based tone processing in intermediate L2 Mandarin learners. The authors modified Wiener and Ito’s (2016) gating stimuli and tested L2 learners before and after roughly 10–12 weeks of university classroom instruction. An L1 group also performed the gating task as a baseline. Tone-only and syllable–tone word accuracy results at the final gate (i.e., the full acoustic signal) revealed that the L1 group was statistically more accurate than the L2 group at both tests. Although the L2 group made modest tone-only and word accuracy improvements, these gains were not statistically significant. Analyses of the early gates, in which the acoustic information was truncated and listeners had to rely on their Mandarin syllable–tone knowledge, revealed that both L1 and L2 listeners identified high token frequency syllables (and their tones) more accurately than low token frequency syllables (and their tones). An analysis of correct-syllable–incorrect-tone responses revealed that L1 speakers reported more probable syllable–tone combinations when faced with the consonant and up to 80 ms of vowel information, thus corroborating Wiener and Ito (2016). L2 learners only showed a trend toward greater probability-based errors, suggesting that more experience with Mandarin syllable–tone combinations may be required to trigger native-like knowledge-based tone processing.
It is hoped that future work will continue to explore what additional acoustic factors promote knowledge-based processing of tone, how L1 and L2 speakers navigate between acoustic-based and knowledge-based processing of tone, and what the neurocognitive correlates of these processes are (e.g., Yu, Wang, & Li, Chap. 5 of this volume; see also Politzer-Ahles, Wiener, & Zhang, 2017).
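The gating manipulation used in these studies (the onset alone, then successive 40 ms increments of the vowel, ending with the full signal) amounts to simple truncation of a syllable’s waveform. A rough illustration, with a helper of our own devising rather than the authors’ stimulus code, assuming a known onset length and a 16 kHz sample array:

```python
def make_gates(samples, onset_len, rate=16000, step_ms=40):
    """Return gated versions of a syllable: the onset alone, then the onset
    plus successive 40 ms chunks of the vowel, ending with the full signal."""
    step = int(rate * step_ms / 1000)  # 40 ms = 640 samples at 16 kHz
    gates = [samples[:onset_len]]      # gate 1: syllable onset only
    cut = onset_len + step
    while cut < len(samples):
        gates.append(samples[:cut])
        cut += step
    gates.append(samples)              # final gate: full acoustic signal
    return gates
```

For a 1000-sample syllable with a 100-sample onset, this yields gates of 100, 740, and 1000 samples. In the findings reviewed above, the probability-based responses surfaced at the earliest vowel gates and faded once listeners had heard the onset plus 120 ms of the vowel.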

3.5 Summary

In this chapter, we explored how native and non-native listeners make use of the acoustic signal and their linguistic knowledge to process tones in speech perception and spoken word recognition. The review of selected studies on processing acoustic variability in tone perception showed that not all sources of acoustic variability are equally disruptive to native and non-native tone perception. Whereas most adverse conditions compromised non-native tone perception disproportionately (fragmented F0, contextual variation, and noise), speaker variability appears to affect native and non-native tone perception similarly. It seems that non-native tone perception is compromised disproportionately only when syllable-internal, canonical F0 information is removed or altered. This observation is consistent with the


observation that non-native listeners rely primarily on syllable-internal, canonical F0 information for tone identification, whereas native listeners are able to use their knowledge of tonal coarticulation and contextual tonal variation to compensate for lost F0 information. Consequently, when syllable-internal, canonical F0 information is reduced (as in fragmented tones) or altered (as in tones excised from their original tonal context), non-native tone perception tends to be disrupted disproportionately. In contrast, speaker variability does not affect non-native tone perception disproportionately because speaker variability does not remove or alter syllable-internal, canonical F0 information. From a knowledge-based perspective, native listeners are able to process tone by drawing on their previous experience with speech. Because tone cannot occur devoid of segmental information, we identified multiple sources of information listeners may track and rely on during speech perception and spoken word recognition: syllable token frequencies, syllable–tone co-occurrence probabilities, syllable–tone homophone densities, and syllable–tone lexical frequencies. These various sources of information are the result of listeners generalizing over the phonological, morphological, and lexical patterns that emerge across the lexicon. Non-native learners are also able to process tone in a knowledge-based manner, though the timing and degree to which learners rely on this mechanism appears to depend upon learners’ Mandarin experience and the degree of acoustic variability in the speech signal. In sum, native and non-native listeners process tones much as they process segments: through acoustic-based and knowledge-based processing in the service of speech perception and accurate spoken word recognition.

References

Blicher, D. L., Diehl, R. L., & Cohen, L. B. (1990). Effects of syllable duration on the perception of the Mandarin tone 2/tone 3 distinction: Evidence of auditory enhancement. Journal of Phonetics, 18, 37–49.
Bradlow, A. R., & Alexander, J. A. (2007). Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners. Journal of the Acoustical Society of America, 121, 2339–2349.
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5, e10729.
CC-CEDICT. (2013). Online Chinese dictionary. https://www.mdbg.net.
Chen, J.-Y., Chen, T.-M., & Dell, G. S. (2002). Word-form encoding in Mandarin Chinese as assessed by the implicit priming task. Journal of Memory and Language, 46, 751–781.
Creelman, C. D. (1957). Case of the unknown talker. Journal of the Acoustical Society of America, 29, 655.
Cutler, A. (2012). Native listening. Cambridge, MA: MIT Press.
Cutler, A., Weber, A., Smits, R., & Cooper, N. (2004). Patterns of English phoneme confusions by native and non-native listeners. Journal of the Acoustical Society of America, 116, 3668–3678.
Dahan, D., Magnuson, J. S., & Tanenhaus, M. K. (2001). Time course of frequency effects in spoken-word recognition: Evidence from eye movements. Cognitive Psychology, 42, 317–367.
DeFrancis, J. (1986). The Chinese language: Fact and fantasy. University of Hawaii Press.

3 Acoustic-Based and Knowledge-Based Processing of Mandarin Tones …


Duanmu, S. (2007). The phonology of standard Chinese (2nd ed.). New York: Oxford University Press. Duanmu, S. (2009). Syllable structure: The limits of variation. New York: Oxford University Press. Fox, R. A., & Unkefer, J. (1985). The effect of lexical status on the perception of tone. Journal of Chinese Linguistics, 13, 69–89. Francis, A. L., Ciocca, V., & Ng, B. K. C. (2003). On the (non) categorical perception of lexical tones. Perception & Psychophysics, 65, 1029–1044. Gandour, J. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175. Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6, 110–125. Gottfried, T. L., & Suiter, T. L. (1997). Effects of linguistic experience on the identification of Mandarin Chinese vowels and tones. Journal of Phonetics, 25, 207–231. Grosjean, F. (1980). Spoken word recognition processes and the gating paradigm. Perception and Psychophysics, 28, 267–283. Guediche, S., Blumstein, S. E., Fiez, J. A., & Holt, L. L. (2013). Speech perception under adverse conditions: Insights from behavioral, computational, and neuroscience research. Frontiers in Systems Neuroscience, 7, 126. Harnad, S. (1987). Psychophysical and cognitive aspects of categorical perception: A critical overview. In Categorical perception: The groundwork of cognition (pp. 1–52). Cambridge University Press. Howie, J. M. (1976). Acoustical studies of Mandarin vowels and tones. Cambridge: Cambridge University Press. Janssen, N., Bi, Y., & Caramazza, A. (2008). A tale of two frequencies: Determining the speed of lexical access for Mandarin Chinese and English compounds. Language and Cognitive Processes, 23, 1191–1223. Johnson, K. A. (2005). Speaker normalization in speech perception. In D. Pisoni, R. Remez (Eds.), The handbook of speech perception (pp. 363–389). Wiley-Blackwell. Johnson, K., & Mullennix, J. W. (1997). 
Talker variability in speech processing. Morgan Kaufmann Publishers Inc. Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 29, 98–104. Laver, J. (1994). Principles of phonetics. Cambridge: Cambridge University Press. Lecumberri, M. L. G., Cooke, M., & Cutler, A. (2010). Non-native speech perception in adverse conditions: A review. Speech Communication, 52, 864–886. Lee, C.-Y. (2000). Lexical tone in spoken word recognition: A view from Mandarin Chinese. Unpublished doctoral dissertation. Brown University. Lee, C.-Y. (2009). Identifying isolated, multispeaker Mandarin tones from brief acoustic input: A perceptual and acoustic study. Journal of the Acoustical Society of America, 125, 1125–1137. Lee, C.-Y., Tao, L., & Bond, Z. S. (2008). Identification of acoustically modified Mandarin tones by native listeners. Journal of Phonetics, 36, 537–563. Lee, C.-Y., Tao, L., & Bond, Z. S. (2009). Speaker variability and context in the identification of fragmented Mandarin tones by native and non-native listeners. Journal of Phonetics, 37, 1–15. Lee, C.-Y., Tao, L., & Bond, Z. S. (2010a). Identification of acoustically modified Mandarin tones by non-native listeners. Language and Speech, 53, 217–243. Lee, C.-Y., Tao, L., & Bond, Z. S. (2010b). Identification of multi-speaker Mandarin tones in noise by native and non-native listeners. Speech Communication, 52, 900–910. Lee, C.-Y., Tao, L., & Bond, Z. S. (2013). Effects of speaker variability and noise on identifying isolated Mandarin tones by native and non-native listeners. Speech, Language and Hearing, 16, 46–54. Liu, S., & Samuel, A. G. (2004). Perception of Mandarin lexical tones when F0 information is neutralized. Language and Speech, 47, 109–138. Luce, P. A., & McLennan, C. T. (2005). Spoken word recognition: The challenge of variation. In D. Pisoni & R. Remez (Eds), The handbook of speech perception (pp. 590–609). Wiley-Blackwell.


C.-Y. Lee and S. Wiener

Marslen-Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29–63. Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language and Cognitive Processes, 27, 953–978. McEnery, T., & Xiao, R. (2008). The Lancaster Corpus of Mandarin Chinese (LCMC). https://www.lancaster.ac.uk/fass/projects/corpus/LCMC/. Myers, J. (2010). Chinese as a natural experiment. The Mental Lexicon, 5, 421–435. Norris, D., McQueen, J. M., & Cutler, A. (2000). Merging information in speech recognition: Feedback is never necessary. Behavioral and Brain Sciences, 23, 299–325. O’Seaghdha, P. G., Chen, J.-Y., & Chen, T.-M. (2010). Proximate units in word production: Phonological encoding begins with syllables in Mandarin Chinese but segments in English. Cognition, 115, 282–302. Packard, J. L. (1999). Lexical access in Chinese speech comprehension and production. Brain and Language, 68, 89–94. Packard, J. L. (2000). The morphology of Chinese: A linguistic and cognitive approach. Cambridge: Cambridge University Press. Perfetti, C. A., & Tan, L. H. (1998). The time course of graphic, phonological, and semantic activation in Chinese character identification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 24, 101–118. Politzer-Ahles, S., Wiener, S., & Zhang, C. (2017). Predictive tones facilitate Mandarin lexical identification: Evidence from ERPs. In International Conference on Theoretical East Asian Psycholinguistics, Hong Kong, 10–12 March. Samuel, A. G. (2001). Knowing a word affects the fundamental perception of the sounds within it. Psychological Science, 12, 348–351. Stevens, K. N., & Hanson, H. M. (2010). Articulatory-acoustic relations as the basis of distinctive contrasts. In W. J. Hardcastle, J. Laver, & F. E. Gibbon (Eds.), The handbook of phonetic sciences (2nd ed., pp. 424–453).
Wiley-Blackwell. Strange, W., Jenkins, J. J., & Johnson, T. L. (1983). Dynamic specification of coarticulated vowels. Journal of the Acoustical Society of America, 74, 695–705. Vitevitch, M. S., & Luce, P. A. (1999). Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40, 374–408. Wang, H. (1986). Modern Chinese frequency dictionary. Beijing: Beijing Language Institute Press. Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. Journal of the Acoustical Society of America, 106, 3649–3658. Wen, W. (1980). Cong Yingwen de tongxingci lai kan hanyu pinyin wenzi de tongyinci [A study of Chinese homophones from the view of English homographs]. Yuwen xiandaihua [Modernizing Our Language], 2, 120–124. Whalen, D. H., & Xu, Y. (1992). Information for Mandarin tones in the amplitude contour and in brief segments. Phonetica, 49, 25–47. Wiener, S. (2015). The representation, organization and access of lexical tone by native and non-native Mandarin speakers. Unpublished doctoral dissertation. The Ohio State University. Wiener, S., & Ito, K. (2015). Do syllable-specific tonal probabilities guide lexical access? Evidence from Mandarin, Shanghai and Cantonese speakers. Language, Cognition & Neuroscience, 30, 1048–1060. Wiener, S., & Ito, K. (2016). Impoverished acoustic input triggers probability-based tone processing in mono-dialectal Mandarin listeners. Journal of Phonetics, 56, 38–51. Wiener, S., Ito, K., & Speer, S. R. (2016). Individual variability in the distributional learning of L2 lexical tone. In J. Barnes, A. Brugos, S. Shattuck-Hufnagel, & N. Veilleux (Eds.), Speech Prosody (pp. 538–542). Boston, MA: Speech Prosody 2016. Wiener, S., Ito, K., & Speer, S. R. (2018). Early L2 spoken word recognition combines acoustic-based and probability-based processing. Language and Speech, 61, 632–656.


Wiener, S., & Lee, C.-Y. (2020). Multi-talker speech promotes greater knowledge-based spoken Mandarin word recognition in first and second language listeners. Frontiers in Psychology, 11, 214. Wiener, S., Lee, C.-Y., & Tao, L. (2019). Statistical regularities affect the perception of second language speech: Evidence from adult classroom learners of Mandarin Chinese. Language Learning, 69, 527–558. Wiener, S., & Turnbull, R. (2016). Constraints of tones, vowels and consonants on lexical selection in Mandarin Chinese. Language and Speech, 59, 59–82. Wong, P. C. M., & Diehl, R. L. (2003). Perceptual normalization for inter- and intra-talker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46, 413–421. Xu, G., Zhang, L., Shu, H., Wang, X., & Li, P. (2013). Access to lexical meaning in pitch-flattened Chinese sentences: An fMRI study. Neuropsychologia, 51, 550–556. Xu, Y. (1994). Production and perception of coarticulated tones. Journal of the Acoustical Society of America, 95, 2240–2253. Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25, 61–83. Xu, Y. (1999). Effects of tone and focus on the formation and alignment of F0 contours. Journal of Phonetics, 27, 55–105. Ye, Y., & Connine, C. M. (1999). Processing spoken Chinese: The role of tone information. Language and Cognitive Processes, 14, 609–630. Zhou, N., Zhang, W., Lee, C.-Y., & Xu, L. (2008). Lexical tone recognition with an artificial neural network. Ear and Hearing, 29, 326–335. Zhou, X., & Marslen-Wilson, W. (1994). Words, morphemes and syllables in the Chinese mental lexicon. Language and Cognitive Processes, 9, 393–422. Zhou, X., & Marslen-Wilson, W. (1995). Morphological structure in the Chinese mental lexicon. Language and Cognitive Processes, 10, 545–600. Zipf, G. K. (1935). The psycho-biology of language. Oxford, England: Houghton Mifflin.

Chapter 4

Individual Differences in Lexical Tone Learning

Erin M. Ingvalson and Patrick C. M. Wong

Abstract It is now well established that second language speech training results in large individual variation in learning outcomes. Native English speakers learning the lexical tones of Mandarin Chinese are no exception (e.g., Wang et al., 1999). In this chapter, we review a series of studies undertaken by our group investigating both the sources of individual differences in lexical tone learning (Chandrasekaran et al., 2010; Wong et al., 2007, 2008) and how such differences can be mediated by matching learners to the training paradigm best suited to their baseline phonological perception (Ingvalson et al., 2013; Perrachione et al., 2011). We also include studies that have sought to identify possible genetic markers of individual variation in lexical tone learning outcomes (Wong et al., 2012a, b) and extensions of our training paradigms to older adults (Ingvalson et al., 2017), providing further insight into individual variation across learning populations.

E. M. Ingvalson
School of Communication Science and Disorders, Florida State University, Tallahassee, USA

P. C. M. Wong (B)
Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, S.A.R., China
e-mail: [email protected]
Brain and Mind Institute, The Chinese University of Hong Kong, Room G03, Leng Kau Kui Building, Shatin, N.T., Hong Kong, S.A.R., China

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_4

In second language speech perception research, there have been numerous efforts to train listeners to perceive non-native speech segments (e.g., Ingvalson, Holt, & McClelland, 2012; Iverson & Evans, 2009; Logan, Lively, & Pisoni, 1991; Sebastian-Galles & Soto-Faraco, 1999). More recently, these efforts have turned to training listeners to perceive lexical tone, recognizing that though lexical tone is a suprasegmental contrast, it nonetheless represents a novel phonological dimension to listeners who are unfamiliar with tone languages (e.g., Shen, 1989). As with non-native segmental training, non-native lexical tone-training research has found extensive individual variation in outcomes. We first review the history of lexical tone training,


demonstrating that though traditional approaches to non-native speech perception training are successful in aggregate, there remain extensive individual differences in learning outcomes, with some learners approaching native-like performance post-training whereas other learners make very little progress over the course of training. We then seek to identify the sources of individual differences in learning outcomes, considering such disparate possibilities as musical training, neuroanatomical and neurophysiological variability, and genetic variation. Having reviewed the possible sources of individual variation, we review work that seeks to optimize learning outcomes. We close this chapter by highlighting our group’s ongoing efforts in lexical tone training, branching out into understanding learning differences between older and younger adults and seeking to understand the mechanisms of pitch perception and pitch learning in individuals with congenital amusia.

4.1 History of Lexical Tone Training

Wang, Spence, Jongman, and Sereno (1999) undertook one of the first efforts to train native English listeners to perceive lexical tone. Building on the high-variability phonetic training paradigm developed by Logan et al. (1991) to train native Japanese listeners to perceive the English /r–l/ contrast, they trained native English listeners to perceive the four tones of Mandarin Chinese (we leave a discussion of the tone contours of Mandarin for elsewhere in this volume; for one example see Lee & Wiener, Chap. 3). This paradigm had been shown to lead to successful generalization both to untrained tokens produced by trained talkers and to untrained tokens produced by untrained talkers (Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1999; Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997; Lively, Logan, & Pisoni, 1993; Logan et al., 1991; but see Ingvalson, Ettlinger, & Wong, 2014 for potential limitations of the paradigm). Consistent with earlier uses of the paradigm, following training, trained listeners were more accurate than untrained listeners at identifying both trained and novel tokens and talkers. Perhaps more impressively, the degree of tone confusion decreased substantially following training. However, there were extensive individual differences in learning outcomes, with some trained listeners making large gains in their tone identification performance whereas other trained listeners made much smaller gains. Building on the initial perceptual training study, Wang et al. (2003a) went on to determine whether perceptual training would lead to improvements in learners’ ability to produce Mandarin tones (cf. Bradlow et al., 1997). Both perceptual judgments by native Mandarin listeners and acoustic measurements taken pre- and post-training demonstrated that perceptual training alone was sufficient to improve native English speakers’ production of Mandarin tones.
As in the perceptual training, though, there were extensive individual differences in outcomes, with some learners making large gains in production accuracy (both as judged by the native listeners and in the acoustic measurements) whereas other learners made much smaller gains.


Finally, Wang et al. (2003b) investigated the cortical effects of high-variability phonetic training of Mandarin tones in native English listeners. Using a functional magnetic resonance imaging (fMRI) paradigm during which listeners identified lexical tones, they determined that, pre-training, all listeners showed activation in areas associated with language processing, including Broca’s area, Wernicke’s area, auditory cortex, and supplementary motor areas. Post-training, for all listeners, improvements in lexical tone identification were associated with an increase in the spatial extent of activation in the left superior temporal gyrus and the emergence of activity in the right inferior frontal gyrus (see also Yu, Wang, & Li, Chap. 5, this volume). This pattern of activation was taken as evidence that learning a novel phonology involves expansion of existing language-related areas and recruitment of additional areas. Once again, though, there were individual differences in learning gains, and the fact that these gains correlated with individual differences in patterns of neural activation suggests that the source of the differences in learning success may lie in the way the brain processes lexical tone. We explore this possibility more fully in the following section. There have been other efforts to teach native English listeners to perceive lexical tone beyond the high-variability phonetic training paradigm applied to Mandarin tone (e.g., Wang et al., 1999, 2003a, 2003b). Caldwell-Harris, Lancaster, Ladd, Dediu, and Christiansen (2015) sought to determine, first, whether lexical tone could be learned in a statistical learning paradigm (Saffran, Aslin, & Newport, 1996) and, second, what learner-level characteristics predicted the extent to which listeners were able to learn lexical tone in that paradigm.
The training stimuli were based on the lexical tone patterns of African languages. Using African lexical tone patterns as stimuli allowed native speakers of Asian tone languages—including Mandarin Chinese and Thai—to participate, permitting investigation of the role of existing lexical tone knowledge in the statistical learning of novel lexical tone patterns. Following the exposure phase, listeners were presented with minimal pairs that differed either segmentally or suprasegmentally and were asked to indicate which member of the pair was a “word” in the earlier speech stream (see Saffran et al., 1996 for a fuller description of the statistical learning paradigm). Listeners familiar with a tone language and those who were not (i.e., native English listeners) performed equally well at identifying words using segmental cues. However, listeners who spoke a tone language in addition to English were significantly better at identifying words using lexical tone cues than listeners who did not speak a tone language; listeners who had no knowledge of a tone language were at chance on this task. Though some earlier work suggested that musical training influenced listeners’ ability to categorize lexical tone (Alexander, Wong, & Bradlow, 2005; Chang, Hedberg, & Wang, 2016; Kempe, Bublitz, & Brooks, 2015; Wu et al., 2015), regression analyses indicated no effect of musicianship, instead indicating that Asian language experience was the strongest predictor of success in learning tone words. A second investigation of lexical tone training also utilized a statistical learning paradigm. Building on the unimodal and bimodal distributions developed by Maye and colleagues (Maye, Werker, & Gerken, 2002), the hypothesis tested was whether


passive listening to a speech stream that included more exemplars from the endpoints of the lexical tone continuum—a bimodal distribution—would result in better learning of the lexical tone contrast than passive listening to a speech stream that included more exemplars from the center of the continuum—a unimodal distribution (Ong, Burnham, & Escudero, 2015). Passive listening resulted in no learning in either condition, but a second experiment that required listeners to actively attend to each sound did produce better learning in the bimodal condition than in the unimodal condition. Though statistical learning paradigms are meant to capture naturalistic learning, these two studies (Caldwell-Harris et al., 2015; Ong et al., 2015) suggest that, to the extent second language learners are able to learn lexical tone via statistical learning, such learning is very limited and may only be successful for listeners who are already proficient in a language that uses tone phonemically. As a result, for the remainder of this chapter we will focus on individual differences in learning using variations of the high-variability phonetic training paradigm, and on how this paradigm might be modified to optimize individual outcomes.
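The unimodal/bimodal manipulation amounts to different presentation-frequency profiles over a stimulus continuum. The sketch below illustrates the idea under stated assumptions: the eight-step continuum, the frequency weights, and the function name are hypothetical, in the spirit of Maye et al. (2002), and are not the actual stimuli of Ong et al. (2015).

```python
import random

# Eight equal steps along a hypothetical pitch (F0) continuum between
# two tone-category endpoints; step values are illustrative.
continuum = list(range(1, 9))

# Presentation frequencies per step (hypothetical weights).
# Bimodal: most tokens near the category endpoints (steps 2 and 7).
# Unimodal: most tokens near the continuum center (steps 4 and 5).
bimodal = [1, 4, 3, 1, 1, 3, 4, 1]
unimodal = [1, 1, 3, 4, 4, 3, 1, 1]

def familiarization_stream(weights, n_tokens, seed=0):
    """Sample a passive-exposure stream whose token frequencies follow `weights`."""
    rng = random.Random(seed)
    return rng.choices(continuum, weights=weights, k=n_tokens)

stream = familiarization_stream(bimodal, n_tokens=256)
# A bimodal stream over-represents tokens near the endpoints, the cue
# hypothesized to support forming two tone categories; a unimodal stream
# concentrates tokens at the center, consistent with a single category.
```

Note that both weight vectors sum to the same total, so the two conditions differ only in the shape of the distribution, not in overall exposure.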

4.2 Individual Differences in Lexical Tone Learning Outcomes

As discussed above, though training native English listeners to perceive Mandarin Chinese lexical tone led to improved identification of trained talkers and tokens, improved identification of untrained talkers and tokens, improved production, and changes in patterns of neural activation (Wang et al., 1999, 2003a, 2003b), there were extensive individual differences in learning outcomes. In segmental speech learning, at both the first and second language level, sensitivity to the to-be-learned phonology had been found to be important for successful learning (e.g., Werker & Stager, 2000), leading to the hypothesis that a similar sensitivity to tone may be important for successful lexical tone learning (Wong & Perrachione, 2007). Noting also that monolingual English-speaking musicians had performed better than monolingual English-speaking non-musicians when identifying lexical tone (Alexander et al., 2005), Wong and Perrachione (2007) further hypothesized that musical training may be important for lexical tone acquisition (note that this is in the context of the high-variability phonetic training paradigm; the lack of a music effect in Caldwell-Harris et al. (2015) is likely due to the limitations of the statistical learning paradigm for lexical tone). Though they used many of the hallmarks of the high-variability phonetic training paradigm—natural speech tokens, multiple talkers, multiple phonetic contexts, and feedback—instead of training listeners to identify lexical tones (cf. Wang et al., 1999), Wong and Perrachione utilized a lexical learning task in which listeners were trained to associate each tone–syllable pair with a meaning. Importantly, each syllable was matched with three possible tones, meaning that listeners had to master the lexical tone contrasts in order to successfully complete the word-learning task. Listeners were trained to criterion, defined as 95%


accuracy for two consecutive sessions (successful learners) or as a failure to improve by more than 5% over four consecutive sessions (less successful learners). Consistent with the hypothesis that phonological sensitivity for lexical tone is an important predictor of learning success, a pretest that measured listeners’ ability to identify pitch patterns accounted for 49% of the variance in listeners’ learning outcomes. Also as expected, musicians were more likely to be successful learners than were non-musicians (78% of successful learners were musicians, whereas only 12% of less successful learners were). In a similar paradigm, Cooper and Wang (2012) trained native English listeners and native Thai listeners to learn words that differed on the basis of Cantonese tones. For the native English listeners, final learning performance was reliably predicted by baseline ability to identify Cantonese tones, which accounted for 59% of the variance (cf. Wong & Perrachione, 2007); conversely, baseline tone identification performance was not a reliable predictor of final learning outcomes for native Thai listeners. There was also a significant benefit of previous musical experience for native English listeners, with musicians reaching a higher level of tone word learning than non-musicians. However, musicianship was not found to be beneficial for the native Thai listeners, in that there was no difference in ultimate learning performance between musicians and non-musicians. Finally, there was a significant interaction of language background and musicianship such that native English musicians’ final learning performance was on par with that of native Thai listeners.
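The training-to-criterion rule can be expressed as a simple classification over per-session accuracies. This is one reading of the criterion, a sketch only: the exact windowing in the original study may differ, and the function name and example accuracy runs below are invented, not taken from Wong and Perrachione (2007).

```python
def classify_learner(session_accuracies):
    """Apply a training-to-criterion rule to per-session accuracies (0-1).

    Successful: at least 95% accuracy on two consecutive sessions.
    Less successful: no more than a 5-point gain across four
    consecutive sessions without having reached criterion.
    Checked session by session, so training stops at whichever
    criterion is met first.
    """
    for i, acc in enumerate(session_accuracies):
        # Success criterion: this session and the previous one >= 95%.
        if i >= 1 and acc >= 0.95 and session_accuracies[i - 1] >= 0.95:
            return "successful"
        # Plateau criterion: <= 5-point improvement over 4 sessions.
        if i >= 3:
            window = session_accuracies[i - 3:i + 1]
            if max(window) - window[0] <= 0.05:
                return "less successful"
    return "still training"

# Invented example runs:
print(classify_learner([0.60, 0.78, 0.90, 0.96, 0.97]))  # successful
print(classify_learner([0.55, 0.57, 0.58, 0.58, 0.59]))  # less successful
```

A trained-to-criterion design like this yields a binary grouping rather than a continuous outcome, which is why the predictor analyses above report variance accounted for in group membership and learning trajectories rather than a single post-test score.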
Though there is a clear benefit of earlier experience with a tonal language when learning a new language that uses lexical tone (cf. Caldwell-Harris et al., 2015), these data, together with those of Wong and Perrachione (2007), paint a preliminary picture of a possible source of individual differences in lexical tone learning among listeners who are not native speakers of a tone language: baseline aptitude for identifying pitch patterns, possibly as a result of previous musical training. Wang et al. (2003b) found correlations between listeners’ behavioral outcomes and changes in the patterns of neural activation following training to identify lexical tone, demonstrating individual variation in both post-training identification performance and post-training patterns of neural activation. Wong, Perrachione, and Parrish (2007) hypothesized that such individual variation in patterns of neural activation could differentiate successful from less successful learners; in particular, because the training task of Wong and Perrachione (2007) required listeners to associate syllable–tone pairs with lexical meanings, they hypothesized that successful learners would show post-training activation in a relatively small network consisting primarily of regions associated with language processing, whereas less successful learners would show post-training activation in a more diffuse network incorporating areas associated with attention and nonlinguistic pitch processing. The participants from Wong and Perrachione took part in an fMRI protocol that asked them to discriminate the pitch patterns in the training words both before and after training. The data were consistent with these hypotheses: successful learners showed a shift toward a language-based network of activation post-training, whereas less successful learners’ post-training activation was more diffuse (see also Yu, Wang, & Li, Chap. 5, this volume). Thus, not only does


baseline aptitude for pitch perception influence behavioral outcomes, it also influences the pattern of neural activation post-training. Perhaps more interestingly, there were pre-training differences between successful and less successful learners in the patterns of neural activation in response to the lexical tone stimuli in regions associated with language processing, and these pre-training differences reliably predicted which listeners would ultimately master the lexical tone task. We therefore see evidence that differences in learners’ post-training outcomes in lexical tone learning can be traced to pre-training differences in the way listeners’ brains respond to the training stimuli. We have now seen evidence that behavioral responses to pitch patterns (tones removed from their lexical context) and neural patterns of activation in response to lexical tone both reliably predict which learners will successfully learn words that differ in lexical tone. Completing this series, Wong et al. (2008) investigated whether there were neuroanatomical differences between successful and less successful learners that could reliably predict learning success. Relative to less successful learners, successful learners had a larger volume of Heschl’s gyrus; Heschl’s gyrus has been associated with nonlinguistic pitch processing and language learning and had therefore been hypothesized to be important for learning lexical tone languages. Across this series of three studies, then, we see evidence that individual variation in lexical tone learning can be traced back to individual variation in neuroanatomy and neurophysiology, and that these individual differences can be reliably predicted via a behavioral test of listeners’ baseline phonological sensitivity to pitch patterns.
The above neuroanatomical and neurophysiological results, as well as the original individual differences in neural responses collected by Wang et al. (2003b), were all collected from the cortex. However, there are structures in the midbrain known to be important for auditory processing, and differences in the ability to perceive pitch patterns at the level of these structures could also be predictive of variation in lexical tone learning. One such structure is the inferior colliculus (IC), a midbrain structure involved in auditory processing. To test the possibility that individual variation in IC activity might predict lexical tone learning success, before and after training, native English-speaking listeners completed an fMRI paradigm in which they listened to repeated instances of the lexical tone stimuli used in training (similar to the stimuli from Wong & Perrachione, 2007) as well as an auditory brainstem response (ABR) task in which they listened to a single syllable with a rising tone (Chandrasekaran, Kraus, & Wong, 2011). Training procedures were similar to Wong and Perrachione (2007), except that all listeners trained for nine days rather than to criterion. Listeners who showed a reduction in neural response in the IC following repeated instances of the same lexical tone—an indicator of more efficient neural processing, called repetition suppression—also showed better tracking of the rising pitch on the ABR measure. Conversely, listeners who showed an increase in neural response in the IC following repeated instances of the same lexical tone—an indicator of less efficient neural processing, called repetition enhancement—showed poorer tracking of the rising pitch on the ABR measure. Those listeners who showed a repetition suppression response pre-training were also those listeners who


went on to be successful learners, demonstrating that neurophysiological differences between successful and less successful learners exist not only at the cortical level, but also in midbrain structures important for auditory processing. With the exception of the neuroanatomical measurements of Heschl’s gyrus, all the above studies asked participants to engage with the lexical tone stimuli in some way, even if only by passive listening to repeated instances. However, the brain engages in spontaneous activity even while at rest, aptly termed resting state activation. Of recent interest is the question of how such patterns of resting state activation might predict performance on other tasks, such as lexical tone learning (Harmelech & Malach, 2013). Using the same training procedure as Chandrasekaran et al. (2011), Deng, Chandrasekaran, Wang, and Wong (2016) sought to determine whether lexical tone learning success could be predicted on the basis of pre-training patterns of resting state activation. Indeed, patterns of resting state activation were predictive of lexical tone learning success, with activation localized to the left superior temporal gyrus—associated with language processing—being positively correlated with learning outcomes. Knowing that individual variation in lexical tone learning outcomes can be traced to individual variation in neuroanatomy and neurophysiology, one question that naturally arises is where these variations in neuroanatomy and neurophysiology come from. A population-based study demonstrated a significant correlation between the frequency of derived alleles in the ASPM and MCPH1 genes and the use of lexical tone in a language, suggesting that genes may give rise to brain differences that ultimately lead to tone perception differences (Dediu & Ladd, 2007).
To test this possibility, native English speakers of European descent (i.e., no ancestry in areas associated with linguistic tone) completed the phonological awareness test for pitch patterns used by Wong and Perrachione (2007), completed an fMRI procedure during which they heard repeated instances of Mandarin lexical tones, and submitted a genetic sample (Wong, Chandrasekaran, & Zheng, 2012a). A derived allele of the ASPM gene was found to be a significant predictor of both phonological sensitivity for pitch patterns and the degree of repetition suppression to lexical tone; the predictive relationship between the derived allele and repetition suppression remained even after the listeners’ phonological sensitivity for pitch patterns was factored out. Having tied both behavioral and neurological predictors of individual variation in lexical learning outcomes to genetic variation, we close this section by returning to behavioral predictors of learning success, namely phonological sensitivity to pitch patterns. As noted by Gandour (1983), native Mandarin listeners tend to give more weight to the direction of the pitch contour, whereas native English listeners tend to give more weight to pitch height. The above investigations of listeners’ phonological sensitivity to pitch patterns did not differentiate whether listeners were weighting pitch height or pitch direction more heavily, though one might surmise that successful learners were giving more weight to pitch direction, leading to their success in learning the lexical tone words. Indeed, prior to training, successful learners gave more weight to pitch direction and were better able to identify pitch direction relative to less successful learners (Chandrasekaran, Sampath, & Wong, 2010), supporting the hypothesis that successful learners come to the learning task already weighting the acoustic cues within pitch


E. M. Ingvalson and P. C. M. Wong

patterns in a more native-like manner (see Lee & Wiener, Chap. 2, this volume, for a discussion of how non-native listeners utilize acoustic cues to perceive tone after learning). Thus, though the situation for less successful learners might initially seem bleak, given that their struggles mastering lexical tone appear to have a genetic basis, understanding how successful and less successful learners approach the task differently prepares us to optimize the learning outcomes of less successful learners by adjusting training paradigms to capitalize on their learning strategies. We discuss some of these efforts in the following section.
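The cue-weighting notion can be made concrete with a small simulation. The sketch below is purely illustrative and is not the analysis reported by Chandrasekaran et al. (2010): it generates synthetic tone stimuli varying in pitch height and pitch direction, simulates responses from a hypothetical "direction-weighted" listener, and recovers that listener's relative cue weights with a simple logistic regression. All data, weights, and parameter values are invented for the illustration.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Plain stochastic-gradient logistic regression (illustrative only)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # gradient of cross-entropy loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def cue_weights(w):
    """Normalize absolute coefficients so they sum to 1."""
    total = sum(abs(x) for x in w)
    return [abs(x) / total for x in w]

random.seed(1)
# Hypothetical stimuli: (pitch height, pitch direction), both standardized.
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(400)]

# A simulated "direction-weighted" listener: responses depend mostly on
# pitch direction (invented true weights 0.5 vs. 2.0).
def respond(h, d, w_h=0.5, w_d=2.0):
    p = 1.0 / (1.0 + math.exp(-(w_h * h + w_d * d)))
    return 1 if random.random() < p else 0

y = [respond(h, d) for h, d in X]

w, b = fit_logistic(X, y)
wh, wd = cue_weights(w)
print(f"height weight: {wh:.2f}, direction weight: {wd:.2f}")
```

Fitting recovers the simulated listener's bias: the normalized weight on pitch direction comes out larger than the weight on pitch height, which is the kind of contrast used to characterize successful versus less successful learners.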

4.3 Optimizing Learning Success

We closed the previous section by observing that, when learning to associate tone–syllable pairs with word meanings, successful learners come to the task weighting pitch direction more heavily, whereas less successful learners come to the task weighting pitch height more heavily (Chandrasekaran et al., 2010). In an effort to bring less successful learners’ post-training performance more in line with that of successful learners, Chandrasekaran, Yi, Smayda, and Maddox (2016) attempted to increase listeners’ attention to pitch direction. When training listeners to categorize Mandarin lexical tones, listeners were told to (1) attend to pitch height, (2) attend to pitch direction, (3) attend to both pitch direction and pitch height, (4) attend to pitch direction but not pitch height, or (5) were given no additional instructions. Those listeners who were told to attend to pitch direction showed better categorization performance post-training than did those instructed to attend to pitch height (alone or together with direction), who in turn were no different from listeners who received no instruction. The lack of benefit from the pitch height instruction was attributed to native English listeners’ pre-existing bias toward pitch height (Gandour, 1983). Though this was a lexical tone categorization task, and not a word-learning task where the words differ on lexical tone (as in Chandrasekaran et al., 2010, 2011; Wong et al., 2007, 2008; Wong & Perrachione, 2007), it nonetheless suggests that less successful learners’ performance could be improved by orienting them toward pitch direction. In another effort to orient listeners to pitch direction, Liu et al. (2011) tested three methods of teaching listeners to identify lexical tones by directing their attention toward the contour of the pitch.
Simultaneous with auditory presentation of a real word in Mandarin, spoken by a native Mandarin speaker, learners saw a visual presentation of (1) a schematic of the pitch contour and the pinyin representation of the word, (2) the number of the lexical tone and the pinyin representation of the word, or (3) a schematic of the pitch contour. Listeners were instructed to identify the tone. All participants were simultaneously enrolled in a college-level introductory Mandarin class, and participation in the experiment served as additional training in lexical tone identification. The two pinyin conditions showed faster tone learning during training, and the pinyin + pitch contour condition led to the best performance


on the posttest. Because the pinyin was always presented before the auditory stimulus, the authors suggested this allowed listeners to focus exclusively on the tone and to ignore the syllable. They further suggested that schematic representations of the pitch contours led to a more robust representation of pitch direction. Again, this study did not ask listeners to use lexical tone in a lexical context, but it does provide further evidence that orienting learners to the pitch dimension could lead to improved learning outcomes. Returning to training native English listeners to associate word meanings with lexical tones (cf. Wong & Perrachione, 2007), Perrachione, Lee, Ha, and Wong (2011) hypothesized that one potential source of difficulty for less successful listeners was the need to integrate across multiple talkers. As noted above, the high-variability phonetic training paradigm utilizes natural speech tokens produced by multiple talkers (Ingvalson et al., 2014; Iverson, Hazan, & Bannister, 2005; Logan et al., 1991). Previous work had demonstrated that multi-talker paradigms led to reduced accuracy in identifying phonemes (e.g., Nygaard & Pisoni, 1998), possibly due to the increased cognitive cost resulting from the acoustic variability found in a multiple-talker environment (Nusbaum & Magnuson, 1997). Perrachione et al. therefore hypothesized that reducing the amount of acoustic variability across trials by using only one training talker throughout could lead to improved learning success for the less successful learners identified by Wong and Perrachione (2007). Using the phonological pitch sensitivity test developed by Wong and Perrachione, listeners were divided into high aptitude and low aptitude groups based on their likelihood of being successful learners on the multiple-talker paradigm.
Each listener group was then split between the multiple-talker training paradigm used previously (Wong & Perrachione, 2007) and a similar paradigm that differed only in using one talker throughout training. All listeners were trained for eight days. The primary outcome measure was listeners’ ability to identify trained words spoken by novel talkers, called the generalization test. Consistent with the studies in the previous section, low aptitude listeners assigned to the multiple-talker condition showed small gains across training sessions and poor performance on the generalization test relative to high aptitude listeners assigned to the same condition (this is the same pattern of post-training results as found in Chandrasekaran et al., 2010, 2011; Wong et al., 2007, 2008; Wong & Perrachione, 2007). However, the training condition × aptitude group interaction was significant, with high aptitude listeners assigned to multiple-talker training performing better relative to those assigned to single-talker training and, more markedly, low aptitude listeners assigned to single-talker training performing much better than those low aptitude listeners assigned to multiple-talker training. Without explicitly focusing the low aptitude listeners on pitch direction, their performance was nonetheless improved by matching them to a condition that reduced across-trial acoustic variability. Importantly, this study highlighted the need to match both high aptitude and low aptitude listeners to their optimal training conditions, as high aptitude listeners performed best in the traditional high-variability phonetic training paradigm whereas low aptitude listeners had more learning success following single-talker training.
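A crossover interaction of this kind can be summarized as a double difference over the four cell means. The sketch below uses invented accuracy values solely to illustrate the contrast; the numbers are not taken from Perrachione et al. (2011) or any other published study.

```python
# Hypothetical post-training accuracy (proportion correct) for the four
# aptitude × training-condition cells. Values are invented to show the
# qualitative crossover pattern described in the text.
means = {
    ("high", "multi"): 0.80,
    ("high", "single"): 0.70,
    ("low", "multi"): 0.45,
    ("low", "single"): 0.62,
}

# The interaction contrast is the double difference:
# (high,multi - high,single) - (low,multi - low,single).
high_diff = means[("high", "multi")] - means[("high", "single")]
low_diff = means[("low", "multi")] - means[("low", "single")]
interaction = high_diff - low_diff

print(f"high-aptitude benefit of multi-talker training: {high_diff:+.2f}")
print(f"low-aptitude benefit of multi-talker training:  {low_diff:+.2f}")
print(f"interaction contrast: {interaction:+.2f}")
```

With these illustrative numbers, multi-talker training helps the high aptitude group but hurts the low aptitude group, so the two simple effects have opposite signs and the interaction contrast is large: exactly the pattern that motivates matching learners to training paradigms.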


One point we have not yet mentioned in our review of the lexical learning training studies is that the Mandarin tones were superimposed on single syllables that were consistent with English phonotactics (Wong & Perrachione, 2007). This was done to eliminate the need for listeners to learn unfamiliar segmental information while simultaneously learning unfamiliar suprasegmental information; by choosing segments consistent with English phonotactics, listeners could focus on learning the unfamiliar lexical tone. Additionally, though many words in Mandarin are bisyllabic, monosyllables were used to reduce the potential working memory load during training, particularly for less successful learners (e.g., Baddeley, Gathercole, & Papagno, 1998). As a result, there is the question of whether the division into successful and less successful learners will hold when listeners are tasked with bisyllabic stimuli in which a different tone may be present on each syllable (or two identical tones may be realized as two different tones due to tone sandhi; Chang & Kuo, Chap. 7, this volume). Sadakata and McQueen (2014) tested whether Perrachione et al.’s (2011) division into high aptitude and low aptitude listeners would continue to interact with training variability when stimuli were bisyllabic and produced by native Mandarin speakers. Following five days of training on a low-, moderate-, or high-variability paradigm, they found the same training condition × aptitude group interaction as Perrachione and colleagues, in which listeners identified as high aptitude performed best following the high-variability paradigm whereas low aptitude listeners performed best following the low-variability paradigm. It therefore seems reasonable to conclude that the learning mechanisms and training paradigms identified here will translate beyond the pseudo-Mandarin stimuli used for laboratory investigation to real Mandarin words.
Though single-talker training led to improved learning outcomes for low aptitude listeners relative to multiple-talker training, their performance on the final training day and on the generalization test was still not on par with that of high aptitude listeners. One possible means of further improving low aptitude listeners’ learning performance is to draw their attention to pitch direction, as was done by Chandrasekaran et al. (2016) and Liu et al. (2011), albeit in a non-lexical context. This was the approach taken by Ingvalson, Barr, and Wong (2013). As in Perrachione et al. (2011), listeners were again divided into high aptitude and low aptitude groups. Half of each listener group completed single-talker training, chosen to optimize the learning outcomes for the low aptitude group. The remaining listeners were assigned to tone pre-training prior to the lexical training component. In the tone pre-training, listeners heard the same tone–syllable pairings used in the lexical training (originally developed by Wong & Perrachione, 2007), but instead of seeing a picture that represented the word’s meaning, listeners saw an arrow that represented the pitch direction of the tone. Listeners were therefore trained to identify pitch direction in the same stimuli they would later learn to associate with lexical meanings. The single-talker lexical-only group was trained for eight days; the tone-training group was trained for three days on the tone-training portion and then for five days on the lexical training portion, for a total of eight days. As in Perrachione et al., the primary outcome measure was generalization to trained tokens spoken by novel talkers. Following training, the low aptitude group assigned to the tone-training condition performed significantly better


on the generalization test than low aptitude listeners in the lexical-only condition, whereas there was no difference in high aptitude listeners’ performance regardless of condition. In a similar study that sought to train native English listeners to learn words differing on Cantonese tones, Cooper and Wang (2013) also gave listeners pre-training experience identifying Cantonese tones in the context of multiple syllables produced by multiple talkers. Following tone identification training, listeners completed seven sessions of the lexical learning training from Cooper and Wang (2012). Importantly, all the listeners recruited for the tone-training study were non-musicians, allowing Cooper and Wang to compare learning performance to that of the musicians in the lexical-only condition from their earlier study. Following the lexical learning component, the non-musicians who received tone identification pre-training were able to identify words differing on Cantonese lexical tone as well as musicians who had received lexical-only training, and both groups were significantly better than non-musicians who had only received lexical training. Even within the lexical learning context, then, it appears that orienting listeners to pitch direction can substantially improve learning outcomes for less successful learners and may even mimic the effects of earlier musical training.

4.4 Future Directions for Optimizing Individual Outcomes for Lexical Tone Learning

Across all the studies discussed in the preceding sections, the participants were young (generally college-aged) listeners who reported no hearing deficits. In an increasingly globalized society, there is growing pressure for older adults, whose hearing acuity is often lower than that of younger adults (e.g., Humes, 2013), to gain familiarity with a second language (Antoniou, Gunasekera, & Wong, 2013), raising the question of whether older adults will have success mastering unfamiliar phonologies using the paradigms developed for younger adults, including the high-variability phonetic training paradigm. Specific to lexical tone, individuals who have congenital amusia provide an interesting test case for investigating how an inability to detect pitch changes impacts the ability to learn languages that utilize pitch phonemically. Recent work by our group has begun to investigate these areas, and we highlight some of our findings here.

Lexical tone learning by older adults. Relative to younger adults, older adults have more difficulty identifying and discriminating changes in pitch (Shen, Wright, & Souza, 2016). However, older adults also show extensive individual variation in their ability to identify and discriminate pitch patterns, not unlike the individual variation younger adults show on the phonological pitch pattern test developed by Wong and Perrachione (2007). Ingvalson, Nowicki, Zong, and Wong (2017) therefore hypothesized that older adults would show the same listener aptitude group × training condition interaction as the younger adults from Perrachione et al. (2011)


but that the older adults’ performance would be attenuated overall relative to that of younger adults. One of the immediate findings was that many older adults had such difficulty identifying the pitch patterns used in the phonological pitch sensitivity test that, instead of applying the criterion used to divide younger adults into high versus low aptitude listeners, older adults whose performance was significantly better than chance were classified as high aptitude listeners and those whose performance was statistically equivalent to chance were classified as low aptitude listeners. As in Perrachione et al., half of each listener group was assigned to either multiple-talker training or single-talker training to learn to associate Mandarin tone–syllable pairs with lexical meanings; all listeners were trained for eight days. Unlike in younger adults, older adults’ baseline phonological pitch pattern sensitivity was not a significant predictor of training performance or of generalization performance for any listener group or any training type. Instead, baseline measures of auditory working memory and declarative memory were the best predictors of older adults’ generalization performance. Comparing predictors of generalization performance between younger and older adults revealed that younger adults’ generalization performance was best predicted by baseline phonological measures whereas older adults’ performance was best predicted by baseline working and declarative memory measures. This suggests not only that older adults have more difficulty perceiving pitch changes than do younger adults, but that they approach the learning task in a fundamentally different way, in essence attempting to rote-memorize each individual stimulus item rather than extrapolate larger category features (Maddox, Chandrasekaran, Smayda, & Yi, 2013; Maddox, Pacheco, Reeves, Zhu, & Schnyer, 2010).
Paradigms that lead to successful learning by older adults, then, may need to accommodate older adults’ learning strategies by reducing the working memory load across training trials or by inducing a more efficient categorization strategy in older adult learners.

Lexical tone learning in individuals with congenital amusia. Congenital amusia affects approximately 4% of the population, including populations that use tone languages (Wong et al., 2012a, 2012b). Individuals with congenital amusia have difficulty discriminating pitch, detecting changes in pitch, and identifying pitch direction, and they have poor memory for pitch (Ayotte, Peretz, & Hyde, 2002; Hyde & Peretz, 2004; Tillmann et al., 2010, 2011), although some of these abilities seem to be malleable (Liu, Jiang, Francart, Chan, & Wong, 2017). The reader will note that the list of tasks that individuals with congenital amusia find difficult corresponds to the list of skills found to predict lexical tone learning success in typically developing listeners. Given this overlap in skill set, as well as the fact that musicians show enhanced brainstem encoding of both musical and speech stimuli relative to non-musicians (e.g., Parbery-Clark, Skoe, Lam, & Kraus, 2009), Liu, Maggu, Lau, and Wong (2015) sought to investigate how well native speakers of a tone language who have congenital amusia can identify lexical tone and the extent to which this identification performance corresponds to brainstem responses to musical and lexical tone stimuli. Native Cantonese speakers with congenital amusia—identified by poor performance on the Montreal Battery of Evaluation of Amusia (Peretz, Champod, & Hyde, 2003)—were matched to native Cantonese speakers with typical pitch perception abilities on age and amount of musical training. At the brainstem level, there


was no difference in the neural response to musical stimuli or to lexical tone stimuli between the listeners with congenital amusia and the listeners with typical pitch perception abilities. However, the listeners with congenital amusia performed much more poorly than the listeners with typical pitch perception when asked to identify lexical tone. Interestingly, in the listeners with congenital amusia, there was no correlation between the brainstem response to lexical tone and their identification performance (though the correlation did hold for listeners with typical pitch perception, cf. Chandrasekaran et al., 2011). A follow-up study asked native Cantonese-speaking listeners with amusia to again identify lexical tone, to identify pitch direction in a speech syllable and in a piano tone, and then to produce a series of lexical tones and to sing (Liu et al., 2016). Native Cantonese speakers with typical pitch perception also completed all tasks and served as controls. Participants’ productions—both lexical tones and singing—were judged by native Cantonese listeners who were naïve to the experiment and were acoustically analyzed for pitch accuracy and pitch direction. As in the previous study, listeners with congenital amusia were less accurate than listeners with typical pitch perception at identifying lexical tone; they were also less accurate at identifying pitch direction both in speech and in piano tones. The songs produced by participants with congenital amusia were judged to be less accurate, and acoustic measurements demonstrated that their songs included more pitch errors. However, the lexical tone productions of the participants with congenital amusia were no less recognizable, nor were they acoustically distinguishable from those produced by the participants with typical pitch perception.
This pattern of results presents an interesting puzzle for investigators going forward: in native English speakers learning lexical tone, encoding at the brainstem was a predictor of future learning success (Chandrasekaran et al., 2011), yet the brainstem response and tone identification are not associated in individuals with congenital amusia. Further, one of the earliest studies training listeners to identify lexical tone demonstrated that perceptual training was sufficient to improve production performance (Wang, Jongman, & Sereno, 2003a), yet though lexical tone identification is impaired in individuals with congenital amusia, their production is typical. A further exploration of tone perception and tone production by individuals with congenital amusia can be found in Ong, Tan, Chan, and Wong (Chap. 8, this volume). A better understanding of where the dissociation between the brainstem’s encoding and the behavioral response occurs, and of the relationship between perception and production, will provide insight for improving lexical tone learning both for listeners with congenital amusia living in a tone language environment and for native English listeners who may struggle to master lexical tone using the training paradigms identified to date.

4.5 Conclusions

Over the course of this chapter, we have presented a series of studies demonstrating first that it is possible for native English listeners to learn lexical tone, though there


may be some limitations on the ability to do so in a statistical learning context. As is often the case in non-native phoneme learning, there are extensive individual differences in learning outcomes. We traced the sources of these individual differences to individual variation in neuroanatomy, neurophysiology, and genetic expression, which can be assessed via a baseline test of phonological sensitivity for pitch patterns. Additional sources of variability were traced to listeners’ musical background. Recognizing that successful learners place more weight on pitch direction relative to less successful learners, who place more weight on pitch height, efforts that draw less successful learners’ attention to pitch direction can optimize learning success for this group. At the same time, successful learners do best with the traditional multiple-talker paradigm, demonstrating the importance of identifying, at baseline, which listeners are likely to be successful learners and which are likely to be less successful, in order to match learners to the appropriate training paradigm. We ended this chapter by presenting some new challenges for optimizing individual outcomes: older adult learners and individuals with congenital amusia, both of whom appear to approach the task of learning and perceiving lexical tone very differently from the listeners who participated in our previous work. We look forward to continuing to investigate these challenges, and we encourage other researchers to recognize the importance of optimizing learning outcomes for each individual learner.

Acknowledgements The authors would like to acknowledge the support from the National Institutes of Health (USA) (R21DC016069) and the Research Grants Council (HKSAR) (34000118).

References

Alexander, J. A., Wong, P. C., & Bradlow, A. R. (2005). Lexical tone perception in musicians and non-musicians (pp. 397–400). Lisbon, Portugal.
Antoniou, M., Gunasekera, G. M., & Wong, P. C. M. (2013). Foreign language training as cognitive therapy for age-related cognitive decline: A hypothesis for future research. Neuroscience & Biobehavioral Reviews, 37(10, Part 2), 2689–2698. https://doi.org/10.1016/j.neubiorev.2013.09.004.
Ayotte, J., Peretz, I., & Hyde, K. (2002). Congenital amusia: A group study of adults afflicted with a music-specific disorder. Brain: A Journal of Neurology, 125(Pt 2), 238–251.
Baddeley, A., Gathercole, S., & Papagno, C. (1998). The phonological loop as a language learning device. Psychological Review, 105(1), 158–173.
Bradlow, A. R., Akahane-Yamada, R., Pisoni, D. B., & Tohkura, Y. (1999). Training Japanese listeners to identify English /r/ and /l/: Long-term retention of learning in perception and production. Perception and Psychophysics, 61(5), 977–985.
Bradlow, A. R., Pisoni, D. B., Akahane-Yamada, R., & Tohkura, Y. (1997). Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. Journal of the Acoustical Society of America, 101(4), 2299–2310.
Caldwell-Harris, C. L., Lancaster, A., Ladd, D. R., Dediu, D., & Christiansen, M. H. (2015). Factors influencing sensitivity to lexical tone in an artificial language. Studies in Second Language Acquisition, 37(Special Issue 02), 335–357. https://doi.org/10.1017/S0272263114000849.


Chandrasekaran, B., Kraus, N., & Wong, P. C. M. (2011). Human inferior colliculus activity relates to individual differences in spoken language learning. Journal of Neurophysiology, 107(5), 1325–1336. https://doi.org/10.1152/jn.00923.2011.
Chandrasekaran, B., Sampath, P. D., & Wong, P. C. M. (2010). Individual variability in cue-weighting and lexical tone learning. The Journal of the Acoustical Society of America, 128(1), 456–465. https://doi.org/10.1121/1.3445785.
Chandrasekaran, B., Yi, H.-G., Smayda, K. E., & Maddox, W. T. (2016). Effect of explicit dimensional instruction on speech category learning. Attention, Perception, & Psychophysics, 78(2), 566–582. https://doi.org/10.3758/s13414-015-0999-x.
Chang, D., Hedberg, N., & Wang, Y. (2016). Effects of musical and linguistic experience on categorization of lexical and melodic tones. The Journal of the Acoustical Society of America, 139(5), 2432–2447. https://doi.org/10.1121/1.4947497.
Cooper, A., & Wang, Y. (2012). The influence of linguistic and musical experience on Cantonese word learning. The Journal of the Acoustical Society of America, 131(6), 4756–4769. https://doi.org/10.1121/1.4714355.
Cooper, A., & Wang, Y. (2013). Effects of tone training on Cantonese tone-word learning. The Journal of the Acoustical Society of America, 134(2), EL133–EL139. https://doi.org/10.1121/1.4812435.
Dediu, D., & Ladd, D. R. (2007). Linguistic tone is related to the population frequency of the adaptive haplogroups of two brain size genes, ASPM and Microcephalin. Proceedings of the National Academy of Sciences, 104(26), 10944–10949. https://doi.org/10.1073/pnas.0610848104.
Deng, Z., Chandrasekaran, B., Wang, S., & Wong, P. C. M. (2016). Resting-state low-frequency fluctuations reflect individual differences in spoken language learning. Cortex, 76, 63–78. https://doi.org/10.1016/j.cortex.2015.11.020.
Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.
Harmelech, T., & Malach, R. (2013). Neurocognitive biases and the patterns of spontaneous correlations in the human cortex. Trends in Cognitive Sciences, 17(12), 606–615. https://doi.org/10.1016/j.tics.2013.09.014.
Humes, L. E. (2013). Understanding the speech-understanding problems of older adults. American Journal of Audiology, 22, 303–305. https://doi.org/10.1044/1059-0889(2013/12-0066).
Hyde, K. L., & Peretz, I. (2004). Brains that are out of tune but in time. Psychological Science, 15(5), 356–360. https://doi.org/10.1111/j.0956-7976.2004.00683.x.
Ingvalson, E. M., Barr, A. M., & Wong, P. C. M. (2013). Poorer phonetic perceivers show greater benefit in phonetic-phonological speech learning. Journal of Speech, Language, and Hearing Research, 56(3), 1045–1050. https://doi.org/10.1044/1092-4388(2012/12-0024).
Ingvalson, E. M., Ettlinger, M., & Wong, P. C. M. (2014). Bilingual speech perception and learning: A review of recent trends. International Journal of Bilingualism, 18(1), 35–47. https://doi.org/10.1177/1367006912456586.
Ingvalson, E. M., Holt, L. L., & McClelland, J. L. (2012). Can native Japanese listeners learn to differentiate /r–l/ on the basis of F3 onset frequency? Bilingualism: Language and Cognition, 15(2), 255–274. https://doi.org/10.1017/S1366728911000447.
Ingvalson, E. M., Nowicki, C., Zong, A., & Wong, P. C. M. (2017). Non-native speech learning in older adults. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.00148.
Iverson, P., & Evans, B. G. (2009). Learning English vowels with different first-language vowel systems II: Auditory training for native Spanish and German speakers. The Journal of the Acoustical Society of America, 126(2), 866. https://doi.org/10.1121/1.3148196.
Iverson, P., Hazan, V., & Bannister, K. (2005). Phonetic training with acoustic cue manipulations: A comparison of methods for teaching English /r/-/l/ to Japanese adults. Journal of the Acoustical Society of America, 118(5), 3267–3278. https://doi.org/10.1121/1.2062307.
Kempe, V., Bublitz, D., & Brooks, P. J. (2015). Musical ability and non-native speech-sound processing are linked through sensitivity to pitch and spectral information. British Journal of Psychology, 106(2), 349–366. https://doi.org/10.1111/bjop.12092.


Liu, F., Chan, A. H., Ciocca, V., Roquet, C., Peretz, I., & Wong, P. C. (2016). Pitch perception and production in congenital amusia: Evidence from Cantonese speakers. The Journal of the Acoustical Society of America, 140(1), 563–575.
Liu, F., Jiang, C., Francart, T., Chan, A. H. D., & Wong, P. C. M. (2017). Perceptual learning of pitch direction in congenital amusia: Evidence from Chinese speakers. Music Perception: An Interdisciplinary Journal, 34(3), 335–351. https://doi.org/10.1525/mp.2017.34.3.335.
Liu, F., Maggu, A. R., Lau, J. C. Y., & Wong, P. C. M. (2015). Brainstem encoding of speech and musical stimuli in congenital amusia: Evidence from Cantonese speakers. Frontiers in Human Neuroscience, 8, 1029. https://doi.org/10.3389/fnhum.2014.01029.
Liu, Y., Wang, M., Perfetti, C. A., Brubaker, B., Wu, S., & MacWhinney, B. (2011). Learning a tonal language by attending to the tone: An in vivo experiment. Language Learning, 61(4), 1119–1141. https://doi.org/10.1111/j.1467-9922.2011.00673.x.
Lively, S. E., Logan, J. S., & Pisoni, D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: The role of phonetic environment and talker variability in learning new perceptual categories. Journal of the Acoustical Society of America, 94(3), 1242–1255.
Logan, J. S., Lively, S. E., & Pisoni, D. B. (1991). Training Japanese listeners to identify English /r/ and /l/: A first report. Journal of the Acoustical Society of America, 89(2), 874–886.
Maddox, W. T., Chandrasekaran, B., Smayda, K., & Yi, H.-G. (2013). Dual systems of speech category learning across the lifespan. Psychology and Aging, 28(4), 1042–1056. https://doi.org/10.1037/a0034969.
Maddox, W. T., Pacheco, J., Reeves, M., Zhu, B., & Schnyer, D. M. (2010). Rule-based and information-integration category learning in normal aging. Neuropsychologia, 48(10), 2998–3008. https://doi.org/10.1016/j.neuropsychologia.2010.06.008.
Maye, J., Werker, J. F., & Gerken, L. A. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101–B111.
Nusbaum, H. C., & Magnuson, J. S. (1997). Talker normalization: Phonetic constancy as a cognitive process. Talker Variability in Speech Processing, 109–132.
Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception & Psychophysics, 60(3), 355–376.
Ong, J. H., Burnham, D., & Escudero, P. (2015). Distributional learning of lexical tones: A comparison of attended vs. unattended listening. PLoS ONE, 10(7), e0133446.
Parbery-Clark, A., Skoe, E., Lam, C., & Kraus, N. (2009). Musician enhancement for speech-in-noise. Ear & Hearing, 30(6), 653–661.
Peretz, I., Champod, A. S., & Hyde, K. (2003). Varieties of musical disorders. The Montreal battery of evaluation of amusia. Annals of the New York Academy of Sciences, 999, 58–75.
Perrachione, T. K., Lee, J., Ha, L. Y. Y., & Wong, P. C. M. (2011). Learning a novel phonological contrast depends on interactions between individual differences and training paradigm design. Journal of the Acoustical Society of America, 130(1), 461–472. https://doi.org/10.1121/1.3593366.
Sadakata, M., & McQueen, J. M. (2014). Individual aptitude in Mandarin lexical tone perception predicts effectiveness of high-variability training. Frontiers in Psychology, 5, 1318. https://doi.org/10.3389/fpsyg.2014.01318.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928. https://doi.org/10.1126/science.274.5294.1926.
Sebastian-Galles, N., & Soto-Faraco, S. (1999). Online processing of native and non-native phonemic contrasts in early bilinguals. Cognition, 72, 111–123.
Shen, J., Wright, R., & Souza, P. E. (2016). On older listeners’ ability to perceive dynamic pitch. Journal of Speech, Language, and Hearing Research, 59(3), 572. https://doi.org/10.1044/2015_JSLHR-H-15-0228.
Shen, X. (1989). Interplay of the four citation tones and intonation in Mandarin Chinese. Journal of Chinese Linguistics, 17(1), 61–74.

4 Individual Differences in Lexical Tone Learning


Tillmann, B., Jolicœur, P., Ishihara, M., Gosselin, N., Bertrand, O., Rossetti, Y., & Peretz, I. (2010). The amusic brain: Lost in music, but not in space. PLoS ONE, 5(4). https://doi.org/10.1371/journal.pone.0010173.
Tillmann, B., Rusconi, E., Traube, C., Butterworth, B., Umiltà, C., & Peretz, I. (2011). Fine-grained pitch processing of music and speech in congenital amusia. The Journal of the Acoustical Society of America, 130(6), 4089–4096. https://doi.org/10.1121/1.3658447.
Wang, Y., Jongman, A., & Sereno, J. A. (2003a). Acoustic and perceptual evaluation of Mandarin tone productions before and after perceptual training. The Journal of the Acoustical Society of America, 113(2), 1033. https://doi.org/10.1121/1.1531176.
Wang, Y., Sereno, J. A., Jongman, A., & Hirsch, J. (2003b). fMRI evidence for cortical modification during learning of Mandarin lexical tone. Journal of Cognitive Neuroscience, 15(7), 1019–1027.
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. Journal of the Acoustical Society of America, 106(6), 3649–3658.
Werker, J. F., & Stager, C. L. (2000). Developmental changes in infant speech perception and early word learning: Is there a link? In M. Broe & J. Pierrehumbert (Eds.), Papers in laboratory phonology (Vol. 5, pp. 181–193). Cambridge: Cambridge University Press.
Wong, P. C. M., Chandrasekaran, B., & Zheng, J. (2012a). The derived allele of ASPM is associated with lexical tone perception. PLoS ONE, 7(4), e34243. https://doi.org/10.1371/journal.pone.0034243.
Wong, P. C. M., Ciocca, V., Chan, A. H. D., Ha, L. Y. Y., Tan, L. H., & Peretz, I. (2012b). Effects of culture on musical pitch perception. PLoS ONE, 7(4), e33424.
Wong, P. C. M., & Perrachione, T. K. (2007). Learning pitch patterns in lexical identification by native English-speaking adults. Applied Psycholinguistics, 28(04), 565–585. https://doi.org/10.1017/s0142716407070312.
Wong, P. C. M., Perrachione, T. K., & Parrish, T. B. (2007). Neural characteristics of successful and less successful speech and word learning in adults. Human Brain Mapping, 28(10), 995–1006. https://doi.org/10.1002/hbm.20330.
Wong, P. C. M., Warrier, C. M., Penhune, V. B., Roy, A. K., Sadehh, A., Parrish, T. B., & Zatorre, R. J. (2008). Volume of left Heschl’s gyrus and linguistic pitch learning. Cerebral Cortex, 18(4), 828–836. https://doi.org/10.1093/cercor/bhm115.
Wu, H., Ma, X., Zhang, L., Liu, Y., Zhang, Y., & Shu, H. (2015). Musical experience modulates categorical perception of lexical tones in native Chinese speakers. Frontiers in Psychology, 436. https://doi.org/10.3389/fpsyg.2015.00436.

Part II

Neural Representations

Chapter 5

Native and Nonnative Processing of Acoustic and Phonological Information of Lexical Tones in Chinese: Behavioral and Neural Correlates

Keke Yu, Ruiming Wang, and Ping Li

Abstract The perception of acoustic and phonological information in lexical tones is crucial for understanding Chinese words correctly. Past research has considered the linguistic functions of both acoustic and phonological information. However, it has been debated whether Chinese lexical tones are processed in the right or the left hemisphere, and whether different types of information may be handled differently in the two hemispheres. For native Chinese speakers (L1), the acoustic information of tones appears to be processed in the right hemisphere, whereas the phonological information of tones is mostly processed in the left hemisphere. For second language (L2) learners of Chinese, it has been hypothesized that they may show a right-lateralized pattern for processing both acoustic and phonological information at an early stage of Chinese learning, with native-like patterns emerging once their processing of these two types of information reaches a higher level at a later stage. In this chapter, we discuss how these two types of information play their roles in the processing of lexical tones in Chinese by both native speakers and second language learners of Chinese.

K. Yu · R. Wang (B)
Center for Studies of Psychological Application & School of Psychology, South China Normal University, Guangzhou, China
e-mail: [email protected]

P. Li
Department of Chinese and Bilingual Studies, Hong Kong Polytechnic University, Hong Kong, China

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_5

5.1 Introduction

Chinese is a tonal language that uses different categories of tones at the syllable level to differentiate word meanings. According to the linguistic function of lexical tones, there are two different types of information contained in Chinese lexical tones: acoustic information and phonological information. Acoustic information


is the physical acoustic features of pitch, primarily reflected in fundamental frequency (F0), especially pitch height and pitch contour (e.g., Howie, 1976; Xu, 1997). Specifically, pitch height is the relative height of F0, while pitch contour depicts the F0 variation over a whole syllable (Gandour, 1983). Processing of acoustic information mainly reflects bottom-up processing of the input auditory stimuli. Moreover, tones can be further classified into level tones versus contour tones: a level tone maintains the same F0 height with minimal variation in pitch contour over the whole syllable, whereas a contour tone varies in pitch across the syllable. For example, in Mandarin Chinese, Tone 1 is a level tone and the other three tones are contour tones. Unlike the acoustic information of tones, the phonological information of tones refers to the distinct meanings expressed by different categories of tones carried by the same syllable; for example, the syllable /fu/ means “skin” in Tone 1 but “father” in Tone 4. Processing of phonological information is generally top-down, relying on the user’s knowledge or experience of a specific tonal language. Note that the acoustic and phonological information of tones are not fully independent, as the phonological information is conveyed through the acoustic information realized on the syllable. Processing of lexical tones in Chinese speech perception has been extensively explored in behavioral studies over the past several decades. In recent years, researchers have also become interested in the cognitive and neural mechanisms underlying lexical tone processing (see Gandour, 2007; Jongman, Wang, Moore, & Sereno, 2006; Wong, 2002; Zatorre & Gandour, 2008 for reviews; see also other chapters in this volume).
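The acoustic measures just described, pitch height as relative F0 level and pitch contour as F0 movement over the syllable, can be made concrete with a small sketch. The Python snippet below is purely illustrative and not drawn from any study cited here; the F0 values and the threshold for calling a tone "level" are invented for the example.

```python
# Hypothetical sketch: summarizing an F0 track by pitch height (mean F0)
# and pitch contour (F0 excursion), then labelling the tone as level or
# contour. The 10-Hz threshold is an arbitrary, illustrative choice.

def describe_tone(f0_track, level_threshold_hz=10.0):
    """Return (pitch_height, pitch_excursion, label) for an F0 track in Hz."""
    height = sum(f0_track) / len(f0_track)      # pitch height: mean F0
    excursion = max(f0_track) - min(f0_track)   # pitch contour: F0 range
    label = "level" if excursion <= level_threshold_hz else "contour"
    return height, excursion, label

# Stylized examples: a Tone-1-like (high level) and a Tone-2-like (rising) track
tone1 = [220.0, 221.0, 220.5, 220.0, 221.5]     # nearly flat F0
tone2 = [180.0, 190.0, 205.0, 225.0, 250.0]     # rising F0

print(describe_tone(tone1))   # small excursion: labelled "level"
print(describe_tone(tone2))   # large excursion: labelled "contour"
```

A real analysis would of course operate on measured F0 tracks (e.g., extracted with a pitch tracker) rather than hand-written values, and would typically also quantify the direction and timing of the F0 movement.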
Early studies mainly examined tone perception by treating the tone as a whole (e.g., Hallé, Chang, & Best, 2004; Wang, Jongman, & Sereno, 2001), but more recent studies have turned to the interaction between specific acoustic and phonological information in lexical tones during processing (e.g., Xi, Zhang, Shu, Zhang, & Li, 2010; Yu, Wang, Li, & Li, 2014; see also Chap. 3 in this volume). In this chapter, we discuss how acoustic and phonological information in lexical tones is processed by native speakers (henceforth L1 or Chinese speakers) and second language learners of Chinese (henceforth L2 or nonnative Chinese speakers), and the distinct roles played by the two types of information in Chinese lexical tone perception. Our discussion focuses on the neurocognitive correlates of the tone perception processes.

5.2 Hemispheric Lateralization for the Processing of Lexical Tones

Hemispheric lateralization has been a major focus in the study of lexical tones in Chinese: researchers have examined whether one hemisphere is more involved than the other in tone perception.


In earlier studies, there were two competing hypotheses on the brain lateralization of pitch processing: the acoustic hypothesis and the functional hypothesis. The acoustic hypothesis proposes that hemispheric lateralization of auditory processing depends on the acoustic features of the sound. According to Zatorre and colleagues (e.g., Zatorre & Belin, 2001; Zatorre, Belin, & Penhune, 2002), sounds varying spectrally, i.e., precisely but slowly in frequency (such as music), are usually processed in the right hemisphere (RH), whereas sounds varying temporally, i.e., rapidly across a broad band (such as speech), tend to be processed in the left hemisphere (LH). According to this hypothesis, Chinese lexical tones should mainly be processed in the RH because their acoustic properties consist of changes in fundamental frequency. Unlike the acoustic hypothesis, the functional hypothesis holds that the linguistic function of sounds determines the brain’s involvement in sound and speech perception: general acoustic features of sounds tend to be processed in the RH, whereas sounds carrying linguistic/semantic information are processed primarily in the LH (Van Lancker & Fromkin, 1973, 1978; Van Lancker, 1980). According to this hypothesis, because Chinese lexical tones signal different meanings, they should be processed preferentially in the LH. Many cross-language studies on native and nonnative listeners support the functional hypothesis. Wang et al. (2001) conducted a dichotic listening task with Chinese speakers and English speakers who had no tonal experience. Participants were presented with different stimuli in each ear and were required to identify which tone they heard in each ear. In the dichotic listening paradigm, better performance in the right ear reflects LH dominance in processing, whereas better performance in the left ear reflects RH dominance. The results showed that Chinese listeners performed better in the right ear, while English listeners showed no ear preference.
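Ear advantages in such dichotic listening data are commonly quantified with a laterality index computed from per-ear accuracy, where positive values indicate a right-ear (hence LH) advantage. The sketch below is a generic illustration with made-up accuracy numbers, not data from Wang et al. (2001).

```python
# Illustrative laterality index for dichotic listening accuracy.
# The accuracy values below are hypothetical, for demonstration only.

def ear_laterality_index(right_ear_correct, left_ear_correct):
    """Laterality index (R - L) / (R + L).
    Positive: right-ear advantage (left-hemisphere dominance);
    negative: left-ear advantage (right-hemisphere dominance)."""
    return (right_ear_correct - left_ear_correct) / (right_ear_correct + left_ear_correct)

print(ear_laterality_index(0.82, 0.70))   # > 0: right-ear advantage
print(ear_laterality_index(0.74, 0.75))   # near 0: no clear ear preference
```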
Thus, their study suggested that only Chinese speakers showed LH lateralization in the perception of Chinese lexical tones. In another study, Li, Gandour, Wong, and Hutchins (2001) used positron emission tomography (PET) to compare the discrimination of Chinese lexical tones by Chinese speakers and English speakers. They found that Chinese speakers showed increased activity in the left premotor cortex and the left inferior frontal gyrus (IFG), pars opercularis and pars triangularis, during the task, while English speakers showed increased activity in right IFG regions. These results are consistent with Wang et al. (2001) in terms of LH dominance. Although the LH dominance view (the functional hypothesis) and the RH dominance view (the acoustic hypothesis) were each supported by different studies (see Gandour, 2007; Wong, 2002 for reviews), neither hypothesis could fully account for the brain mechanisms of tone perception. The studies supporting the RH-dominant view usually focused on spectral acoustic features but did not take the linguistic function of the lexical tones into account (e.g., Jia, Tsang, Huang, & Chen, 2013; Ren, Yang, & Li, 2009), while the studies providing evidence for the LH-dominant view mainly examined the linguistic processing of lexical tones but confounded the effects of acoustic and linguistic features on lexical tone processing (e.g., Klein, Zatorre, Milner, & Zhao, 2001; Li et al., 2001). Moreover, electroencephalogram (EEG) studies appeared to point to RH dominance for Chinese tone perception. For example, Luo et al. (2006)


studied Chinese speakers with EEG. In their study, participants listened passively to Chinese words with different tones while their EEGs were recorded. The authors analyzed a specific event-related potential (ERP) component, the mismatch negativity (MMN), elicited by the tones (Chap. 6 in this volume reviews MMN studies of lexical tones specifically). The MMN reflects the automatic processing of auditory stimuli; it is usually distributed over frontal-central brain areas and peaks at 100–250 ms after stimulus onset (Näätänen & Alho, 1997; Näätänen, Paavilainen, Rinne, & Alho, 2007). It is generally elicited by the oddball paradigm, which mixes frequent standard stimuli (e.g., 70–90% of total stimuli) with infrequent deviant stimuli (e.g., 10–30% of total stimuli). Researchers generally use MMN mean amplitude and peak latency to investigate the extent and time course of auditory processing. Luo et al.’s results showed that the mean amplitudes of MMNs over the RH were larger than those over the LH, suggesting RH dominance in the processing of Chinese lexical tones. Ren et al. (2009) also used the MMN to investigate Chinese speakers’ automatic processing of lexical tones. Participants listened passively to Chinese lexical tones and to hums. The hums shared the same pitch contour as the Chinese words but contained no segmental information (vowels or consonants). Results from the source localization analysis of the MMN suggested that processing of both types of tone stimuli was lateralized to the RH, unlike what had been found in previous behavioral or imaging studies (e.g., Li et al., 2001; Wang et al., 2001). Rather than simply attributing lexical tone processing to the LH or the RH, Gandour and colleagues put forward another hypothesis (Gandour et al., 2002, 2004; Gandour, 2006): both hemispheres are involved in the processing of lexical tones.
In particular, the RH mainly processes the acoustic features of lexical tones, while the LH is more involved in processing their linguistic features. In Gandour et al.’s (2004) study, Chinese speakers and English speakers completed a discrimination task on Mandarin lexical tones during functional magnetic resonance imaging (fMRI) scanning. Both groups showed activation in the right superior temporal gyrus (STG) and right middle frontal gyrus (MFG). However, Chinese participants showed additional activation in the left intraparietal sulcus, left anterior/posterior STG, and left frontopolar regions. The common RH activation in both groups of listeners may reflect similar acoustic processing of Chinese lexical tones, while the additional LH activation in Chinese speakers may be related specifically to the processing of linguistic features in tones, given that lexical tones signal different meanings for Chinese speakers. Moreover, such a pattern of processing may apply only to lexical tones, as Gandour et al. (2003) suggested that the processing of segmental and suprasegmental features may involve different neural mechanisms. Gandour and colleagues’ hypothesis moved beyond the earlier dichotomous views (LH or RH dominance) and provided a better account of the brain’s involvement in lexical tone perception. It has also encouraged researchers to examine this issue further by considering both the acoustic and the linguistic/phonological information in lexical tones.
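The oddball design and the MMN measures that recur throughout this section can be sketched schematically. The snippet below uses toy numbers, an 85%/15% standard/deviant mix and an invented difference wave, and is not the stimulus list or data of any study reviewed here; it only illustrates how the deviant-minus-standard difference wave yields a mean amplitude and a peak latency within an analysis window such as 100–250 ms.

```python
# Schematic illustration of an oddball sequence and MMN quantification.
# All numbers are invented for the example.
import random

def make_oddball_sequence(n_trials, deviant_prob=0.15, seed=0):
    """Random oddball sequence: 'S' = standard, 'D' = deviant."""
    rng = random.Random(seed)
    return ["D" if rng.random() < deviant_prob else "S" for _ in range(n_trials)]

def mmn_measures(standard_erp, deviant_erp, times_ms, win=(100, 250)):
    """Mean amplitude and peak latency of the deviant-minus-standard
    difference wave in the window. The MMN is a negativity, so the
    peak is taken as the most negative point."""
    diff = [d - s for d, s in zip(deviant_erp, standard_erp)]
    in_win = [(t, v) for t, v in zip(times_ms, diff) if win[0] <= t <= win[1]]
    mean_amp = sum(v for _, v in in_win) / len(in_win)
    peak_latency = min(in_win, key=lambda tv: tv[1])[0]
    return mean_amp, peak_latency

# Toy ERPs sampled every 50 ms from 0 to 300 ms; negativity near 150 ms
times = [0, 50, 100, 150, 200, 250, 300]
standard = [0.0, 0.1, 0.2, 0.1, 0.0, 0.1, 0.0]
deviant = [0.0, 0.1, -0.8, -1.9, -0.9, -0.1, 0.0]

seq = make_oddball_sequence(200)
print(seq.count("D"))                         # roughly 15% of 200 trials
print(mmn_measures(standard, deviant, times)) # negative mean amplitude, peak near 150 ms
```

Real ERP pipelines average hundreds of epochs per condition and sample far more densely; the logic of the difference-wave measures, however, is the same.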


5.3 Native and Nonnative Processing of Acoustic and Phonological Information in Chinese Lexical Tones

In recent years, researchers have begun to consider the processing of both acoustic and phonological information in lexical tones. In this section, we discuss the processing of lexical tones in Chinese first as a native language (L1) and then as a second language (L2).

5.3.1 Native Speakers’ Processing of Chinese Lexical Tones

Several studies have explored the processing of acoustic and phonological information in Chinese lexical tones by comparing within-category and across-category tonal contrasts (e.g., Xi et al., 2010; Yu et al., 2014) or by using artificially generated pitch materials, such as lexical tones superimposed on real and pseudo-syllables (e.g., Shuai & Gong, 2014). Studies of within-category and across-category tonal contrasts are based on the categorical perception paradigm (Liberman, Harris, Hoffman, & Griffith, 1957; Zhang, 2016). In this paradigm, listeners perceive continuous pitch variations as discrete tonal categories (Francis, Ciocca, & Ng, 2003; Hallé et al., 2004; Xu, Gandour, & Francis, 2006), while perceiving variations within a category boundary as the same. Thus, tones with equally spaced F0 can form two types of tonal contrasts: within-category and across-category. Specifically, tonal variations (individual stimuli) in the same category (within-category) are perceived as the same tone and differ only in acoustic information, whereas tonal variations in different categories (across-category) are perceived as different tones and differ in both acoustic and phonological information. Given this paradigm, researchers can investigate the acoustic processing of lexical tones with within-category stimuli, and the phonological processing by comparing the processing of across-category stimuli with that of within-category stimuli. Xi et al. (2010), Zhang, Shu, Zhou, Wang, and Li (2011), and Zhang, Xi, Wu, Shu, and Li (2012) investigated the brain mechanisms involved in lexical tone perception by Chinese speakers with this approach. Xi et al. (2010) examined the processing of acoustic and phonological information in Chinese lexical tones using the MMN at the pre-attentive stage.
The pre-attentive stage is a stage at which stimuli are processed automatically and without attention (Kubovy, Cohen, & Hollier, 1999; Neisser, 1967). Examining pre-attentive processing eliminates potential strategic effects of participants’ attention to the experimental stimuli. Using a speech synthesis program, the authors generated a tonal continuum (11 individual stimuli) from /pa2/ to /pa4/ and chose a within-category contrast and an across-category contrast from the continuum with equal F0 spacing (see Fig. 5.1). Participants were required to watch a movie and ignore the auditory stimuli.


Fig. 5.1 Fundamental frequency (F0) of the stimuli in the /pa2/ to /pa4/ tonal continuum in Xi et al. (2010). Specifically, stimuli 3 and 7 were used as the across-category tonal contrast, while stimuli 7 and 11 were used as the within-category tonal contrast (from Xi et al., 2010, Fig. 2, reproduced with permission from Elsevier)
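The construction behind Fig. 5.1 can be sketched as linear interpolation between two endpoint F0 contours, giving equal F0 spacing between neighboring stimuli. The endpoint values below are invented for illustration (the actual F0 values of Xi et al.’s stimuli are not reproduced here); only the stimulus numbering of the two contrasts follows the figure.

```python
# Illustrative tonal continuum: equal-spaced interpolation between two
# F0 contours. Endpoint contours are hypothetical, not Xi et al.'s values.

def make_continuum(f0_start, f0_end, n_steps=11):
    """Linearly interpolate between two F0 contours (equal F0 spacing)."""
    continuum = []
    for i in range(n_steps):
        w = i / (n_steps - 1)
        continuum.append([a + w * (b - a) for a, b in zip(f0_start, f0_end)])
    return continuum

# Invented endpoint contours (Hz) over five time points in a syllable
tone2_like = [170, 175, 185, 200, 220]   # rising, Tone-2-like
tone4_like = [230, 215, 195, 170, 145]   # falling, Tone-4-like

continuum = make_continuum(tone2_like, tone4_like)

# As in Xi et al. (2010): stimuli 3 and 7 form the across-category pair,
# stimuli 7 and 11 the within-category pair (1-based stimulus numbers).
across_pair = (continuum[2], continuum[6])
within_pair = (continuum[6], continuum[10])
print(len(continuum))
```

The key design property is that the two pairs have the same physical F0 distance; only their position relative to the category boundary differs.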

The results showed that within-category tonal stimuli elicited larger MMN mean amplitudes in the RH than in the LH, while across-category tonal stimuli elicited the opposite pattern: MMN mean amplitudes were larger in the left than in the right hemisphere. The study clearly showed a LH lateralization for the pre-attentive processing of phonological information and a RH lateralization for the pre-attentive processing of acoustic information. Zhang et al. (2012) further investigated the processing of acoustic and phonological information with the same stimuli as Xi et al. (2010), but at the attentive stage, by examining two additional ERP components: N2b and P3b. The attentive stage is a stage at which stimuli are processed with attention (Kubovy et al., 1999; Neisser, 1967). Investigating attentive processing can reveal the roles of attention and decision-making in tone perception. Both N2b and P3b have been identified in previous literature as ERP components sensitive to attention. The N2b is primarily distributed over central-parietal areas and peaks at approximately 200–300 ms after stimulus onset; it mainly reflects deviance detection with attention (Folstein & Petten, 2008; Novak, Ritter, Vaughan, & Wiznitzer, 1990). The P3b usually has a temporal-parietal distribution and peaks at roughly 350–450 ms after stimulus onset; it is associated with attention and subsequent memory processing (Duncan-Johnson & Donchin, 1977; Polich, 2007). Both components have also been shown to reflect acoustic and phonological processing in speech perception at the attentive stage. Zhang et al.’s study likewise used the oddball paradigm, but participants’ task was to detect deviant stimuli by pressing buttons. The results showed that the mean amplitudes of both N2b and P3b elicited by across-category stimuli were larger in the LH than in the RH,


whereas only the mean amplitude of the N2b elicited by within-category stimuli was greater in the RH than in the LH; the mean amplitude of the P3b elicited by within-category stimuli showed no significant difference between the two hemispheres. The study provided reliable ERP evidence (N2b and P3b mean amplitudes) for a LH preference in phonological processing at the attentive stage, as elicited by across-category stimuli. The N2b mean amplitudes elicited by within-category stimuli, in turn, demonstrated a RH preference in acoustic processing, at least within the N2b time window. In an fMRI study, Zhang et al. (2011) explored the specific brain regions involved in the processing of acoustic and phonological information in Chinese lexical tones. They also used the within-category and across-category tone stimuli from Xi et al. (2010) to distinguish between acoustic and phonological information. Participants were instructed to complete a visual task and ignore the tonal stimuli. The results showed stronger activation in the middle portion of the left middle temporal gyrus (mMTG) for across-category stimuli than for within-category stimuli. In contrast, responses in the right Heschl’s gyrus (HG) and STG were stronger for within-category stimuli than for across-category stimuli. In previous literature, the left mMTG has been considered a key brain area for high-level phonological representations, and the right HG/STG has been thought to analyze acoustic, especially pitch, features (Joanisse, Zevin, & Mccandliss, 2007; Liebenthal, Binder, Spitzer, Possing, & Medler, 2005; Meyer, Steinhauer, Kai, Friederici, & Cramon, 2004; Zhang, Shu, Zhou, Wang, & Li, 2010). Thus, these results clearly demonstrated that processing of phonological information in Chinese lexical tones is lateralized to the LH, whereas processing of acoustic information is lateralized to the RH.
In addition to within-category and across-category tonal contrasts, researchers have also examined special speech or sound materials that share F0 features with Chinese lexical tones, such as pseudo-syllables, or hums created by removing the segmental features from speech sounds (e.g., using speech analysis and synthesis software such as Praat, https://www.fon.hum.uva.nl/praat). Although these materials differ from real Chinese words in that they carry no word meanings, researchers can measure the acoustic processing of lexical tones with them and infer the processing of phonological information by comparing them with real Chinese words. For example, Shuai and Gong (2014) used real and pseudo-Mandarin syllables with different lexical tones to differentiate acoustic from phonological information in Chinese lexical tones and explored the processing of these two types of information. Participants completed three tasks: dichotic listening, lexical decision with phonological priming, and semantic violation detection. Two ERP components were analyzed: the P2, which peaks at around 200 ms after stimulus onset and is distributed over central brain areas (Luck, 2005), and the N400, which appears in the 250–550 ms time window and has a central-frontal distribution (Kutas & Federmeier, 2011). These two components can reflect the acoustic and semantic processing of auditory stimuli, respectively (Ackermann, Lutzenberger, & Hertrich, 1999; Dumay, Benraïss, Barriol, & Colin, 2001; Liebenthal et al., 2010; Kutas & Hillyard, 1980). Shuai and Gong’s mean-amplitude results showed that the acoustic processing of lexical tones elicited


larger P2 and N400 in the RH, whereas the phonological processing of lexical tones elicited larger P2 and N400 in the LH. Thus, this study is consistent with the other studies reviewed above regarding acoustic versus phonological processing in the two hemispheres.
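Hum-like stimuli of the kind used by Ren et al. (2009) and Shuai and Gong (2014) keep the pitch contour while discarding segmental content. In practice this is done with speech software such as Praat, as noted above; the snippet below is a much-simplified stand-in that renders a pure sine wave following a given F0 track, with all parameter values chosen purely for illustration.

```python
# Simplified "hum" synthesis: a sine wave whose instantaneous frequency
# follows an F0 track. Not the method of the cited studies; illustrative only.
import math

def synthesize_hum(f0_track, duration_s=0.4, sample_rate=16000):
    """Render samples in [-0.5, 0.5] whose frequency follows f0_track.
    Phase is accumulated so frequency changes do not cause clicks."""
    n_samples = int(duration_s * sample_rate)
    samples, phase = [], 0.0
    for n in range(n_samples):
        # Piecewise-linear interpolation of the F0 track over the duration
        pos = n / (n_samples - 1) * (len(f0_track) - 1)
        i, frac = int(pos), pos - int(pos)
        f0 = f0_track[i] if frac == 0 else (1 - frac) * f0_track[i] + frac * f0_track[i + 1]
        phase += 2 * math.pi * f0 / sample_rate
        samples.append(0.5 * math.sin(phase))
    return samples

hum = synthesize_hum([180, 190, 205, 225, 250])   # rising, Tone-2-like contour
print(len(hum))
```

A pure sine lacks the harmonic structure of voiced speech, which is one reason researchers typically resynthesize hums from the original recordings rather than from scratch.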

5.3.2 Native Versus Nonnative Speakers’ Processing of Chinese Lexical Tones: Cross-Language Comparisons

A comparison between native and nonnative speakers is insightful because both groups can process the acoustic information of lexical tones, but only native speakers have experience with or knowledge of the phonological information in the tones (Chap. 3 in this volume also reviews native versus nonnative studies on tone perception; see that chapter for more details). Taking advantage of this comparison, Klein et al. (2001) conducted a PET study in which both Chinese speakers and English speakers with no tonal experience served as participants. They were required to discriminate a series of Chinese monosyllabic word pairs that differed only in tone. The results showed that for English speakers, the changes in regional cerebral blood flow (rCBF) were confined to right-hemisphere regions, including the ventrolateral frontal cortex, anterior orbitofrontal gyrus, lateral orbital gyrus, cingulate cortex, and STG. Chinese speakers performing the same discrimination task showed additional rCBF changes in several left-hemisphere regions: the ventromedial orbital frontal cortex, frontopolar cortex, pre-central and post-central gyri, inferior and superior parietal cortex, and lateral occipitotemporal and middle occipital gyri. For English speakers, who have no tonal experience, pitch is merely an acoustic feature, not a linguistic one; their brain responses demonstrated RH dominance in the acoustic processing of lexical tones. In addition, by comparing the brain activation of the two groups, the researchers identified a LH preference indexing the phonological processing of lexical tones in the Chinese group. Similarly, Wong, Parsons, Martinez, and Diehl (2004) asked Chinese and English speakers to discriminate Mandarin lexical tones. In their study, the tones were superimposed on Mandarin or English syllables.
Their results showed that Chinese speakers’ strongest activation in discriminating Mandarin words was in the left anterior insular cortex, while their strongest activation in discriminating English words was in the right anterior insula. English speakers, however, showed right anterior insular activation in both the Mandarin and the English conditions. Although the specific regions of brain response differed from those found by Klein et al. (2001), by considering the performance of the native versus nonnative (Chinese versus English) groups, Wong et al. also revealed RH and LH lateralization patterns for the processing of acoustic and phonological information of Chinese lexical tones, respectively. Gandour et al. (2004), as discussed above, found that both Chinese speakers and English speakers showed activation in the right STG and MFG, but Chinese speakers showed additional activation in the intraparietal sulcus, anterior/posterior STG, and frontopolar regions of the left hemisphere. Their study also provided evidence for the


view that the acoustic and phonological information in Chinese lexical tones is lateralized to different hemispheres depending on the type of processing (acoustic or phonological) performed by native versus nonnative speakers. Gandour and colleagues also examined cross-language processing patterns in other tonal languages such as Thai, which similarly supported the view that acoustic versus phonological processing is lateralized differently depending on the type of processing involved (Gandour et al., 2000, 2002). In sum, we cannot simply claim that the processing of Chinese lexical tones is left- or right-lateralized; instead, we need to examine in detail the specific information contained in the lexical tones, that is, acoustic versus phonological information. Studies of both native and nonnative speakers indicate that processing of each specific type of information engages the right versus the left hemisphere differentially, and that the type of processing varies depending on whether listeners are native speakers (who access both types of information in tones) or nonnative speakers (who access only the acoustic information in tones).

5.4 Acoustic and Phonological Processing of Lexical Tones in Chinese as a Second Language

It is clear from the above discussion that nonnative speakers who have no experience with tonal languages show different brain response patterns when processing Chinese lexical tones. But as they gain experience in learning Chinese as a second language, do their brain responses change over time? In this section, we first discuss Chinese learners’ behavioral performance and its neural correlates in Chinese lexical tone perception, and then suggest mechanisms of change for acoustic and phonological processing during learning.

5.4.1 Behavioral Performance in Chinese Lexical Tone Perception

A number of studies have explored how second language learners of Chinese identify and discriminate Chinese lexical tones (e.g., Chandrasekaran, Sampath, & Wong, 2010; Hao, 2012; Wang, Spence, Jongman, & Sereno, 1999; Zhang et al., 2016), as well as their categorical perception of Chinese lexical tones (e.g., Shen & Froud, 2015). Wang et al. (1999) trained English speakers for two weeks to identify Mandarin lexical tones in real Mandarin words; English listeners performed significantly better in tone identification after training, and the improvement was retained even six months later. Hao (2012) found that both English-speaking and Cantonese-speaking learners of Chinese had difficulty discriminating Mandarin


Tone 2 and Tone 3, but Cantonese learners showed additional confusion in discriminating Mandarin Tone 1 and Tone 4. Furthermore, Chandrasekaran et al. (2010) considered the specific acoustic features of pitch height and pitch contour in lexical tones. Their results revealed that good learners of Mandarin lexical tones paid more attention to pitch contour than to pitch height, both in identification and in discrimination tasks, before and after training on the lexical tones. With respect to categorical perception of lexical tones, Shen and Froud (2015) found, using tone identification and discrimination tasks, that English-speaking learners of Mandarin who had taken advanced Mandarin courses showed categorical perception of lexical tones. In a recent study, Zhang et al. (2016) explored the effect of the pitch contour of lexical tones on Mandarin sentence recognition by Japanese learners of Chinese. Japanese listeners performed better on sentences with normal pitch contours than on sentences with flattened pitch; furthermore, when the pitch contours of lexical tones were flattened, semantic context facilitated sentence recognition.

5.4.2 Neural Correlates in Chinese Lexical Tone Perception

In addition to behavioral studies, a number of studies in recent years have examined the neural correlates of Mandarin lexical tone processing by second language learners of Chinese, using short-term learning or training paradigms to identify brain changes based on task-dependent fMRI, resting-state fMRI, and brain structure analysis.

Task-dependent fMRI Studies

A few studies have trained nonnative speakers with no prior tonal experience to associate Mandarin lexical tones with different semantics (sound-to-meaning mapping). For example, Wang, Sereno, Jongman, and Hirsch (2003) conducted an early neuroimaging study that used a short-term training program for English-speaking learners of Mandarin, examining their processing of Mandarin lexical tones with task-dependent fMRI. The participants learned real Chinese monosyllabic words and were required to discriminate between Mandarin lexical tones during fMRI scanning before and after training. The fMRI results showed consistent activation in the bilateral middle temporal gyrus (MTG), bilateral STG, left IFG, and left medial frontal gyrus both pre- and post-training. However, after training (eight sessions, each lasting 40 min), the activated area of the left STG expanded, and the neighboring region of the left STG (Brodmann's area 42, auditory cortex) and the right IFG also became engaged in processing Chinese tones. The expansion of existing areas and the recruitment of new areas reflect neural plasticity arising from learning experience, and these areas may play a major role in processing Mandarin lexical tones for second language learners of Chinese. In Wong, Perrachione, and Parrish (2007), the learning materials were synthetic words with Mandarin lexical tones superimposed on pseudo-syllables in the learners' native language. The corresponding meanings for these words were represented by pictures.
Participants were English speakers who learned pseudo-Chinese words and

5 Native and Nonnative Processing of Acoustic and Phonological …

completed a tone discrimination task in the scanner before and after learning. The authors found that successful learners showed increased activation in the left posterior STG after learning, whereas less successful learners showed increased activation in the right STG and right IFG. Such differences between successful and less successful learners were observed even before learning. Similarly, Yang et al. (2015) trained native English participants to make word–picture associations during six weeks of pseudo-Chinese lexical learning. The new lexical items resembled the syllabic structure and tonal distinctions of Mandarin Chinese. With a pre-test (T1) and post-test (T2) fMRI schedule, the participants were scanned twice in response to the same stimuli. In Yang et al.'s study, one group of learners served as a control group: they did not go through training but were also scanned at T1 and T2. This design allowed the authors to track the neural changes underlying both learners' and non-learners' behavior. A major finding from Yang et al. (2015) was that at T2, after learning, the successful learners, compared with both the less successful learners and the control non-learners, displayed a more connected, better-integrated multi-path neural network with the left STG as a hub. More surprising is the finding that the successful learners, compared with the other two groups, already had a better-connected brain network at T1, that is, before learning took place (see Fig. 5.2). This finding, along with that of Wong et al. (2007), suggests that fMRI brain response patterns can be used not only to distinguish good from poor lexical tone learners but also to predict in advance who might become the more successful learners.
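The predictive logic here, relating a pre-training (T1) brain measure to a later learning outcome, amounts to fitting a simple model from baseline features to attainment. The sketch below is a hedged toy illustration with fabricated numbers; it is not the analysis pipeline or data of Wong et al. (2007) or Yang et al. (2015).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for pre-training (T1) measures: one connectivity value per
# participant (e.g., strength of an STG-centered link). Illustrative only.
n = 20
t1_connectivity = rng.normal(0.5, 0.1, n)
# Simulated learning outcome that partly depends on the T1 measure.
outcome = 2.0 * t1_connectivity + rng.normal(0.0, 0.05, n)

# Ordinary least squares: outcome ~ a * t1_connectivity + b
X = np.column_stack([t1_connectivity, np.ones(n)])
(a, b), *_ = np.linalg.lstsq(X, outcome, rcond=None)

# Pearson correlation between the baseline measure and the outcome.
r = np.corrcoef(t1_connectivity, outcome)[0, 1]
print(f"slope={a:.2f}, r={r:.2f}")  # strong positive relation by construction
```

In the actual studies the "baseline feature" is a pattern of brain activation or connectivity rather than a single number, but the inferential step, baseline measure predicting later success, is the same.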
In another study, Asaridou, Takashima, Dediu, Hagoort, and McQueen (2015) trained Dutch speakers to learn "Dutchinese" words (Dutch pseudowords with Mandarin lexical tones) and used an fMRI repetition suppression paradigm to investigate the Dutch learners' processing of tones before and after training. Repetition suppression is a paradigm for evaluating neural processing proficiency, and

Fig. 5.2 Brain networks of successful learners, less successful learners, and non-learners performing tone discrimination at T1 and T2. Blue lines represent the brain network at T1 (before training), and orange lines the network at T2 (after training). An arrow indicates that the blood oxygen level-dependent (BOLD) activity in one brain region statistically predicts the BOLD activity in another; the width of the arrow indicates the strength of the prediction. IFG, inferior frontal gyrus; MFG, middle frontal gyrus; SMA, supplementary motor area; INS, insula; STG, superior temporal gyrus; IPL, inferior parietal lobule (from Yang et al., 2015, Fig. 5.3, reproduced with permission from Elsevier)

the larger the repetition suppression, the higher the neural processing proficiency (Grill-Spector, Henson, & Martin, 2006). During scanning, participants completed an unrelated task (judging stimulus intensity) while hearing words carrying different tones. The results showed that participants with larger repetition suppression in the left IFG (hence higher neural processing proficiency in this area) learned Mandarin lexical tones better. The role of the left IFG in Mandarin lexical tone perception demonstrated in these fMRI studies is consistent with the earlier findings of Wang et al. (2003), although it remained unclear how the IFG activation patterns related to activity in the STG, a key hub of tone processing implicated in the other studies discussed.

Brain Structure Studies

In addition to the functional brain changes reviewed above, several studies have found structural brain changes resulting from second language learning of Chinese (see Li, Legault, & Litcofsky, 2014, for a review of structural brain changes due to L2 learning, along with a discussion of the various structural brain measures for L2-experience-dependent changes). Crinion et al. (2009) compared brain structure among Chinese speakers, English speakers who were late learners of Mandarin Chinese, and European multilinguals who had no prior experience with any tonal language. Both L1 and L2 speakers of Chinese showed greater gray and white matter density in the posterior region of the left insula and the right anterior temporal lobe (ATL) than the European multilinguals. Because this comparison controlled for the effect of ethnicity, the left insula and right ATL may serve as biomarkers of Mandarin Chinese use for both native speakers and learners of Mandarin Chinese.
In particular, the increases in gray and white matter structures may be related to the processing of Mandarin lexical tones, because pitch variation in Mandarin Chinese is a feature that distinguishes Chinese from European non-tonal languages. Wong et al. (2008) trained English speakers without tonal knowledge to learn words whose meanings were distinguished by Mandarin lexical tones. The authors compared the neuroanatomical differences between successful and less successful learners after training. They found a positive relation between left Heschl's gyrus (HG) volume and training performance; that is, successful learners showed larger left HG volume, especially gray matter volume, than less successful learners, suggesting a significant role of the left HG in tone perception and learning. Schlegel, Rudelson, and Tse (2012) examined white matter (WM) integrity using diffusion tensor imaging (DTI) to study the differences between English speakers who had taken a nine-month intensive Chinese course and those who had not. The results showed higher fractional anisotropy (FA) values for the Chinese learners in WM tracts, especially in the frontal tracts crossing the genu of the corpus callosum (CC), and the FA slope of the Chinese learners correlated positively with the amount of Chinese they learned. As lexical tone is a feature of Chinese but not of English, their study indicated that lexical tone learning was associated with changes in WM structure. Further, Qi, Han, Garel, Chen, and Gabrieli (2015) also examined the WM tracts of Chinese learners, in this case English speakers who received a four-week intensive Chinese course. Results in their

study were consistent with Schlegel et al., showing a positive correlation between WM structures in the RH and Chinese learning performance: the larger the FA in the right superior and inferior longitudinal fasciculi, the better the participants' Chinese learning performance. This study further indicated a potential role of RH WM structures in Chinese lexical tone learning.

Resting-state fMRI Studies

Unlike task-dependent fMRI studies, resting-state fMRI (rs-fMRI) studies measure blood oxygen level-dependent (BOLD) signals while participants are at rest with no task. Such measurement is task-independent and therefore provides a more neutral way to assess brain patterns. Previous studies have found that rs-fMRI responses relate to patterns of task-based fMRI and, more importantly, can provide powerful predictors of individual differences in language learning (see a recent study by Sun, Li, Ding, Wang, & Li, 2019). Although researchers have begun to examine rs-fMRI responses in second language processing (e.g., Veroude, Norris, Shumskaya, Gullberg, & Indefrey, 2010; Ventura-Campos et al., 2013), the use of rs-fMRI in second language learning of Chinese lexical tones is still rare. One study, by Deng, Chandrasekaran, Wang, and Wong (2015), did explore the relationship between low-frequency fluctuations of spontaneous brain activity measured by rs-fMRI before training and learners' performance in Mandarin lexical tone learning. They found that the regional amplitude of low-frequency fluctuation (ALFF) in the left STG was positively correlated with tone learning performance. In addition, the degree and local efficiency of the left STG, based on a graph-theoretical analysis, were also positively correlated with learning performance. These results reflect an important role of the left STG in processing Mandarin lexical tones, consistent with the previous literature using task-based fMRI discussed above.
In short, these neuroimaging studies suggest that a network including the left IFG, STG, and HG is involved in tone processing and could be used to index and predict second language learners' success in learning Chinese lexical tones.
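The graph-theoretical measures mentioned for Deng et al. (2015), a node's degree and its local efficiency, can be sketched on a toy connectivity graph. The region names and edges below are invented for illustration and are not their data; local efficiency is computed in the standard way, as the mean inverse shortest-path length among a node's neighbors within the subgraph those neighbors induce.

```python
from collections import deque
from itertools import combinations

def shortest_path_len(adj, src, dst):
    """BFS shortest-path length in an unweighted graph; None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

def local_efficiency(adj, node):
    """Mean inverse shortest-path length among `node`'s neighbors,
    computed in the subgraph induced by those neighbors."""
    nbrs = adj.get(node, set())
    if len(nbrs) < 2:
        return 0.0
    sub = {u: adj[u] & nbrs for u in nbrs}  # induced subgraph
    total, pairs = 0.0, 0
    for u, v in combinations(sorted(nbrs), 2):
        d = shortest_path_len(sub, u, v)
        total += 0.0 if d is None else 1.0 / d
        pairs += 1
    return total / pairs

# Tiny toy network with 'STG' as a hub (region names are illustrative).
edges = [('STG', 'IFG'), ('STG', 'MFG'), ('STG', 'IPL'), ('IFG', 'MFG')]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(len(adj['STG']))               # degree of STG: 3
print(local_efficiency(adj, 'STG'))  # IFG and MFG are linked; IPL is isolated
```

In an actual rs-fMRI analysis the graph would be built by thresholding a region-by-region correlation matrix; the node metrics themselves are computed exactly as above.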

5.4.3 Processing of Acoustic and Phonological Information for Second Language Learners of Chinese

For second language learners of Chinese, processing of Chinese lexical tones may be a dynamic process. The question is how the neural correlates are reflected in hemispheric lateralization patterns at different learning stages for the processing of acoustic and phonological information of Chinese lexical tones. No study has been designed to examine this question specifically, but the studies reviewed above may be useful in addressing it, as we hypothesize below. At the very early stage, when learners have not yet mastered the phonological information signaled by Chinese lexical tones, their brain responses to lexical tones may be similar to those of nonnative listeners. Based on the cross-language studies of nonnative speakers (Gandour et al., 2004; Klein et al., 2001; Wong et al., 2004; Xu et al., 2006),

we may hypothesize that beginning learners of Chinese predominantly use the RH to process Chinese tones, because at this stage they mostly treat the lexical tones as acoustic information. Once they have learned Chinese to a certain level of proficiency, they may become sensitive to the phonological information in tones and can reliably associate different tones with different word meanings. At this stage, the hemispheric lateralization of acoustic and phonological processing may be altered, as previous studies have indicated (e.g., Wang et al., 2003). Previous neuroimaging studies of lexical tone training have revealed several brain regions that play crucial roles in processing Mandarin lexical tones, including the bilateral IFG (Asaridou et al., 2015; Wang et al., 2003; Wong et al., 2007), left STG (Deng et al., 2015; Wang et al., 2003; Wong et al., 2007; Yang et al., 2015), posterior region of the left insula (Crinion et al., 2009), left HG (Wong et al., 2008), and right ATL (Crinion et al., 2009). These areas have also been implicated in many studies of native Chinese speakers. Specifically, left-hemisphere regions such as the left STG and left IFG have been considered to take part in processing the phonological information in lexical tones (Gandour et al., 2004; Wang et al., 2003; Wong et al., 2004), whereas right-hemisphere regions such as the right ATL and right IFG have been implicated in processing the acoustic information of lexical tones (Gandour et al., 2004; Klein et al., 2001; Li et al., 2001; Zhang et al., 2011). Only at the final stage of learning, when second language learners become fully proficient in Chinese, would we predict that they process Mandarin lexical tones as native Chinese speakers do, showing RH lateralization for processing acoustic information and LH lateralization for processing phonological information of Chinese lexical tones.
The transition to the final stage may also show individual variation in the proficiency learners attain and in the degree to which second language learners truly resemble native speakers in brain response patterns.

5.5 Additional Issues for Further Studies

5.5.1 Time Course for Acoustic and Phonological Processing

A few previous studies have examined the time course, or temporal relation, of the processing of acoustic and phonological information in Chinese lexical tones using the ERP method. Luo et al. (2006) proposed a two-stage model for these two types of processing: for Chinese speakers, only acoustic information is processed at an early pre-attentive stage, and phonological information is processed at a later attentive stage. However, Xi et al. (2010) and Yu, Wang, Li, and Li (2014) showed that both types of information can be processed at the pre-attentive stage. In addition, Xi et al. (2010) found no significant difference between the MMN peak latencies elicited by acoustic and phonological information, which

Fig. 5.3 MMNs elicited by different types of deviant tonal stimuli at electrode locations F3, FZ, and F4. The comparison between across-category and within-category deviants indexes the processing of phonological information in lexical tones, and the comparison between large- and small-interval deviants indexes the processing of acoustic information (from Yu et al., 2014, Fig. 3, reproduced with permission from Frontiers)

revealed that these two types of information may be processed in parallel at the pre-attentive stage. But Yu et al. (2014) further showed that the peak latency of the MMN evoked by phonological information was shorter than that evoked by acoustic information, indicating that phonological information may be processed even earlier than acoustic information (see Fig. 5.3). Zhang et al. (2012) also investigated native speakers' processing of these two types of information at the attentive stage. The peak latencies of both N2b and P3a showed no significant difference between acoustic and phonological information, indicating parallel attentive processing of the two. As discussed earlier, Shuai and Gong (2014) also explored the attentive processing of the two types of information in Chinese listeners. Their results on P2 and N400 peak latencies across different tasks (dichotic listening, lexical decision with phonological priming, and semantic violation) likewise indicated that acoustic and phonological information of lexical tones are processed in parallel at the attentive stage. Although the processing of acoustic and phonological information in Chinese lexical tones appears to occur simultaneously at the attentive stage for Chinese speakers, the time course of these two types of information at the pre-attentive stage remains unclear and should be investigated in the future. It is also unclear what the time course is for second language learners of Chinese when they process these two types of information, which provides new avenues for future research.
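The MMN latency comparisons in this section boil down to subtracting the standard ERP from the deviant ERP and locating the peak of the resulting negativity within a search window. A minimal sketch with simulated waveforms follows; the waveform shape and window values are illustrative, not taken from any cited study.

```python
import numpy as np

def mmn_peak(standard_erp, deviant_erp, times, window=(0.1, 0.25)):
    """Compute the MMN difference wave (deviant minus standard) and return
    its peak (most negative) amplitude and latency within a search window.

    standard_erp, deviant_erp : 1-D arrays of ERP voltage (microvolts).
    times : matching 1-D array of time points in seconds.
    """
    diff = np.asarray(deviant_erp) - np.asarray(standard_erp)
    mask = (times >= window[0]) & (times <= window[1])
    idx = np.flatnonzero(mask)
    peak_i = idx[np.argmin(diff[idx])]  # MMN is a negativity
    return diff[peak_i], times[peak_i]

# Toy ERPs: the deviant carries an extra negativity around 180 ms.
times = np.linspace(0.0, 0.5, 501)
standard = np.zeros_like(times)
deviant = -2.0 * np.exp(-((times - 0.18) ** 2) / (2 * 0.02 ** 2))

amp, lat = mmn_peak(standard, deviant, times)
print(round(amp, 2), round(lat, 3))  # about -2.0 microvolts at about 0.18 s
```

Comparing the latencies returned for an "acoustic" deviant and a "phonological" deviant is exactly the contrast reported by Xi et al. (2010) and Yu et al. (2014).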

5.5.2 Relative Role of Pitch Height and Pitch Contour as Specific Acoustic Information

Pitch height and pitch contour are the two primary acoustic dimensions of pitch (Gandour, 1983). Some studies have investigated the processing of these two types of specific acoustic information in Chinese lexical tones. For example, Chandrasekaran et al. (2007) found that pitch contour was more important for Chinese speakers than for English speakers with no tonal experience when perceiving Chinese lexical tones, revealing that listeners with different language experience weight pitch height and pitch contour differently. Chandrasekaran et al. (2007) further found that the mean amplitude of the MMN elicited by a large pitch contour difference (Mandarin Tone 1 vs. Tone 3) was larger than that elicited by a small pitch contour difference (Mandarin Tone 2 vs. Tone 3), suggesting that pitch contour affects the pre-attentive processing of lexical tones. This difference was not found in English speakers with no tonal experience, further indicating that language experience affects the pre-attentive processing of pitch contour in Chinese lexical tones. Yu et al. (2014) also explored the effect of pitch contour on the processing of Chinese lexical tones at the pre-attentive stage while controlling the phonological information in the tonal stimuli. They examined pitch contour variations both between within-category tonal stimuli (which had the same word meaning) and between across-category tonal stimuli (which had different meanings). In both cases, larger pitch contour variations elicited larger MMN mean amplitudes than smaller ones, but the time course of the MMN did not differ across these pitch contour variations. However, Wang et al.
(2013) examined Chinese speakers' pre-attentive processing of both pitch height and pitch contour in Chinese lexical tones and found that pitch height was processed earlier than pitch contour. With regard to hemispheric lateralization, these authors reported that processing of pitch height was lateralized to the RH, whereas processing of pitch contour showed a tendency toward the LH. In short, these studies show that specific acoustic features such as pitch height and pitch contour play different roles in the processing of Chinese lexical tones, and that this processing is clearly influenced by language experience. However, with the exception of Yu et al. (2014), most studies that focused on the acoustic information in Chinese lexical tones did not consider that the phonological information/semantics may change along with the acoustic variation. For example, Chandrasekaran, Krishnan, et al. (2007) used meaningful Mandarin words with different tones to differentiate large versus small pitch contours, so for Chinese speakers the tonal stimuli differed not only in pitch contour but also in meaning. To address this issue, Yu et al. (2017) simultaneously controlled the pitch features and the phonological information of lexical tones. Their results revealed a strong interaction between pitch type (pitch height and pitch contour) and phonological information in the processing

of lexical tones. In future studies, especially studies of native speakers' processing of acoustic information in Chinese lexical tones, the interaction between pitch type and phonological information should be examined in further detail.
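The two acoustic dimensions discussed in this section can be operationalized in a simple way: summarize pitch height as the mean F0 of a syllable and pitch contour as the fitted slope of its F0 track. This is our own simplification for illustration; the cited studies used richer stimulus manipulations.

```python
import numpy as np

def pitch_features(f0):
    """Separate an F0 track (Hz) into a height and a contour feature.

    Pitch height is summarized as the mean F0; pitch contour as the slope
    of a straight-line fit over normalized time (Hz per syllable duration).
    """
    f0 = np.asarray(f0, dtype=float)
    t = np.linspace(0.0, 1.0, len(f0))
    slope, _intercept = np.polyfit(t, f0, 1)
    return f0.mean(), slope

# Illustrative level (Tone-1-like) and rising (Tone-2-like) contours:
level = np.full(50, 200.0)
rising = np.linspace(160.0, 240.0, 50)

print(pitch_features(level))   # same mean height as the rising contour...
print(pitch_features(rising))  # ...but a very different contour slope
```

The two toy contours share the same pitch height (200 Hz) yet differ entirely in contour, which is exactly the kind of dissociation the MMN studies above exploit.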

5.6 Conclusion

Chinese lexical tones are not merely acoustic variations in pitch; they also serve the linguistic function of signaling different word meanings. In this regard, Chinese lexical tones carry two types of information: acoustic information and phonological information. Dissociating these two types of tonal information contributes to a better understanding of lexical tone processing and informs us of the underlying neurocognitive mechanisms. In this chapter, we have discussed how these two types of information are processed by native Chinese speakers, by nonnative speakers who have no prior knowledge of tonal languages, and by second language learners of Chinese who gradually gain proficiency in lexical tones. We also reviewed the issue of hemispheric lateralization in acoustic and phonological processing, along with other issues such as the time course of these two types of processing in current and future studies. For Chinese speakers, we have clear evidence that acoustic information is processed mainly in the right hemisphere, whereas phonological information is lateralized to the left hemisphere. In addition, during the attentive processing stage, acoustic and phonological information are processed in parallel. With regard to second language learners of Chinese, it is hypothesized that beginning learners with low proficiency still process Chinese lexical tones in the right hemisphere, but advanced learners with high proficiency can achieve native-like patterns, processing acoustic and phonological information in the right and left hemispheres, respectively. Further investigation of this latter group of learners will also allow us to see the transition and individual differences. Studies of Chinese lexical tones are growing, but we are only at the beginning of understanding the neurocognitive mechanisms underlying Chinese tone processing.
Several questions need to be pursued in future studies. First, as we have alluded to, the processing of acoustic and phonological information relies not on a single brain region in a single hemisphere but on a brain network in which different brain regions work together. Although this issue has been raised in previous work on the processing of lexical tones (e.g., Deng et al., 2015; Gandour, 2007; Yang et al., 2015), the functional or effective connectivity patterns involved in specific tonal processing require further investigation. Furthermore, for second language learners of Chinese, we need more reliable evidence and more longitudinal studies to test the hypotheses about hemispheric lateralization in acoustic and phonological processing of Chinese lexical tones. Finally, we need to translate our understanding of the neurocognitive mechanisms of lexical tone processing into practical pedagogical use, and to provide information that helps educators develop effective teaching and learning programs for Chinese, given the increasing number of people around the world

who are learning Chinese as a second language. How this translation can be done remains to be carefully examined in the future.

Acknowledgements This research was supported by the Foundation for Innovation Team in Guangdong Higher Education (2015WCXTD003) and the Guangdong Province Universities and Colleges Pearl River Younger Scholar Funded Scheme (2016) to Ruiming Wang. Research support was also provided by NSF grants (BCS-1533625; BCS-1349110) to Ping Li. We thank three anonymous reviewers for comments on the manuscript.

References

Ackermann, H., Lutzenberger, W., & Hertrich, I. (1999). Hemispheric lateralization of the neural encoding of temporal speech features: A whole-head magnetencephalography study. Cognitive Brain Research, 7(4), 511–518.
Asaridou, S. S., Takashima, A., Dediu, D., Hagoort, P., & McQueen, J. M. (2015). Repetition suppression in the left inferior frontal gyrus predicts tone learning performance. Cerebral Cortex, 26(6), 2728–2742.
Chandrasekaran, B., Gandour, J. T., & Krishnan, A. (2007). Neuroplasticity in the processing of pitch dimensions: A multidimensional scaling analysis of the mismatch negativity. Restorative Neurology & Neuroscience, 25(3–4), 195–210.
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128(1), 148–156.
Chandrasekaran, B., Sampath, P. D., & Wong, P. C. M. (2010). Individual variability in cue-weighting and lexical tone learning. Journal of the Acoustical Society of America, 128(1), 456–465.
Crinion, J. T., Green, D. W., Chung, R., Ali, N., Grogan, A., Price, G. R., & Price, C. J. (2009). Neuroanatomical markers of speaking Chinese. Human Brain Mapping, 30(12), 4108–4115.
Deng, Z., Chandrasekaran, B., Wang, S., & Wong, P. C. M. (2015). Resting-state low-frequency fluctuations reflect individual differences in spoken language learning. Cortex, 76, 63–78.
Dumay, N., Benraïss, A., Barriol, B., & Colin, C. (2001). Behavioral and electrophysiological study of phonological priming between bisyllabic spoken words. Journal of Cognitive Neuroscience, 13(1), 121–143.
Duncan-Johnson, C. C., & Donchin, E. (1977). On quantifying surprise: The variation of event-related potentials with subjective probability. Psychophysiology, 14(5), 456–467.
Folstein, J. R., & Van Petten, C. (2008). Influence of cognitive control and mismatch on the N2 component of the ERP: A review. Psychophysiology, 45(1), 152–170.
Francis, A. L., Ciocca, V., & Ng, B. K. C. (2003). On the (non)categorical perception of lexical tones. Perception & Psychophysics, 65(7), 1029–1044.
Gandour, J. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.
Gandour, J., Wong, D., Hsieh, L., Weinzapfel, B., Van Lancker, D., & Hutchins, G. D. (2000). A cross-linguistic PET study of tone perception. Journal of Cognitive Neuroscience, 12(1), 207–222.
Gandour, J., Wong, D., Lowe, M., Dzemidzic, M., Satthamnuwong, N., Tong, Y., & Li, X. (2002). A cross-linguistic fMRI study of spectral and temporal cues underlying phonological processing. Journal of Cognitive Neuroscience, 14(7), 1076–1087.
Gandour, J., Xu, Y., Wong, D., Dzemidzic, M., Lowe, M., Li, X., & Tong, Y. (2003). Neural correlates of segmental and tonal information in speech perception. Human Brain Mapping, 20(4), 185–200.
Gandour, J., Tong, Y., Wong, D., Talavage, T., Dzemidzic, M., Xu, Y., et al. (2004). Hemispheric roles in the perception of speech prosody. Neuroimage, 23(1), 344–357.

Gandour, J. (2006). Brain mapping of Chinese speech prosody. In P. Li, L. H. Tan, E. Bates, & O. J. L. Tzeng (Eds.), Handbook of East Asian psycholinguistics (Vol. 1: Chinese, pp. 308–319). Cambridge, UK: Cambridge University Press.
Gandour, J. (2007). Neural substrates underlying the perception of linguistic prosody. Tones and Tunes, 2, 3–25.
Grill-Spector, K., Henson, R., & Martin, A. (2006). Repetition and the brain: Neural models of stimulus-specific effects. Trends in Cognitive Sciences, 10(1), 14–23.
Hallé, P. A., Chang, Y. C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395–421.
Hao, Y. C. (2012). Second language acquisition of Mandarin Chinese tones by tonal and non-tonal language speakers. Journal of Phonetics, 40(2), 269–279.
Howie, J. (1976). Acoustical studies of Mandarin vowels and tones. Cambridge: Cambridge University Press.
Jia, S., Tsang, Y. K., Huang, J., & Chen, H. C. (2013). Right hemisphere advantage in processing Cantonese level and contour tones: Evidence from dichotic listening. Neuroscience Letters, 556, 135–139.
Joanisse, M. F., Zevin, J. D., & McCandliss, B. D. (2007). Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using fMRI and a short-interval habituation trial paradigm. Cerebral Cortex, 17(9), 2084–2093.
Jongman, A., Wang, Y., Moore, C., & Sereno, J. (2006). Perception and production of Mandarin tone. In P. Li, L. H. Tan, E. Bates, & O. J. L. Tzeng (Eds.), Handbook of East Asian psycholinguistics (Vol. 1: Chinese, pp. 209–217). Cambridge, UK: Cambridge University Press.
Klein, D., Zatorre, R. J., Milner, B., & Zhao, V. (2001). A cross-linguistic PET study of tone perception in Mandarin Chinese and English speakers. Neuroimage, 13(4), 646–653.
Kubovy, M., Cohen, D. J., & Hollier, J. (1999). Feature integration that routinely occurs without focal attention. Psychonomic Bulletin & Review, 6(2), 183–203.
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62(1), 621–647.
Kutas, M., & Hillyard, S. A. (1980). Reading senseless sentences: Brain potentials reflect semantic incongruity. Science, 207, 203–208.
Lee, C.-Y., & Cheng, Y. (2020). Neurophysiological studies of Mandarin lexical tone acquisition in early childhood. In H.-M. Liu, F.-M. Tsao, & P. Li (Eds.), Speech learning, perception, and production: Multidisciplinary approaches in Chinese language research. Singapore: Springer.
Lee, C.-Y., & Wiener, S. (2020). Processing lexical tone in speech perception and spoken word recognition: The roles of acoustic variability and linguistic knowledge. In H.-M. Liu, F.-M. Tsao, & P. Li (Eds.), Speech learning, perception, and production: Multidisciplinary approaches in Chinese language research. Singapore: Springer.
Li, H., Gandour, J., Wong, D., & Hutchins, G. D. (2001). Functional heterogeneity of inferior frontal gyrus is shaped by linguistic experience. Brain and Language, 76(3), 227–252.
Li, P., Legault, J., & Litcofsky, K. A. (2014). Neuroplasticity as a function of second language learning: Anatomical changes in the human brain. Cortex, 58, 301–324.
Liberman, A. M., Harris, K. S., Hoffman, H. S., & Griffith, B. C. (1957). The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54(5), 358–368.
Liebenthal, E., Binder, J. R., Spitzer, S. M., Possing, E. T., & Medler, D. A. (2005). Neural substrates of phonemic perception. Cerebral Cortex, 15(10), 1621–1631.
Liebenthal, E., Desai, R., Ellingson, M. M., Ramachandran, B., Desai, A., & Binder, J. R. (2010). Specialization along the left superior temporal sulcus for auditory categorization. Cerebral Cortex, 20(12), 2958–2970.
Luck, S. (2005). An introduction to the event-related potential technique. Cambridge, MA: MIT Press.

Luo, H., Ni, J. T., Li, Z. H., Li, X. O., Zhang, D. R., Zeng, F. G., & Chen, L. (2006). Opposite patterns of hemisphere dominance for early auditory processing of lexical tones and consonants. Proceedings of the National Academy of Sciences, 103(51), 19558–19563. Meyer, M., Steinhauer, K., Kai, A., Friederici, A. D., & Cramon, D. Y. V. (2004). Brain activity varies with modulation of dynamic pitch variance in sentence melody. Brain and Language, 89(2), 277–289. Näätänen, R., & Alho, K. (1997). Mismatch negativity–The measure for central sound representation accuracy. Audiology and Neurotology, 2(5), 341–353. Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: a review. Clinical Neurophysiology, 118(12), 2544–2590. Neisser, U. (1967). Cognitive psychology. New York: Appleton-Century-Crofts. Novak, G. P., Ritter, W., Vaughan, H. G., & Wiznitzer, M. L. (1990). Differentiation of negative event-related potentials in an auditory discrimination task. Electroencephalography & Clinical Neurophysiology, 75(4), 255–275. Polich, J. (2007). Updating P300: An integrative theory of P3a and P3b. Clinical Neurophysiology, 118(10), 2128–2148. Qi, Z., Han, M., Garel, K., Chen, E. S., & Gabrieli, J. D. (2015). White-matter structure in the right hemisphere predicts Mandarin Chinese learning success. Journal of Neurolinguistics, 33, 14–28. Ren, G. Q., Yang, Y., & Li, X. (2009). Early cortical processing of linguistic pitch patterns as revealed by the mismatch negativity. Neuroscience, 162(1), 87–95. Schlegel, A. A., Rudelson, J. J., & Peter, U. T. (2012). White matter structure changes as adults learn a second language. Journal of Cognitive Neuroscience, 24(8), 1664–1670. Shen, G., & Froud, K. (2015). Neurophysiological correlates of perceptual learning of Mandarin Chinese lexical tone categories: An event-related potential study. Journal of the Acoustical Society of America, 137(4), 2384–2384. 
Shuai, L., & Gong, T. (2014). Temporal relation between top-down and bottom-up processing in lexical tone perception. Frontiers in Behavioral Neuroscience, 8, 97.
Sun, X., Li, L., Ding, G., Wang, R., & Li, P. (2019). Effects of language proficiency on cognitive control: Evidence from resting-state functional connectivity. Neuropsychologia, 129, 263–275.
Van Lancker, D. (1980). Cerebral lateralization of pitch cues in the linguistic signal. Research on Language & Social Interaction, 13(2), 201–277.
Van Lancker, D., & Fromkin, V. A. (1973). Hemispheric specialization for pitch and “tone”: Evidence from Thai. Journal of Phonetics, 1, 101–109.
Van Lancker, D., & Fromkin, V. A. (1978). Cerebral dominance for pitch contrasts in tone language speakers and in musically untrained and trained English speakers. Journal of Phonetics, 6, 19–23.
Ventura-Campos, N., Sanjuán, A., González, J., Palomar-García, M. Á., Rodríguez-Pujadas, A., Sebastián-Gallés, N., et al. (2013). Spontaneous brain activity predicts learning ability of foreign sounds. Journal of Neuroscience, 33(22), 9295–9305.
Veroude, K., Norris, D. G., Shumskaya, E., Gullberg, M., & Indefrey, P. (2010). Functional connectivity between brain regions involved in learning words of a new language. Brain and Language, 113(1), 21–27.
Wang, X. D., Wang, M., & Chen, L. (2013). Hemispheric lateralization for early auditory processing of lexical tones: Dependence on pitch level and pitch contour. Neuropsychologia, 51(11), 2238–2244.
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. Journal of the Acoustical Society of America, 106(6), 3649–3658.
Wang, Y., Jongman, A., & Sereno, J. A. (2001). Dichotic perception of Mandarin tones by Chinese and American listeners. Brain and Language, 78(3), 332–348.
Wang, Y., Sereno, J. A., Jongman, A., & Hirsch, J. (2003). fMRI evidence for cortical modification during learning of Mandarin lexical tone. Journal of Cognitive Neuroscience, 15(7), 1019–1027.
Wong, P. C. M. (2002). Hemispheric specialization of linguistic pitch patterns. Brain Research Bulletin, 59(2), 83–95.


Wong, P. C. M., Parsons, L. M., Martinez, M., & Diehl, R. L. (2004). The role of the insular cortex in pitch pattern perception: The effect of linguistic contexts. Journal of Neuroscience, 24(41), 9153–9160.
Wong, P. C. M., Perrachione, T. K., & Parrish, T. B. (2007). Neural characteristics of successful and less successful speech and word learning in adults. Human Brain Mapping, 28(10), 995–1006.
Wong, P. C. M., Warrier, C. M., Penhune, V. B., Roy, A. K., Sadehh, A., Parrish, T. B., et al. (2008). Volume of left Heschl’s gyrus and linguistic pitch learning. Cerebral Cortex, 18(4), 828–836.
Xi, J., Zhang, L., Shu, H., Zhang, Y., & Li, P. (2010). Categorical perception of lexical tones in Chinese revealed by mismatch negativity. Neuroscience, 170(1), 223–231.
Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25(1), 61–83.
Xu, Y., Gandour, J. T., & Francis, A. L. (2006). Effects of language experience and stimulus complexity on the categorical perception of pitch direction. Journal of the Acoustical Society of America, 120(2), 1063–1074.
Xu, Y., Gandour, J. T., Talavage, T., Wong, D., Dzemidzic, M., Tong, Y., et al. (2006). Activation of the left planum temporale in pitch processing is shaped by language experience. Human Brain Mapping, 27(2), 173–183.
Yang, J., Gates, K. M., Molenaar, P., & Li, P. (2015). Neural changes underlying successful second language word learning: An fMRI study. Journal of Neurolinguistics, 33, 29–49.
Yu, K., Wang, R., Li, L., & Li, P. (2014). Processing of acoustic and phonological information of lexical tones in Mandarin Chinese revealed by mismatch negativity. Frontiers in Human Neuroscience, 8, 729.
Yu, K., Zhou, Y., Li, L., Su, J., Wang, R., & Li, P. (2017). The interaction between phonological information and pitch type at pre-attentive stage: An ERP study of lexical tones. Language, Cognition and Neuroscience, 32(9), 1164–1175.
Zatorre, R. J., & Belin, P. (2001). Spectral and temporal processing in human auditory cortex. Cerebral Cortex, 11(10), 946–953.
Zatorre, R. J., Belin, P., & Penhune, V. B. (2002). Structure and function of auditory cortex: Music and speech. Trends in Cognitive Sciences, 6(1), 37–46.
Zatorre, R. J., & Gandour, J. T. (2008). Neural specializations for speech and pitch: Moving beyond the dichotomies. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493), 1087–1104.
Zhang, L., Shu, H., Zhou, F., Wang, X., & Li, P. (2010). Common and distinct neural substrates for the perception of speech rhythm and intonation. Human Brain Mapping, 31(7), 1106–1116.
Zhang, L., Xi, J., Xu, G., Shu, H., Wang, X., & Li, P. (2011). Cortical dynamics of acoustic and phonological processing in speech perception. PLoS ONE, 6(6), e20963.
Zhang, L., Xi, J., Wu, H., Shu, H., & Li, P. (2012). Electrophysiological evidence of categorical perception of Chinese lexical tones in attentive condition. NeuroReport, 23(1), 35–39.
Zhang, L., Li, Y., Wu, H., Li, X., Shu, H., Zhang, Y., & Li, P. (2016). Effects of semantic context and fundamental frequency contours on Mandarin speech recognition by second language learners. Frontiers in Psychology, 7, 908.
Zhang, Y. (2016). Categorical perception. In R. Sybesma, W. Behr, Y. Gu, Z. Handel, J. Huang, & J. Myers (Eds.), Encyclopedia of Chinese language and linguistics. Leiden, Netherlands: Brill.

Chapter 6

Neurophysiological Studies of Mandarin Lexical Tone Acquisition in Early Childhood Chia-Ying Lee and Ying-Ying Cheng

Abstract Mismatch negativity (MMN) is an event-related potential (ERP) component used as an index for automatic auditory change detection. MMN can be elicited even when the participant does not pay attention to the stimuli (e.g., while they are reading a book or watching a silent movie). Thus, MMN serves as an excellent tool for assessing auditory discrimination, especially in infants and children with limited attention or motivation. Although MMN is well established in adults, the polarity and latency of mismatch responses (MMRs) in infants are highly inconsistent across studies. This chapter aims to provide a comprehensive review of a series of MMN studies for Mandarin lexical tone and to understand the effects of age and degree of deviance on MMRs in infancy and early childhood. The findings here suggest that MMN and positive MMR index different functional characteristics and may provide information on when and how speech perception becomes automatic at different developmental stages in children. The transition from positive to negative MMRs may serve as a neural marker for the early identification of atypical language development in children.

C.-Y. Lee (B) · Y.-Y. Cheng
The Institute of Linguistics, Academia Sinica, Taipei, Taiwan
e-mail: [email protected]

C.-Y. Lee
Institute of Cognitive Neuroscience, National Central University, Jhongli City, Taoyuan County, Taiwan

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_6

6.1 Background

The ability to produce and understand language is a distinctive characteristic that separates humans from other species. There are around 5000–7000 spoken languages in the world. Each language uses a specific set of phonemes (such as vowels and consonants) to form syllables. It is generally agreed that infants are born with the capacity to learn any language in the world. However, exposure to an ambient language, starting in the womb, may alter phonetic perception shortly after birth. For


example, Moon, Lagercrantz, and Kuhl (2013) found that newborn infants responded differently to native and non-native vowels and suggested that neonates are capable of learning phonetic contrasts in the womb. A large body of evidence has shown that infants between 6 and 12 months of age show improved perceptual sensitivity to native phonetic contrasts and reduced sensitivity to non-native ones (Cheour et al., 1998; Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992; Polka & Werker, 1994; Rivera-Gaxiola, Silva-Pereyra, & Kuhl, 2005; Werker & Tees, 1984). These findings suggest that, during the first year of life, early experiences with language shape the perception and production of speech sounds, turning infants into language-bound listeners. Moreover, speech perception ability measured in early infancy may predict later language development (Guttorm, Leppänen, Hamalainen, Eklund, & Lyytinen, 2010; Guttorm et al., 2005; Molfese & Molfese, 1985, 1997). Thus, a better understanding of the milestones of language acquisition may shed light on early identification of and intervention for language and reading problems in children. Recent advances in noninvasive brain techniques in cognitive neuroscience, including electroencephalography (EEG)/event-related potentials (ERPs), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI), have allowed researchers to examine language processing in the infant brain and thus to investigate language acquisition at the earliest developmental stage of life. This chapter reviews a series of electrophysiological studies that investigate developmental changes, from early infancy to childhood, in discriminating lexical tones, an essential suprasegmental feature of syllables in tonal languages.
The ultimate goal of this chapter is to provide a better understanding of developmental changes in speech perception in early childhood and the relationship between these changes and later language and reading acquisition.

6.2 Mismatch Negativity (MMN) and Positive Mismatch Responses (P-MMR)

Näätänen et al. (1997) demonstrated that speech-sound representations can be probed using a cortical response called mismatch negativity (MMN), an ERP component that reflects automatic auditory change detection. MMN is typically obtained in a passive auditory oddball paradigm, wherein a stimulus deviating in some sound feature occurs infrequently in a sequence of repetitive homogeneous (standard) stimuli (Näätänen, Kujala, & Winkler, 2011; Näätänen, Paavilainen, Rinne, & Alho, 2007). In adults, MMN is usually observed, by subtracting the ERPs for the standard stimuli from those for the deviant, as a frontally distributed negativity peaking between 100 and 250 ms after stimulus onset. MMN is mainly generated in the auditory cortex for any discriminable change in simple acoustic and complex phonetic stimulus features (Alho, 1995; Hsu, Lin, Hsu, & Lee, 2014). MMN amplitude increases, whereas peak latency decreases, with increasing magnitude of stimulus deviation


(Näätänen et al., 2007; Sams, Paavilainen, Alho, & Näätänen, 1985). The accuracy of behavioral sound discrimination has been shown to correlate strongly with MMN amplitude in both normal and clinical populations (Kraus et al., 1993; Lang et al., 1995). Most importantly, the MMN can be elicited even when a participant is not paying attention to the stimuli (e.g., when reading a book or watching a silent movie). Thus, MMN has been proposed as an index of auditory change detection at the pre-attentive stage and may serve as an excellent tool for assessing auditory discrimination, especially in infants and children with limited attention or motivation. Using the MMN paradigm, Cheour and colleagues (1998) reported that Finnish 12-month-old infants showed an enhanced MMN response to their native vowel contrast compared with their response to a non-native Estonian vowel contrast, even though the non-native contrast involved a larger acoustic difference (Cheour et al., 1998). This finding is congruent with behavioral evidence and suggests that language-specific speech-sound memory traces develop between 6 and 12 months of age. It is worth noting that, although studies have reported MMN in infancy in response to pitch changes (Cheour et al., 2002), duration changes (Brannon, Libertus, Meck, & Woldorff, 2008; Brannon, Roussel, Meck, & Woldorff, 2004), and phonetic changes (Cheour-Luhtanen et al., 1995; Cheour et al., 1998; Kushnerenko et al., 2001; Martynova, Kirjavainen, & Cheour, 2003), infant MMN usually persists over a longer interval and in a relatively late time window compared with the typical MMN in adults.
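The subtraction logic described above (deviant-minus-standard difference wave, with the MMN taken as the most negative peak in a latency window) can be sketched in a few lines. This is a minimal illustration on simulated data, not code from any study reviewed here; the array shapes, sampling rate, noise level, and the 100–250 ms window are assumptions chosen for the example.

```python
import numpy as np

def mismatch_response(deviant_epochs, standard_epochs, times, window=(0.100, 0.250)):
    """Compute the difference wave (deviant minus standard averaged ERPs)
    and locate its most negative peak (the MMN) within a latency window.

    deviant_epochs, standard_epochs: (n_trials, n_samples) arrays in volts.
    times: (n_samples,) latencies in seconds.
    window: latency window in seconds (an adult MMN window is assumed here).
    """
    diff = deviant_epochs.mean(axis=0) - standard_epochs.mean(axis=0)
    mask = (times >= window[0]) & (times <= window[1])
    i = np.argmin(diff[mask])  # most negative point in the window
    return diff, diff[mask][i], times[mask][i]

# Toy data: the deviant carries an extra negativity peaking near 180 ms.
fs = 1000
times = np.arange(-0.1, 0.5, 1.0 / fs)
rng = np.random.default_rng(0)
standard = rng.normal(0.0, 1e-6, (100, times.size))
bump = -3e-6 * np.exp(-((times - 0.18) ** 2) / (2 * 0.02 ** 2))
deviant = rng.normal(0.0, 1e-6, (100, times.size)) + bump
diff, peak_amp, peak_lat = mismatch_response(deviant, standard, times)
```

With 1 µV of independent trial noise averaged over 100 trials, the difference wave recovers the simulated negativity, so `peak_amp` is clearly negative and `peak_lat` falls near 180 ms.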
Meanwhile, many studies have reported a positive mismatch response (P-MMR) between 200 and 400 ms, instead of MMN, in response to speech and non-speech changes in infants (Dehaene-Lambertz & Baillet, 1998; Dehaene-Lambertz & Dehaene, 1994; Friederici, Friedrich, & Weber, 2002; Jing & Benasich, 2006; Leppänen, Eklund, & Lyytinen, 1997; Morr, Shafer, Kreuzer, & Kurtzberg, 2002; Novitski, Huotilainen, Tervaniemi, Näätänen, & Fellman, 2007). For example, Leppänen et al. (1997) observed a P-MMR peaking between 250 and 350 ms in newborns in response to a pure-tone change. Dehaene-Lambertz and Dehaene (1994) reported that 3-month-old infants showed a P-MMR peaking at approximately 390 ms in response to initial consonant changes (/ba/ vs. /ga/). Friederici et al. (2002) examined the mismatch response to syllables varying in vowel duration (short /ba/ vs. long /ba:/) in 2-month-old infants and found a P-MMR peaking at approximately 400 ms, especially when the long syllable served as the deviant. The nature of the P-MMR remains unclear. Some researchers have associated the presence of P-MMR with maturational factors, since P-MMRs have mainly been found in infancy. Leppänen et al. (2004) reported that newborns with more mature cardiac measures tend to exhibit a larger P-MMR peaking at a latency of 150–375 ms. He, Hotson, and Trainor (2007) examined brain responses to infrequent pitch changes of piano tones in 2- to 4-month-old infants. Their data showed an increase in a left-lateralized positive slow wave at 2 and 3 months of age, whereas a faster, adult-like MMN was present only in 3- to 4-month-old infants. Trainor et al. (2003) reported that the mismatch response transformed from positive to negative between 2 and 6 months of age. Kushnerenko, Ceponiene, Balan, Fellman, and Näätänen (2002) longitudinally traced the development of pitch change detection in a group


of infants from birth until 12 months of age. Their data showed that the adult-like MMN stabilized between 3 and 6 months of age, although substantial variability in MMN within the same infant across ages was observed. These findings imply that the adult-like MMN becomes prominent whereas the P-MMR diminishes with advancing age, so neural maturation may play a role in the developmental changes of mismatch responses to auditory change detection. Other studies have suggested that the presence of P-MMR may reflect difficulty in auditory discrimination rather than maturation alone, since P-MMR is not restricted to infancy and has also been reported in young children and adults. For example, Maurer and colleagues reported that 6- to 7-year-old children exhibited P-MMR to frequency (1000 Hz vs. 1060 or 1030 Hz) and phoneme (’ba’ vs. ’ta’ or ’da’) deviances that were substantially smaller, and presented with a shorter inter-stimulus interval (ISI), than those used in most studies. Additionally, children at risk of dyslexia tend to exhibit a more positive P-MMR than their age-matched controls (Maurer, Bucher, Brem, & Brandeis, 2003). Ahmmed, Clarke, and Adams (2008) reported that 7- to 11-year-old children showed P-MMR to a 2% frequency deviance relative to a 1000 Hz standard with a relatively short ISI (400 ms). Children with specific language impairment required more than 10% deviance to generate MMN, whereas their unimpaired age-matched controls required only 2–5% deviance to generate comparable MMN amplitudes. Moreover, P-MMR may be observed in adults when the deviance is extremely difficult to discriminate. Kuo and colleagues (2014) examined the impact of spectral resolution on MMN by using naturally spoken Mandarin tones and cochlear implant (CI) simulations varying in the number of frequency channels as stimuli. The one-channel CI-simulated Mandarin tone deviance elicited P-MMR in adults with normal hearing, whereas CI simulations with more than eight channels, as well as the natural speech-sound deviance, elicited typical MMN (Kuo, Lee, Chen, Liu, & Cheng, 2014). These results indicate that stimulus-related factors, such as a short ISI and a small deviance, which limit the discriminability between standard and deviant, determine the elicitation of P-MMR. Morr et al. (2002) reported that a small deviance (1000 vs. 1200 Hz) elicited P-MMR in infants younger than 12 months old and that the adult-like MMN was not found until 4 years of age. Conversely, a large deviance (1000 vs. 2000 Hz) elicited adult-like MMN in most participants even in the youngest age group (2–7 months of age). The coexistence of P-MMR to small deviance and MMN to large deviance at the same age supports the view that the magnitude of deviance is one of the decisive factors for MMR polarity. Furthermore, children with specific language impairment or a hereditary risk of dyslexia tend to show P-MMR or smaller MMN, which suggests that the polarity of MMRs additionally depends on the linguistic characteristics of individuals (Ahmmed et al., 2008; Datta, Shafer, Morr, Kurtzberg, & Schwartz, 2010). Garcia-Sierra and colleagues quantified the number of words infants encountered daily to investigate how language input modulates speech perception, using the MMR as the readout. Their results indicate that 11- to 14-month-old infants exposed to lower language input showed P-MMR to the relatively difficult English /ta/ vs. /pa/ contrast, whereas those exposed to higher language input showed MMN (Garcia-Sierra, Ramirez-Esparza, & Kuhl, 2016). In summary,


P-MMR is mainly found in young infants or in children with disadvantaged language backgrounds, especially for discriminating smaller deviances. These factors should be considered together when investigating the developmental trajectories of MMR/P-MMR patterns for Mandarin lexical tone changes.
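All of the studies reviewed in this section rest on the passive oddball design described above: rare deviants scattered among repetitive standards. The sketch below generates such a trial sequence; the 15% deviant probability and the constraint that deviants never occur back-to-back are common conventions assumed for illustration, not parameters taken from any particular study.

```python
import random

def oddball_sequence(n_trials, deviant_prob=0.15, seed=0):
    """Build a list of 'standard'/'deviant' trial labels for a passive
    oddball block. Deviants are rare and never adjacent, so every deviant
    is preceded by at least one standard."""
    rng = random.Random(seed)
    n_deviants = int(n_trials * deviant_prob)
    seq = ["standard"] * n_trials
    placed = 0
    while placed < n_deviants:
        i = rng.randrange(1, n_trials)  # never start the block with a deviant
        left_ok = seq[i - 1] == "standard"
        right_ok = i + 1 >= n_trials or seq[i + 1] == "standard"
        if seq[i] == "standard" and left_ok and right_ok:
            seq[i] = "deviant"
            placed += 1
    return seq

seq = oddball_sequence(400)
```

A multi-deviant variant (as in the Mandarin tone studies discussed in the next section, with T2 and T1 as small and large deviants against a T3 standard) would simply draw each deviant label from a set of deviant types.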

6.3 MMN Studies of Mandarin Lexical Tone Acquisition

Mandarin Chinese is a tonal language that uses pitch variations at the syllable level to determine lexical meaning. For example, each of the four tones confers a distinct meaning on the syllable yi, e.g., yi1 ‘clothes’; yi2 ‘aunt’; yi3 ‘chair’; yi4 ‘easy’. The four lexical tones are categorized phonologically as a high-level tone (T1), a high-rising tone (T2), a low-dipping tone (T3), and a high-falling tone (T4). A number of studies have suggested that pitch contour and pitch height are crucial for characterizing Mandarin tones (Gandour, 1983; Gandour & Harshman, 1978; Jokisch & Jensen, 2007; Klimesch, Sauseng, & Hanslmayr, 2006; Lin et al., 2007). In terms of pitch contour and direction, T2 and T3 are acoustically more similar to each other than T1 and T3 are. Behavioral studies with tonal discrimination and identification tasks have confirmed that T2 and T3 are more often confused with each other than other tonal pairs are (Gandour, 1983; Gandour & Harshman, 1978). Previous ERP studies have used the MMN paradigm to investigate brain responses to lexical tone changes. Luo et al. (2006) first examined MMN responses to changes in initial consonants (bai1, sai1, dai1, tai1) and in lexical tones (bai1, bai2, bai3, bai4) in native Mandarin speakers. An opposite pattern of hemispheric lateralization was found for the MMN elicited by the lexical tone and the consonant contrasts. Regardless of the magnitude of deviance, tonal changes elicited MMN of larger magnitude and stronger dipole strength in the right hemisphere, whereas segmental changes elicited greater responses in the left hemisphere. Moreover, ERP studies have indicated that the magnitude of deviance in lexical tone changes affects the latency and amplitude of MMNs (Chandrasekaran, Krishnan, & Gandour, 2007; Cheng et al., 2013; Lee et al., 2012; Tsang, Jia, Huang, & Chen, 2011).
For example, the acoustically distinct T1/T3 contrast yielded a larger MMN with an earlier peak latency than the acoustically similar T2/T3 contrast. However, this effect of acoustic similarity on MMN was found exclusively in native Chinese speakers, not in native English speakers (Chandrasekaran et al., 2007). Hsu et al. (2014) performed MEG to investigate the neural substrates underlying the MMN elicited by Mandarin lexical tone changes. Infrequent deviants, T1 and T2, were embedded in a chain of a frequent standard, T3, to induce large and small lexical tone changes, respectively. Consistent with ERP studies, the magnetic mismatch response (MMNm) to lexical tone changes was sensitive to the size of deviance. More specifically, the acoustically distinct T1/T3 contrast elicited an earlier and larger MMNm than did the acoustically similar T2/T3 contrast. This confirmed that the T1/T3 contrast was easier to discriminate and therefore evoked a much more pronounced MMNm response than the T2/T3 contrast. However, such an


effect of the size of deviance was found in the left hemisphere, but not the right hemisphere, suggesting a left-hemispheric dominance of the MMNm for Mandarin lexical tone changes (but see Yu, Wang, & Li, Chap. 5, this volume). The left lateralization of the lexical tone MMNm was further supported by distributed source analysis of the MMNm generators, particularly for the large deviant (T1/T3 contrast). It was concluded that a native Mandarin speaker’s MMNm response to lexical tone changes is initially generated in the superior temporal gyrus (STG) of both hemispheres. Greater left lateralization in the STG and middle temporal gyrus was found for the large deviance, indicating a left-hemisphere dominance for detecting large lexical tone changes. This study also revealed that the laterality decreased as the differences between the standard and deviant sounds became less discriminable. In addition, the neural generators of the lexical tone MMNm could be seen in several frontal regions. For example, activity in the left anterior insula and right anterior cingulate cortex was involved only in MMNm responses to the T1/T3 contrast, not in those to the T2/T3 contrast; these findings suggest that these two regions are involved in switching attention to salient changes. In contrast, right ventral orbitofrontal cortex activation was found only for the T2/T3 contrast, not the T1/T3 contrast, and has been associated with involuntary amplification or functional inhibition mechanisms. Most studies on infant speech perception have focused on the phonological development of consonants and vowels; there is scarce information on when and how infants learn lexical tones. Only a few studies have addressed this issue by using speech discrimination paradigms to investigate lexical tone perception in infants with different language backgrounds (tonal versus non-tonal language exposure) (see also Tsao and Liu, Chap. 10, this volume).
Although some mixed findings have been reported, it is generally agreed that the development of lexical tone perception in infants follows a progression similar to that charted for consonants and vowels. For example, Mattock and Burnham (2006) tested Thai lexical tone discrimination in Chinese- and English-learning infants at 6 and 9 months of age using the conditioned head-turn procedure. They found that Chinese infants remained sensitive to Thai tonal contrasts at both time points, while English infants showed declined discrimination of lexical tones at 9 months of age. Their findings suggest that tonal language-learning infants display a perceptual narrowing for lexical tones within the first year of life, even for tones not from their native language. Yeung, Chen, and Werker (2013) examined English-, Cantonese-, and Mandarin-learning infants’ discrimination of Cantonese tones. All three groups exhibited distinct tone preferences at 4 months of age. Consistent with the results reported by Mattock and Burnham (2006), English-learning infants were not able to discriminate the same tone contrast at 9 months of age. However, Mandarin- and Cantonese-learning infants displayed language-specific differences in tone preferences at both age points, suggesting that language-selective perception of lexical tones in tonal-language learners emerges by at least 4 months of age. Notably, acoustic distinctiveness plays an important role in the acquisition of lexical tones. Studies on the development of speech production have suggested that T1 and T4 production are mastered earlier than T2 and T3 (Hua & Dodd, 2000; Li


& Thompson, 1977; Lin, Huang, Huang, & Hsuan, 2008). Three-year-old Mandarin-speaking children easily confuse T3 with T2 in the picture-pointing task (Wong, Schwartz, & Jenkins, 2005). Using the head-turn procedure, Tsao (2008) reported that 12-month-old infants discriminated the T1/T3 contrast more accurately than the T2/T3 and T2/T4 contrasts. These findings suggest that acoustic discriminability plays an important role in the development of lexical tone sensitivity. In addition, Tsao (2017) explored the developmental trend in discriminating the T1/T3, T2/T3, and T2/T4 contrasts in Mandarin-learning infants at 6–8 and 10–12 months of age. The data revealed that Mandarin-learning infants could discriminate all three contrasts at both age points. However, they showed improved sensitivity in discriminating T1/T3, the most salient tone contrast, at 10–12 months of age, whereas no such improvement was found for the less salient contrast (T2/T3). Altogether, tone language-learning infants develop more accurate representations of lexical tones around 12 months of age. Furthermore, the language background of infants and the acoustic salience of speech units both play critical roles in modulating the time course of lexical tone acquisition. Only a few studies have addressed the development of lexical tone sensitivity with ERPs. Lee and colleagues applied a multi-deviant oddball paradigm, with T3 as the standard and T2 and T1 as small and large deviants, respectively, to explore how neural maturation and acoustic saliency modulate MMRs in adulthood, infancy, and childhood (Cheng et al., 2013, 2015; Hsu, Lee, & Liang, 2016; Lee et al., 2012). Cheng et al. (2013) applied the same set of stimuli to further explore MMRs to lexical tone changes in newborns and 6-month-old infants. ERP data from newborns were collected within 13 days of birth, while they were asleep. For newborns, the large deviant (T1/T3 contrast) elicited P-MMRs at 300–500 ms at the left frontal site (F3), while the small deviant (T2/T3 contrast) did not yield any significant MMR. Six-month-old infants were subgrouped according to their state during ERP recording: ten infants were awake and 13 were sleeping. The data from awake 6-month-old infants showed a significant MMN to T1/T3 at 150–250 ms and a P-MMR to T2/T3 at 300–450 ms. In sleeping 6-month-old infants, no significant MMR was found. These data, obtained using electrophysiological recording techniques that do not require infants’ overt responses, suggest that the acoustic saliency effect is evident as early as at birth. In addition, MMRs to a large deviant lexical tone contrast transitioned from P-MMR at birth to an adult-like MMN at 6 months of age. These findings suggest that 6-month-old infants are able to automatically discriminate T1 and T3, and the polarity transition of MMRs may reflect the maturation of speech perception. Cheng’s follow-up study examined the development of MMRs to Mandarin lexical tones from 12 to 24 months of age (Cheng & Lee, 2018). Since adult-like MMN to the T1/T3 contrast had already been found at 6 months of age, sustained MMNs at 12, 18, and 24 months were expected. The critical question was when the T2/T3 contrast would begin to elicit an adult-like MMN. The data revealed that the T2/T3 contrast elicited P-MMRs at 12 and 18 months of age but no significant MMR at 24 months of age. In fact, using the same set of stimuli, the large deviant T1/T3 contrast elicited MMN at 4, 5, and 6 years of age, but the small deviant T2/T3 contrast


only elicited P-MMR in the 5- and 6-year-old groups. Yang et al. (2015) also applied the same paradigm to examine MMRs to lexical tone changes in children with and without attention-deficit/hyperactivity disorder (ADHD). Both children with ADHD (mean age 9.15 years) and their age-matched controls (mean age 10.82 years) showed typical MMN to the large deviant (T1/T3). However, for the small deviant (T2/T3), the control group showed P-MMR between 200 and 350 ms, while the ADHD group showed no MMR but a late discriminative negativity (LDN). Liu, Chen, and Tsao (2014) examined developmental changes in MMRs to the synthesized lexical tone pair /i2/ and /i3/ in adults, preschoolers (mean age 3.40 years), and school-aged children (mean age 8.57 years). Although the adults showed a typical MMN at 185–335 ms, the two groups of children showed no MMR but an LDN in a later time window. The stimuli used by Liu et al. (2014) were comparable to the small deviant used by Cheng et al. (2013) and Lee et al. (2012). Taken together, the findings indicate that the acoustically salient contrast T1/T3 elicits typical MMN in infants as early as 6 months of age, whereas the acoustically similar contrast T2/T3 elicits P-MMR from birth to 18 months of age and then no MMR from 2 to 4 years of age. For children aged 5–10 years, whether T2/T3 can elicit a significant MMR remains controversial across studies. The absence of MMR at certain ages has been reported in other studies as well. For example, Morr et al. (2002) reported that a large frequency change (1000 vs. 2000 Hz) elicited MMN in infants aged 3–47 months. However, a smaller frequency change (1000 vs. 1200 Hz) elicited P-MMR in groups under 12 months of age, while no significant MMR could be found between 13 and 47 months.
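The polarity bookkeeping running through these findings can be made concrete with a toy classifier: take the mean difference-wave amplitude in a latency window and call it MMN if clearly negative, P-MMR if clearly positive, and no MMR otherwise. The amplitude threshold and the latency windows below are illustrative stand-ins for the statistical tests actually used in the studies reviewed.

```python
import numpy as np

def classify_mmr(diff_wave, times, window, threshold=0.5e-6):
    """Label a mismatch response by the sign of the mean difference-wave
    amplitude in a latency window. threshold (in volts) is an illustrative
    proxy for statistical significance, not a real test."""
    mask = (times >= window[0]) & (times <= window[1])
    mean_amp = diff_wave[mask].mean()
    if mean_amp <= -threshold:
        return "MMN"
    if mean_amp >= threshold:
        return "P-MMR"
    return "no MMR"

# Toy difference waves: an adult-like negativity near 200 ms and an
# infant-like positivity near 375 ms.
times = np.arange(0.0, 0.6, 0.001)
adult_like = -2e-6 * np.exp(-((times - 0.200) ** 2) / (2 * 0.03 ** 2))
infant_like = 2e-6 * np.exp(-((times - 0.375) ** 2) / (2 * 0.03 ** 2))
label_adult = classify_mmr(adult_like, times, window=(0.15, 0.25))
label_infant = classify_mmr(infant_like, times, window=(0.30, 0.45))
```

Under this scheme, a contrast could flip from "P-MMR" to "no MMR" to "MMN" across ages as a growing negativity overlaps and eventually outweighs the positivity, which mirrors the developmental pattern discussed next.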
To account for the presence and diminution of P-MMRs during development, Shafer, Yu, and Datta (2010) proposed that the P-MMR indexes the detection and encoding of the acoustic properties of a stimulus in afferent (input) connections to the primary auditory cortex. Thus, the P-MMR may reflect a greater recovery of P1 from refractoriness in the neural populations firing to the deviant compared with the standard (Kushnerenko et al., 2007; Shafer et al., 2010). He et al. (2007) showed that, although these two types of mismatch responses can coexist in 3-month-old infants, they can be separated using different band-filter settings. Their data demonstrated that the adult-like MMN became apparent while the slow positivity diminished as infants matured from 2 to 4 months of age (He, Hotson, & Trainor, 2007). Therefore, the positivity could potentially mask the negativity owing to their overlapping latencies. Studies have shown that the P-MMR typically decreases in amplitude with increasing age and is generally absent by 8 years of age (Datta et al., 2010; Shafer et al., 2010), except for very fine discriminations (Ahmmed et al., 2008). The presence of P-MMR in infants might thus reflect not a larger positivity per se but the absence of an overlapping MMN. This also suggests that, at some point during maturation, the two types of responses might cancel each other out and produce a null effect for a specific contrast at a specific age.

6 Neurophysiological Studies of Mandarin Lexical Tone Acquisition …

6.4 MMRs to Lexical Tone Changes in Children with Difficulties in Learning to Read

Reading is one of the most remarkable human skills for acquiring and exchanging information in daily life. Unfortunately, across languages, approximately 2–10% of children experience difficulties in learning to read despite normal intelligence and good educational opportunities, a condition called developmental dyslexia. Empirical evidence suggests that phonological recoding is the most critical component in learning to read, especially in the early phases of reading acquisition (Share, 1995; Sprenger-Charolles, Siegel, Bechennec, & Serniclaes, 2003). It is widely believed that impaired phonological processing is the key deficit in developmental dyslexia (Snowling, 2000). Indeed, phonetic perception in infancy has been associated with later language development (Kuhl, Conboy, Padden, Nelson, & Pruitt, 2005; Tsao, Liu, & Kuhl, 2004), which suggests that perceptual learning provides a foundation for later and more abstract language learning (Werker & Yeung, 2005). Kraus et al. (1996) reported a nearly absent mismatch response to deviant consonant–vowel syllables in learning-impaired children compared to normal controls; moreover, this result was correlated with behavioral discrimination of rapidly changing speech stimuli. Schulte-Korne, Deimel, Bartling, and Remschmidt (2001) observed that the MMN to speech sounds (/ba/ vs. /da/) in boys with dyslexia was attenuated but not absent. Bonte et al. used the MMN paradigm to study the implicit processing of phonological regularities in children with and without reading difficulties (Bonte, Mitterer, Zellagui, Poelmans, & Blomert, 2005; Bonte, Poelmans, & Blomert, 2007). An enhanced MMN response to non-words with a high versus low phonotactic probability was found in children with normal reading abilities.
However, children with dyslexia did not show this sensitivity to phonotactic probability. These findings imply that the MMN might serve as a neural marker for the early identification of children at risk of language delay and reading difficulty. Chinese is characterized as a morphosyllabic writing system. The basic Chinese written unit, the character, is constructed from a combination of stroke patterns and radicals within a constant square-shaped space. As Chinese orthography contains no representations at the phonemic level, some researchers believe that there is a closer connection between graphic forms and meanings in Chinese and that phonological knowledge may not be crucial in learning to read. Some studies have suggested that Chinese dyslexia may arise from deficits in visual-spatial analysis (Huang & Hanley, 1995; Siok, Spinks, Jin, & Tan, 2009); in contrast, other studies have demonstrated that both rapid automatized naming and phonological awareness predict Chinese children's reading performance, even after controlling for participants' IQ, parents' education, and socioeconomic status (Ho, Chan, Chung, Lee, & Tsang, 2007; Ho, Leung, & Cheung, 2011; Shu, McBride-Chang, Wu, & Liu, 2006). Older readers with dyslexia performed worse in phonemic awareness than typically developing younger readers with matched reading abilities did (Goswami et al., 2010). Zhang et al. (2012) examined the categorical perception of lexical tone

C.-Y. Lee and Y.-Y. Cheng

in children with or without developmental dyslexia. Both the typically developing and dyslexic groups showed MMN to the across- and within-category deviants. However, the categorical perception effect, that is, the enhanced MMN to across-category deviants compared with within-category deviants, was found in the typically developing group but not in children with dyslexia. These data indicate that children with dyslexia may have a general deficit in the categorical perception of lexical tones. Children with dyslexia and those with ADHD have some similar learning characteristics. As observed for children with ADHD, those with dyslexia may exhibit inattention or distractibility in the classroom because reading activities are demanding, resulting in fatigue and an inability to sustain concentration for the entire class. Furthermore, children with ADHD may also show reading dysfluency, which negatively affects reading comprehension and causes academic failure similar to that seen in dyslexic children. Therefore, it may be difficult for parents and teachers to distinguish between ADHD and dyslexia, especially for those with little experience of these two types of learning disorders. Yang et al. (2015) measured the MMRs to lexical and pure tone changes in children with or without ADHD. Specifically, different ERP components can be used to index auditory change detection at different stages of attentional control: pre-attentive change detection indexed by the MMR, the involuntary orientation of attention indexed by the P3a, and the orientation of attention for further evaluation reflected by the LDN. Unlike dyslexic children, who show reduced MMN to lexical tones (Zhang et al., 2012), children with and without ADHD showed no MMN differences. This finding suggests that children with ADHD have no problem developing phonological representations for lexical tones that induce MMRs at the pre-attentive stage.
However, children with ADHD did show more attenuated P3a and enhanced LDN responses to both pure tone and lexical tone changes than their controls did, indicating deficits in involuntary attention switching and voluntary attentional reorienting while processing auditory deviations. This interpretation was further supported by a correlation analysis in which children with a higher ADHD tendency, indexed by parents' and teachers' ratings of children's ADHD symptoms, tended to show greater attenuation of the P3a amplitude. In addition, Chen, Tsao, and Liu (2016) conducted a longitudinal study to investigate the development of MMRs to lexical tone changes (T2/T3) in late-talking children and children with typical language development (TLD) at 3, 5, and 6 years of age. The late-talking children were subdivided into persistent language delay (PLD) and late bloomer (LB) groups based on their language performance at 4 years old. The group difference was mainly found at 3 years old, at which age the typically developing children showed no typical MMN in the early time window (185–335 ms), while both late-talking groups (PLD and LB) showed P-MMR. Congruent with the study of Liu et al. (2014), LDN in the later time window, but no adult-like MMN, was observed in TLD children at all ages, indicating that children at these ages could not automatically discriminate between T2 and T3 at the pre-attentive stage. For the PLD group, the P-MMR was present at 3 years old, and no MMR was observed at later ages. Given that the P-MMR has been taken to reflect immature or poor phonetic representations, the data suggest that the development of fine-grained lexical tone

representations was delayed in children with PLD between 3 and 5 years old. Critically, the MMR measured in children aged 3 years correlated with the language outcome at 6 years, suggesting that brain responses to lexical tone discrimination may predict children's later language performance. Taken together, these findings imply that the ERP components MMR, P3a, and LDN may serve as neural markers for the early identification of children at risk of impaired language and reading development and for the differential diagnosis of ADHD and dyslexia.

6.5 Conclusion

This chapter provides an overview of a series of studies that applied the multiple-deviant paradigm to investigate the developmental trajectories of MMRs to Mandarin lexical tones in adults and in developing populations, including infants, toddlers, preschoolers, and school-age children with or without learning disabilities. The data from adults demonstrated that the amplitude and latency of the MMN can be modulated by the size of the deviance. Specifically, the acoustically distinct T1/T3 contrast yielded a larger MMN with an earlier peak latency than did the acoustically similar T2/T3 contrast (Chandrasekaran et al., 2007; Cheng et al., 2013; Hsu et al., 2016). Moreover, the MMNs to lexical tone changes were mainly generated in the left hemisphere (Hsu et al., 2014). The data measured in early infancy showed that the T1/T3 contrast elicited P-MMR in newborns. The adult-like MMN first appeared in infants at 6 months of age and was sustained in infants at 12, 18, and 24 months of age and in preschoolers from 4 to 6 years of age. The small deviant T2/T3 did not elicit MMR in newborns. P-MMR to T2/T3 was found in infants at 6, 12, and 18 months, but not in those aged 24 months. These studies observed the coexistence of MMN and P-MMR in the same age group when children responded to different magnitudes of deviance. The finding of a transition from a predominantly positive to a predominantly negative response supports the existence of multiple MMN mechanisms. The adult-like MMN became more prominent, whereas the P-MMR diminished, as age increased. Given that the P-MMR was more likely to be observed in younger infants, especially in response to less discriminable changes, the literature suggests that the presence of the P-MMR reflects an immature brain response to changes. Conversely, a more mature brain response is reflected by the MMN, which has been used to index automatic auditory change detection.
Our data suggest that the transition of the MMR from positive to negative might provide information on whether children's speech perception is an automatic process at various developmental stages. Furthermore, a few studies have reported the absence of MMR to lexical tone changes in children with language delay or reading difficulties. These findings are compatible with the behavioral evidence that, for Mandarin Chinese, the awareness of lexical tone shows a stronger association with Chinese reading than does the awareness of initial consonants (McBride-Chang et al., 2008; Shu, Peng, & McBride-Chang, 2008), suggesting that suprasegmental perception may be particularly important in exploring Chinese reading development. Further studies examining the development

of MMRs to different Mandarin syllabic features in typically and atypically developing children will be critical for the early identification of children with language impairments.

References

Ahmmed, A. U., Clarke, E. M., & Adams, C. (2008). Mismatch negativity and frequency representational width in children with specific language impairment. Developmental Medicine and Child Neurology, 50(12), 938–944. https://doi.org/10.1111/j.1469-8749.2008.03093.x. Alho, K. (1995). Cerebral generators of mismatch negativity (MMN) and its magnetic counterpart (MMNm) elicited by sound changes. Ear and Hearing, 16(1), 38–51. Bonte, M. L., Mitterer, H., Zellagui, N., Poelmans, H., & Blomert, L. (2005). Auditory cortical tuning to statistical regularities in phonology. Clinical Neurophysiology, 116(12), 2765–2774. https://doi.org/10.1016/j.clinph.2005.08.012. Bonte, M. L., Poelmans, H., & Blomert, L. (2007). Deviant neurophysiological responses to phonological regularities in speech in dyslexic children. Neuropsychologia, 45(7), 1427–1437. https://doi.org/10.1016/j.neuropsychologia.2006.11.009. Brannon, E. M., Libertus, M. E., Meck, W. H., & Woldorff, M. G. (2008). Electrophysiological measures of time processing in infant and adult brains: Weber's Law holds. Journal of Cognitive Neuroscience, 20(2), 193–203. Brannon, E. M., Roussel, L. W., Meck, W. H., & Woldorff, M. (2004). Timing in the baby brain. Cognitive Brain Research, 21(2), 227–233. Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128(1), 148–156. https://doi.org/10.1016/j.brainres.2006.10.064. Chen, Y., Tsao, F. M., & Liu, H. M. (2016). Developmental changes in brain response to speech perception in late-talking children: A longitudinal MMR study. Developmental Cognitive Neuroscience, 19, 190–199. https://doi.org/10.1016/j.dcn.2016.03.007. Cheng, Y. Y., & Lee, C. Y. (2018). The development of mismatch responses to Mandarin lexical tone in 12- to 24-month-old infants. Frontiers in Psychology, 9, 448. https://doi.org/10.3389/fpsyg.2018.00448. Cheng, Y. Y., Wu, H. C., Tzeng, Y. L., Yang, M.
T., Zhao, L. L., & Lee, C. Y. (2013). The development of mismatch responses to Mandarin lexical tones in early infancy. Developmental Neuropsychology, 38(5), 281–300. https://doi.org/10.1080/87565641.2013.799672. Cheng, Y. Y., Wu, H. C., Tzeng, Y. L., Yang, M. T., Zhao, L. L., & Lee, C. Y. (2015). Feature-specific transition from positive mismatch response to mismatch negativity in early infancy: Mismatch responses to vowels and initial consonants. International Journal of Psychophysiology, 96(2), 84–94. https://doi.org/10.1016/j.ijpsycho.2015.03.007. Cheour-Luhtanen, M., Alho, K., Kujala, T., Sainio, K., Reinikainen, K., Renlund, M., …, Näätänen, R. (1995). Mismatch negativity indicates vowel discrimination in newborns. Hearing Research, 82(1), 53–58. Cheour, M., Ceponiene, R., Lehtokoski, A., Luuk, A., Allik, J., Alho, K., & Näätänen, R. (1998). Development of language-specific phoneme representations in the infant brain. Nature Neuroscience, 1(5), 351–353. Cheour, M., Ceponiene, R., Leppänen, P., Alho, K., Kujala, T., Renlund, M., …, Näätänen, R. (2002). The auditory sensory memory trace decays rapidly in newborns. Scandinavian Journal of Psychology, 43(1), 33–39. Datta, H., Shafer, V. L., Morr, M. L., Kurtzberg, D., & Schwartz, R. G. (2010). Electrophysiological indices of discrimination of long-duration, phonetically similar vowels in children with typical

and atypical language development. Journal of Speech Language and Hearing Research, 53(3), 757–777. https://doi.org/10.1044/1092-4388(2009/08-0123). Dehaene-Lambertz, G., & Baillet, S. (1998). A phonological representation in the infant brain. NeuroReport, 9(8), 1885–1888. Dehaene-Lambertz, G., & Dehaene, S. (1994). Speed and cerebral correlates of syllable discrimination in infants. Nature, 370(6487), 292–295. Friederici, A. D., Friedrich, M., & Weber, C. (2002). Neural manifestation of cognitive and precognitive mismatch detection in early infancy. NeuroReport, 13(10), 1251–1254. Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11(2), 149–175. Gandour, J. T., & Harshman, R. A. (1978). Crosslanguage differences in tone perception: A multidimensional scaling investigation. Language and Speech, 21(1), 1–33. https://doi.org/10.1177/002383097802100101. Garcia-Sierra, A., Ramirez-Esparza, N., & Kuhl, P. K. (2016). Relationships between quantity of language input and brain responses in bilingual and monolingual infants. International Journal of Psychophysiology, 110, 1–17. https://doi.org/10.1016/j.ijpsycho.2016.10.004. Goswami, U., Wang, S. H. L., Cruz, A., Fosker, T., Mead, N., & Huss, M. (2010). Language-universal sensory deficits in developmental dyslexia: English, Spanish, and Chinese. Journal of Cognitive Neuroscience, 23(2), 325–337. https://doi.org/10.1162/jocn.2010.21453. Guttorm, T. K., Leppänen, P. H., Hamalainen, J. A., Eklund, K. M., & Lyytinen, H. J. (2010). Newborn event-related potentials predict poorer pre-reading skills in children at risk for dyslexia. Journal of Learning Disabilities, 43(5), 391–401. Guttorm, T. K., Leppänen, P. H., Poikkeus, A. M., Eklund, K. M., Lyytinen, P., & Lyytinen, H. (2005). Brain event-related potentials (ERPs) measured at birth predict later language development in children with and without familial risk for dyslexia. Cortex, 41(3), 291–303. He, C., Hotson, L., & Trainor, L. J. (2007).
Mismatch responses to pitch changes in early infancy. Journal of Cognitive Neuroscience, 19(5), 878–892. https://doi.org/10.1162/jocn.2007.19.5.878. Ho, C. S., Chan, D. W., Chung, K. K., Lee, S.-H., & Tsang, S.-M. (2007). In search of subtypes of Chinese developmental dyslexia. Journal of Experimental Child Psychology, 97(1), 61–83. https://doi.org/10.1016/j.jecp.2007.01.002. Ho, C. S., Leung, M. T., & Cheung, H. (2011). Early difficulties of Chinese preschoolers at familial risk for dyslexia: Deficits in oral language, phonological processing skills, and print-related skills. Dyslexia, 17(2), 143–164. https://doi.org/10.1002/dys.429. Hsu, C. H., Lee, C. Y., & Liang, W. K. (2016). An improved method for measuring mismatch negativity using ensemble empirical mode decomposition. Journal of Neuroscience Methods, 264, 78–85. https://doi.org/10.1016/j.jneumeth.2016.02.015. Hsu, C. H., Lin, S. K., Hsu, Y. Y., & Lee, C. Y. (2014). The neural generators of the mismatch responses to Mandarin lexical tones: An MEG study. Brain Research. https://doi.org/10.1016/j.brainres.2014.07.023. Hua, Z., & Dodd, B. (2000). The phonological acquisition of Putonghua (modern standard Chinese). Journal of Child Language, 27(1), 3–42. Huang, H.-S., & Hanley, J. R. (1995). Phonological awareness and visual skills in learning to read Chinese and English. Cognition, 54(1), 73–98. Jing, H., & Benasich, A. A. (2006). Brain responses to tonal changes in the first two years of life. Brain Development, 28(4), 247–256. Jokisch, D., & Jensen, O. (2007). Modulation of gamma and alpha activity during a working memory task engaging the dorsal or ventral stream. Journal of Neuroscience, 27(12), 3244–3251. Klimesch, W., Sauseng, P., & Hanslmayr, S. (2006). EEG alpha oscillations: The inhibition–timing hypothesis. Brain Research Reviews, 53(1), 63–88. Kraus, N., McGee, T. J., Carrell, T. D., Zecker, S. G., Nicol, T. G., & Koch, D. B. (1996).
Auditory neurophysiologic responses and discrimination deficits in children with learning problems. Science, 273(5277), 971–973.

Kraus, N., Micco, A. G., Koch, D. B., McGee, T., Carrell, T., Sharma, A., & Weingarten, C. Z. (1993). The mismatch negativity cortical evoked potential elicited by speech in cochlear-implant users. Hearing Research, 65(1–2), 118–124. Kuhl, P. K., Conboy, B. T., Padden, D., Nelson, T., & Pruitt, J. (2005). Early speech perception and later language development: Implications for the "critical period." Language Learning and Development, 1(3–4), 237–264. https://doi.org/10.1080/15475441.2005.9671948. Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608. Kuo, Y. C., Lee, C. Y., Chen, M. C., Liu, T. L., & Cheng, S. K. (2014). The impact of spectral resolution on the mismatch response to Mandarin Chinese tones: An ERP study of cochlear implant simulations. Clinical Neurophysiology, 125(8), 1568–1575. https://doi.org/10.1016/j.clinph.2013.11.035. Kushnerenko, E., Ceponiene, R., Balan, P., Fellman, V., & Näätänen, R. (2002). Maturation of the auditory change detection response in infants: A longitudinal ERP study. NeuroReport, 13(15), 1843–1848. Kushnerenko, E., Cheour, M., Ceponiene, R., Fellman, V., Renlund, M., Soininen, K., …, Näätänen, R. (2001). Central auditory processing of durational changes in complex speech patterns by newborns: An event-related brain potential study. Developmental Neuropsychology, 19(1), 83–97. Kushnerenko, E., Winkler, I., Horváth, J., Näätänen, R., Pavlov, I., Fellman, V., et al. (2007). Processing acoustic change and novelty in newborn infants. European Journal of Neuroscience, 26, 265–274. Lang, A. H., Eerola, O., Korpilahti, P., Holopainen, I., Salo, S., & Aaltonen, O. (1995). Practical issues in the clinical application of mismatch negativity. Ear and Hearing, 16(1), 118–130. Lee, C.-Y., Yen, H.-L., Yeh, P.-W., Lin, W.-H., Cheng, Y.-Y., Tzeng, Y.-L., & Wu, H.-C. (2012).
Mismatch responses to lexical tone, initial consonant, and vowel in Mandarin-speaking preschoolers. Neuropsychologia, 50(14), 3228–3239. https://doi.org/10.1016/j.neuropsychologia.2012.08.025. Leppänen, P. H., Eklund, K. M., & Lyytinen, H. (1997). Event-related brain potentials to change in rapidly presented acoustic stimuli in newborns. Developmental Neuropsychology, 13(2), 175–204. Leppänen, P. H., Guttorm, T. K., Pihko, E., Takkinen, S., Eklund, K. M., & Lyytinen, H. (2004). Maturational effects on newborn ERPs measured in the mismatch negativity paradigm. Experimental Neurology, 190(Suppl 1), S91–101. https://doi.org/10.1016/j.expneurol.2004.06.002. Li, C. N., & Thompson, S. A. (1977). The acquisition of tone in Mandarin-speaking children. Journal of Child Language, 4, 185–199. Lin, B. G., Huang, Y. C., Huang, K. C., & Hsuan, C. H. (2008). Language disorder scale for preschoolers-revised. Taipei, Taiwan: Department of Special Education, National Taiwan Normal University. Lin, Y. Y., Hsiao, F. J., Shih, Y. H., Yiu, C. H., Yen, D. J., Kwan, S. Y., & Ho, L. T. (2007). Plastic phase-locking and magnetic mismatch response to auditory deviants in temporal lobe epilepsy. Cerebral Cortex, 17(11), 2516–2525. https://doi.org/10.1093/cercor/bhl157. Liu, H. M., Chen, Y., & Tsao, F. M. (2014). Developmental changes in mismatch responses to Mandarin consonants and lexical tones from early to middle childhood. PLoS ONE, 9(4), e95587. https://doi.org/10.1371/journal.pone.0095587. Luo, H., Ni, J. T., Li, Z. H., Li, X. O., Zhang, D. R., Zeng, F. G., & Chen, L. (2006). Opposite patterns of hemisphere dominance for early auditory processing of lexical tones and consonants. Proceedings of the National Academy of Sciences USA, 103(51), 19558–19563. https://doi.org/10.1073/pnas.0607065104. Martynova, O., Kirjavainen, J., & Cheour, M. (2003). Mismatch negativity and late discriminative negativity in sleeping human newborns. Neuroscience Letters, 340(2), 75–78. Mattock, K., & Burnham, D.
(2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265. https://doi.org/10.1207/S15327078in1003_3.

Maurer, U., Bucher, K., Brem, S., & Brandeis, D. (2003). Altered responses to tone and phoneme mismatch in kindergartners at familial dyslexia risk. NeuroReport, 14(17), 2245–2250. https://doi.org/10.1097/01.wnr.0000096518.69073.a7. McBride-Chang, C., Tong, X., Shu, H., Wong, A. M. Y., Leung, K. W., & Tardif, T. (2008). Syllable, phoneme, and tone: Psycholinguistic units in early Chinese and English word recognition. Scientific Studies of Reading, 12(2), 171–194. Molfese, D. L., & Molfese, V. J. (1985). Electrophysiological indices of auditory discrimination in newborn infants: The bases for predicting later language development? Infant Behavior and Development, 8, 197–211. Molfese, D. L., & Molfese, V. J. (1997). Discrimination of language skills at five years of age using event-related potentials recorded at birth. Developmental Neuropsychology, 13(2), 135–156. Moon, C., Lagercrantz, H., & Kuhl, P. K. (2013). Language experienced in utero affects vowel perception after birth: A two-country study. Acta Paediatrica, 102(2), 156–160. https://doi.org/10.1111/apa.12098. Morr, M. L., Shafer, V. L., Kreuzer, J. A., & Kurtzberg, D. (2002). Maturation of mismatch negativity in typically developing infants and preschool children. Ear and Hearing, 23(2), 118–136. Näätänen, R., Lehtokoski, A., Lennes, M., Cheour, M., Huotilainen, M., & Iivonen, A. M. (1997). Language-specific phoneme representations revealed by electric and magnetic brain responses. Nature, 385(6615), 432–434. Näätänen, R., Kujala, T., & Winkler, I. (2011). Auditory processing that leads to conscious perception: A unique window to central auditory processing opened by the mismatch negativity and related responses. Psychophysiology, 48(1), 4–22. Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology, 118(12), 2544–2590.
Novitski, N., Huotilainen, M., Tervaniemi, M., Näätänen, R., & Fellman, V. (2007). Neonatal frequency discrimination in 250–4000-Hz range: Electrophysiological evidence. Clinical Neurophysiology, 118(2), 412–419. Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421–435. Rivera-Gaxiola, M., Silva-Pereyra, J., & Kuhl, P. K. (2005). Brain potentials to native and nonnative speech contrasts in 7- and 11-month-old American infants. Developmental Science, 8(2), 162–172. https://doi.org/10.1111/j.1467-7687.2005.00403.x. Sams, M., Paavilainen, P., Alho, K., & Näätänen, R. (1985). Auditory frequency discrimination and event-related potentials. Electroencephalography and Clinical Neurophysiology, 62(6), 437–448. Schulte-Korne, G., Deimel, W., Bartling, J., & Remschmidt, H. (2001). Speech perception deficit in dyslexic adults as measured by mismatch negativity (MMN). International Journal of Psychophysiology, 40(1), 77–87. Shafer, V. L., Yu, Y. H., & Datta, H. (2010). Maturation of speech discrimination in 4- to 7-yr-old children as indexed by event-related potential mismatch responses. Ear and Hearing, 31(6), 735–745. https://doi.org/10.1097/AUD.0b013e3181e5d1a7. Share, D. L. (1995). Phonological recoding and self-teaching: Sine qua non of reading acquisition. Cognition, 55(2), 151–218; discussion 219–226. Shu, H., McBride-Chang, C., Wu, S., & Liu, H.-Y. (2006). Understanding Chinese developmental dyslexia: Morphological awareness as a core cognitive construct. Journal of Educational Psychology, 98(1), 122–133. https://doi.org/10.1037/0022-0663.98.1.122. Shu, H., Peng, H., & McBride-Chang, C. (2008). Phonological awareness in young Chinese children. Developmental Science, 11(1), 171–181. Siok, W.-T., Spinks, J. A., Jin, Z., & Tan, L.-H. (2009).
Developmental dyslexia is characterized by the co-existence of visuospatial and phonological disorders in Chinese children. Current Biology, 19(19), R890-892. https://doi.org/10.1016/j.cub.2009.08.014. Snowling, M. J. (2000). Dyslexia (2nd ed.). Oxford, UK: Blackwell.

Sprenger-Charolles, L., Siegel, L. S., Bechennec, D., & Serniclaes, W. (2003). Development of phonological and orthographic processing in reading aloud, in silent reading, and in spelling: A four-year longitudinal study. Journal of Experimental Child Psychology, 84(3), 194–217. Trainor, L., McFadden, M., Hodgson, L., Darragh, L., Barlow, J., Matsos, L., & Sonnadara, R. (2003). Changes in auditory cortex and the development of mismatch negativity between 2 and 6 months of age. International Journal of Psychophysiology, 51(1), 5–15. Tsang, Y. K., Jia, S., Huang, J., & Chen, H. C. (2011). ERP correlates of pre-attentive processing of Cantonese lexical tones: The effects of pitch contour and pitch height. Neuroscience Letters, 487(3), 268–272. https://doi.org/10.1016/j.neulet.2010.10.035. Tsao, F. M. (2008). The effect of acoustic similarity on lexical-tone perception of one-year-old Mandarin-learning infants. Chinese Journal of Psychology, 50(2), 111–124. Tsao, F. M. (2017). Perceptual improvement of lexical tones in infants: Effects of tone language experience. Frontiers in Psychology, 8, 558. https://doi.org/10.3389/fpsyg.2017.00558. Tsao, F. M., Liu, H. M., & Kuhl, P. K. (2004). Speech perception in infancy predicts language development in the second year of life: A longitudinal study. Child Development, 75(4), 1067–1084. https://doi.org/10.1111/j.1467-8624.2004.00726.x. Werker, J. F., & Tees, R. C. (1984). Phonemic and phonetic factors in adult cross-language speech perception. Journal of the Acoustical Society of America, 75(6), 1866–1878. Werker, J. F., & Yeung, H. H. (2005). Infant speech perception bootstraps word learning. Trends in Cognitive Sciences, 9(11), 519–527. https://doi.org/10.1016/j.tics.2005.09.003. Wong, P., Schwartz, R. G., & Jenkins, J. J. (2005). Perception and production of lexical tones by 3-year-old Mandarin-speaking children. Journal of Speech, Language, and Hearing Research, 48, 1065–1079. Yang, M. T., Hsu, C. H., Yeh, P.
W., Lee, W. T., Liang, J. S., Fu, W. M., & Lee, C. Y. (2015). Attention deficits revealed by passive auditory change detection for pure tones and lexical tones in ADHD children. Frontiers in Human Neuroscience, 9, 470. https://doi.org/10.3389/fnhum.2015.00470. Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68(2), 123–139. https://doi.org/10.1016/j.jml.2012.09.004. Zhang, Y., Zhang, L., Shu, H., Xi, J., Wu, H., Zhang, Y., & Li, P. (2012). Universality of categorical perception deficit in developmental dyslexia: An investigation of Mandarin Chinese tones. Journal of Child Psychology and Psychiatry, 53(8), 874–882. https://doi.org/10.1111/j.1469-7610.2012.02528.x.

Chapter 7

Neural Processing of Tone Sandhi in Production and Perception: The Case of Mandarin Tone 3 Sandhi

Claire H. C. Chang and Wen-Jui Kuo

Abstract Language-specific and context-dependent phonological rules of lexical tone are prevalent in tone languages. Such rules are commonly referred to as tone sandhi. One of the most studied sandhi rules is Mandarin Tone 3 sandhi: in Mandarin, a Tone 3 followed by another Tone 3 is pronounced as Tone 2 (33 → 23). In this chapter, we review our current understanding of the processing of Tone 3 sandhi. Two important and relatively well-investigated questions are whether Tone 3 sandhi involves on-line tone substitution in speech production and whether the auditory representations of Tone 2 and Tone 3 are less distinct from each other due to the acquisition of Tone 3 sandhi. Recent behavioral studies demonstrated that in the lexical decision task, only Tone 3 had a facilitation effect on targets carrying the tone sequence 33, while in the picture-naming task, a facilitation effect was found for both Tone 2 and Tone 3. These results support the view that Tone 3 sandhi involves on-line tone substitution, in line with fMRI studies showing that Tone 3 sandhi resulted in higher activation in the right posterior inferior frontal gyrus (pIFG), which is known to engage in articulatory representations and their sequencing. Regarding tone perception, previous behavioral studies showed that the acquisition of Tone 3 sandhi led to worse performance at discriminating Tone 2 and Tone 3. Further, the contrast between Tone 2 and Tone 3 is consistently reported to elicit reduced MMN compared to other tone pairs, but only in native speakers. One explanation of these findings is that the auditory representations of Tone 2 and Tone 3 activate each other due to Tone 3 sandhi; that is, a high-level phonological rule could modulate pre-attentive auditory processing. In the future, the role of linguistic context in the processing of tone sandhi needs more investigation, especially regarding how listeners retrieve the correct word/morpheme based on contextual information.

C. H. C. Chang
Princeton Neuroscience Institute, Princeton University, Princeton, USA

W.-J. Kuo (B)
Institute of Neuroscience, National Yang-Ming University, Taipei, Taiwan
e-mail: [email protected]

Brain Research Center, National Yang-Ming University, Taipei, Taiwan

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_7


Fig. 7.1 Pitch contours of the four Mandarin lexical tones and their disyllabic sequences (Chang & Kuo, 2016). (The low-falling contour of the monosyllabic Tone 3 in this figure differs from the falling–rising pattern of standard Mandarin but is consistent with other studies of Taiwanese Mandarin (Chang, 2010; Li, Xiong, & Wang, 2006), which might reflect influence from the Taiwanese dialect.) Tone 3 sandhi is applied to the disyllabic Tone 3 sequence (33 → 23)

7.1 Introduction

Context-dependent phonological tone substitution is widely found in East Asian languages (Chen, 2000) and is often referred to as tone sandhi. A well-known example is Mandarin Tone 3 sandhi. Mandarin has four tones, and each syllable carries one tone. Tone 3 (T3) is pronounced as Tone 2 (T2) when it is followed by another Tone 3 (33 → 23) (Fig. 7.1) (Chao, 1948) (see also Chap. 2 in this volume). Tone 3 sandhi is a language-specific phonological rule similar to the a/an alternation in English (an apple vs. a dog) but much more frequent. The accumulated frequency of words inducing Tone 3 sandhi is around 1.6% in Mandarin (Academia Sinica Balanced Corpus of Modern Chinese: https://asbc.iis.sinica.edu.tw/index_readme.htm), approximating the frequency of the word "in" in English (1.5% according to the Corpus of Contemporary American English: https://www.wordfrequency.info/free.asp). Sandhi Tone 3 and Tone 2 are perceptually indistinguishable (Peng, 2000; Wang & Li, 1967). Because tone distinguishes words in tone languages, tone sandhi can result in word/morpheme ambiguity, e.g., 馬臉 /ma3 ljεn3/ → 麻臉 /ma2 ljεn3/¹ (horse face → hemp face). In other words, for speakers, the word they have in mind differs from the word they pronounce; for listeners, the pronounced word differs from what they subjectively perceive. Why does a phonological rule that induces word/morpheme ambiguity come to exist in the first place? In context, the ambiguity can be resolved with phonological, semantic, and syntactic information (Speer, Shih, & Slowiaczek, 1989, 2016), similar to the disambiguation of homophones. It is possible that tone sandhi increased the ease of articulation or perception in the past but became overgeneralized and reinterpreted as a categorical rule over time (Anderson, 1981; Blevins, 2006; Ohala, 1993). The pitch patterns of lexical tones in East Asian languages have undergone diachronic change and vary across dialects. Tone sandhi might have remained a categorical phonological rule even after tone pattern change stripped it of its phonetic function (Zhang & Lai, 2010). It is worth noting that not all sandhi rules involve the substitution of phonological representations. For example, the half Tone 3 rule in Mandarin simplifies the pitch contour of T3 but does not result in a categorical change or morpheme/word ambiguity. The half Tone 3 rule is believed to reflect a universal demand for ease of articulation (Xu, 2004), and its application is less dependent on language experience. Indeed, the application of Tone 3 sandhi has been reported to emerge later and less accurately during development, and it is also difficult for second-language learners (Chen, Wee, Tong, Ma, & Li, 2016). In this chapter, we focus on Mandarin Tone 3 sandhi, one of the most studied sandhi rules.

Speech production and perception involve different neural processing; we therefore discuss tone sandhi in production and in perception separately. A rough delineation of the processing of speech production and perception according to current speech models is as follows (Golfinopoulos, Tourville, & Guenther, 2010; Hickok & Poeppel, 2007a; Indefrey & Levelt, 2004; Price, 2010). In speech production, the motor representations of speech sounds are activated and sequenced in the posterior inferior frontal gyrus (pIFG) and premotor areas and executed in the motor cortex. The auditory feedback of the articulation is then processed in the superior temporal gyrus (STG) as part of the self-monitoring process.

¹We use the International Phonetic Alphabet (IPA) to transcribe syllables throughout this chapter.
In speech perception, the auditory inputs activate categorical auditory representations in the STG/STS, which in turn lead to the retrieval of lexical representations in the lower part of the temporal lobe. Motor representations are not necessarily involved in speech perception (Scott, McGettigan, & Eisner, 2009). The traditional description of Tone 3 sandhi (33 → 23) is framed mainly from the production perspective. If Tone 3 sandhi does involve the substitution of motor representations of tones, the literature suggests that pIFG/premotor areas should be engaged, since they are responsible for the storage and sequencing of categorical motor representations (Golfinopoulos et al., 2010; Hickok & Poeppel, 2007a; Indefrey & Levelt, 2004; Price, 2010). In that case, the next question is how the discrepancy between the underlying and surface tones escapes self-monitoring. Concerning tone perception, the behavioral finding that native Mandarin speakers are prone to confuse T2 and T3 even in the monosyllable condition raises the question of whether Tone 3 sandhi, a high-level phonological rule, can modulate early auditory processing. Furthermore, the morpheme/word ambiguity resulting from the application of Tone 3 sandhi must be resolved at a later stage based on contextual information, including the following tone, word boundary, phrase structure, etc. (see Chap. 3 in this volume for a discussion of the role of linguistic context in tone perception), and we still know very little about the neural mechanism underlying this disambiguation process. These issues are discussed in the following sections.
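The canonical rule (a T3 immediately followed by another T3 surfaces as T2) can be sketched as a simple left-to-right rewriting pass over a tone string. This is a minimal illustration only: the function name is ours, and the treatment of longer runs of T3, which in real speech depends on prosodic grouping and speech rate, is deliberately simplified.

```python
def apply_tone3_sandhi(tones):
    """Apply the canonical Mandarin Tone 3 sandhi rule (33 -> 23).

    `tones` is a list of tone numbers (1-4), one per syllable.
    A Tone 3 followed by another Tone 3 surfaces as Tone 2.
    Left-to-right application; longer Tone 3 strings in real
    speech also depend on prosodic structure, ignored here.
    """
    surface = list(tones)  # keep the underlying sequence intact
    for i in range(len(surface) - 1):
        if surface[i] == 3 and surface[i + 1] == 3:
            surface[i] = 2
    return surface

# The canonical disyllabic case: 33 -> 23
print(apply_tone3_sandhi([3, 3]))  # [2, 3]
# No following Tone 3, so no sandhi applies
print(apply_tone3_sandhi([3, 2]))  # [3, 2]
```

Note that the underlying tones are left untouched; only a surface copy is rewritten, mirroring the chapter's distinction between underlying and surface forms.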


7.2 Tone 3 Sandhi in Speech Production

7.2.1 Behavioral Studies

The claim that Tone 3 sandhi is a language-specific phonological rule that replaces the underlying Tone 3 with the surface Tone 2 is supported by the finding that T3, but not T2, primed targets carrying tone sequence 33 in the lexical decision task (Chien, Sereno, & Zhang, 2016), whereas in the picture-naming task, T2 and T3 both induced a facilitation effect (Nixon, Chen, & Schiller, 2015). Chien et al. (2016) conducted an auditory-auditory priming lexical decision experiment using disyllabic word targets and legal monosyllabic primes carrying T1, T2, or T3 (e.g., /fu1/, /fu2/, and /fu3/). The prime preceded the target by 250 ms. The critical targets consisted of two Tone 3 syllables (e.g., /fu3 tao3/ 輔導). They demonstrated that T3 significantly facilitated targets carrying tone sequence 33: these targets had shorter reaction times (RTs) with the T3 prime than with the T1 prime. No facilitation effect was found for the T2 prime. These findings indicate that only the underlying T3, but not the surface T2, was involved in the lexical decision task. A similar effect has also been reported for a Taiwanese tone sandhi pair (Chien, Sereno, & Zhang, 2017). In contrast, Nixon et al. (2015) adopted a picture-naming task instead of lexical decision. The participants were asked to name a picture, and a word distractor was presented visually 0 ms or 83 ms after the picture. The target pictures had disyllabic names. The distractors were semantically and orthographically unrelated to the targets, while the phonological relationship between the picture names and the distractor words was manipulated. Experiment 1 used monosyllabic distractors (e.g., 驢、屢、綠). For picture names consisting of two T3 syllables, a facilitation effect was found for both T2 and T3 distractors.
Specifically, naming RTs were shorter in trials with T2 and T3 distractors than in trials with control (T1/T4) distractors, indicating that the production of tone sequence 33 involved the phonological representations of both T2 and T3. In Experiment 2, the first syllable of the picture names carried either T2 or T3, and the tone of the second syllable was not limited to Tone 3 (2X vs. 3X, e.g., 浮標 vs. 武器). The distractors were disyllabic words carrying tone sequence 33 (e.g., 雨傘) or control sequences (1X or 4X, e.g., 夫婦 and 噪音). Distractors carrying tone sequence 33 facilitated the naming of both 2X and 3X sequences, indicating that distractor words carrying tone sequence 33 activated the phonological representations of both T2 and T3. The effect of distractor type (Exp. 1) or target type (Exp. 2) did not interact with the onset time of the distractors. Taken together, these findings indicate that for words carrying tone sequence 33, only T3 is stored in the lexicon, while the phonological representations of T2 and T3 are both activated during the production of tone sequence 33.


7.2.2 Neuroimaging and Electrophysiological Studies

Where is Tone 3 sandhi implemented in the brain during speech production? Using functional magnetic resonance imaging (fMRI), Chang and Kuo (2016) and Chang et al. (2014) examined the production of sequences of the four Mandarin lexical tones. The participants were required to pronounce visually displayed phonetic symbols in the scanning sessions. Sixteen tonal syllables were used (four tones × four vowels /a/, /i/, /u/, and /y/). Tones in one sequence were borne by the same vowel. Larger brain activation in the right pIFG was found for Tone 3 sequences (e.g., 33 > 11, 22, 44), suggesting that the right pIFG is involved in the implementation of Tone 3 sandhi. It has been debated whether the underlying and the surface tones are both stored for words involving tone sandhi (e.g., Hsieh, 1970; Tsay & Myers, 1996) or only the underlying tone is stored and is substituted by the surface tone on-line before articulation. The brain imaging literature suggests that the phonological representations of words reside in the temporal lobe (Hickok & Poeppel, 2007a; Indefrey & Levelt, 2004), while the frontal lobe is engaged in on-line phonological processing and articulation. Therefore, the finding of higher IFG activation during sandhi tone production supports the view that Mandarin Tone 3 sandhi requires on-line tone substitution, consistent with recent behavioral studies (Chien et al., 2016; Nixon et al., 2015). One concern with this interpretation is that T3 might be physically harder to pronounce because it has the most complicated contour (falling–rising) among the four Mandarin lexical tones, at least in standard Mandarin. However, in that case, extra right IFG activation for Tone 3 should also be observed with monosyllabic stimuli. Chang et al. included both monosyllable and disyllable conditions; Tone 3 sandhi applied only under the disyllable condition.
They found higher brain activation for Tone 3 only under the disyllable condition, indicating that the higher activation in the right IFG for sequence 33 did not merely reflect the inherent physical difficulty of producing Tone 3. Because the repeated sequence 33 is pronounced as the mixed sequence 23 on the surface, another concern is that the right pIFG might be involved in the production of any mixed sequence, regardless of whether Tone 3 sandhi is applied. Mixed sequences might increase the processing load for tone retrieval and sequencing; they might also require extra computation for co-articulation and changes of pitch direction (Xu & Emily Wang, 2001; Xu & Xu, 2005). Chang et al. (2014) contrasted "genuine" mixed sequences (twelve of them, e.g., 2413) and the sandhi sequence 3333 against repeated sequences (1111, 2222, and 4444), respectively. Additional activation in the right posterior IFG was observed only for sequence 3333. Chang et al. also manipulated the requirement of an overt oral response in order to distinguish the pre-articulatory planning and motor execution stages of speech production. A higher right pIFG response to sequence 33 was observed only under the overt production condition, indicating that the application of Tone 3 sandhi depends on overt production. The implementation of Tone 3 sandhi during speech production has also been investigated with the event-related potential (ERP) technique. Zhang et al. (2015) directly


compared the production of tone sequences 23 and 33. Since both sequences were pronounced as 23 on the surface, the difference between them cannot be due to articulatory or acoustic differences and is more likely to reflect the implementation of Tone 3 sandhi. Sequence 33 elicited a larger P2 (230–320 ms) than sequence 23, consistent with the claim that Tone 3 sandhi requires additional processing. Furthermore, this effect was found under both real-word and pseudoword conditions (legal vs. illegal syllables), supporting the view that Tone 3 sandhi involves on-line computation rather than the retrieval of an alternative phonological representation of a word. One advantage of the ERP method is its higher temporal resolution. However, in this study, the participants were required to repeat the auditorily presented stimuli covertly upon hearing the second syllable and to produce them overtly upon seeing a visual cue 1000–1600 ms after the offset of the auditory stimuli. The ERPs were time-locked to the onset of the second syllable of the stimuli. Because of this experimental procedure, the study might be less informative about the time course of natural speech production. The right auditory cortex is known to be specialized for pitch perception (Jamison, Watkins, Bishop, & Matthews, 2006; Poeppel, 2003; Schönwiesner, 2005; Shtyrov, Kujala, Palva, Ilmoniemi, & Näätänen, 2000; Zatorre, 2001), and the right IFG could be recruited for tone processing through its interaction with the right auditory cortex (Kell, Morillon, Kouneiher, & Giraud, 2011; Pulvermüller, Kiff, & Shtyrov, 2012). Based on findings on pitch without linguistic function (Jamison et al., 2006; Poeppel, 2003; Schönwiesner, 2005; Shtyrov et al., 2000; Zatorre, 2001), a functional asymmetry between the left and right auditory cortices has been proposed.
Zatorre (2001) suggested that the left auditory areas have better temporal resolution, while the right auditory areas have better spectral resolution. The asymmetric sampling in time hypothesis, on the other hand, proposes that the left auditory areas extract information from short (~20–40 ms) temporal integration windows, while the right auditory areas extract information from long (~150–250 ms) integration windows (Poeppel, 2003). During speech production, the interaction between the frontal and temporal regions is necessary for self-monitoring and error correction (Guenther, Ghosh, & Tourville, 2006; Hickok, 2012; Hickok & Poeppel, 2007b), namely, for identifying discrepancies between the expected output and the auditory feedback. If the auditory feedback deviates from the expectation, the mapping between phonological representations and motor commands needs to be adjusted accordingly. Therefore, the interaction between the motor system in the frontal areas and the auditory system in the temporal areas is crucial for speech production, especially during development or when speech production is perturbed (Flagmeier et al., 2014). Since the right auditory cortex specializes in pitch perception, the right IFG might be recruited for tone processing through its interaction with the right auditory cortex via the right arcuate fasciculus. Using fMRI, Liu et al. (2006) compared the production of Mandarin tones and vowels in character-naming and pinyin-naming tasks. Both included sixteen tonal syllables (4 tones × 4 vowels: /A/, /ə/, /i/, and /u/ for the pinyin-naming task; /ʂA/, /ʂə/, /ʂɻ̩/, and /ʂu/ for the character-naming task). Higher brain activations in the right IFG for tone production than for vowel production


were found in both tasks, while higher activations for vowels than for tones were found exclusively in the left hemisphere. These findings support the view that the right IFG is more important for tone production. Further, structural and functional anomalies in the right IFG (Albouy et al., 2013; Hyde et al., 2007; Hyde, Zatorre, & Peretz, 2011), right STG (Albouy et al., 2013; Zhang, Peng, Shao, & Wang, 2017), and the right frontal–temporal pathway (Loui, Alsop, & Schlaug, 2009; Wang, Zhang, Wan, & Peng, 2017) have been reported in patients with congenital amusia (Peretz, 2013), an impairment in processing musical melody as well as lexical tone (Jiang, Hamm, Lim, Kirk, & Yang, 2012; Liu et al., 2012, 2016; Nan, Sun, & Peretz, 2010; Tillmann et al., 2011). In the case of Tone 3 sandhi, the updated phonological representation/motor command must help to generate the prediction of auditory feedback, so that the discrepancy between underlying and surface tones does not alert the self-monitoring system during speech production. This scenario is consistent with the finding of a larger right pIFG response to sequence 33 only when overt production was required (Chang & Kuo, 2016). In parallel to the findings on tone, Loui, Li, and Schlaug (2011) created a pitch-based artificial rule and found that the participants' learning performance correlated positively with the volume of the right arcuate fasciculus connecting the right IFG and the superior temporal lobe. In sum, studies of Tone 3 sandhi production suggest that its implementation requires extra on-line computation in the right IFG, which might be recruited through interaction with the right auditory cortex.
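The hemispheric asymmetry discussed above is often framed as a time–frequency trade-off: a window of duration T resolves frequency only to roughly Δf ≈ 1/T. As a back-of-envelope illustration (the 1/T rule is a standard Fourier uncertainty approximation and the specific 25 ms and 200 ms values are representative picks from the ranges cited above, not figures from any single study):

```python
# Approximate spectral resolution of an analysis window: delta_f ~ 1/T.
# 25 ms and 200 ms stand in for the ~20-40 ms and ~150-250 ms
# integration windows of the asymmetric sampling in time hypothesis.
for label, t_window in [("short window (~25 ms)", 0.025),
                        ("long window (~200 ms)", 0.200)]:
    delta_f = 1.0 / t_window
    print(f"{label}: time resolution {t_window * 1000:.0f} ms, "
          f"frequency resolution ~{delta_f:.0f} Hz")
```

Under this approximation, the short window resolves events to ~25 ms but frequencies only to ~40 Hz, whereas the long window resolves frequencies to ~5 Hz, which is consistent with the proposal that the right auditory areas are better suited to tracking pitch.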

7.3 Tone 3 Sandhi in Tone Perception

7.3.1 Behavioral Studies

Tone 3 sandhi results in a discrepancy between the pronounced and the perceived tones without alerting the attention and self-monitoring systems. That raises a naïve question: are the auditory representations of T2 and T3 less distinct from each other after the acquisition of Tone 3 sandhi? Among the six possible tone pairs in Mandarin, T2-T3 is often reported to be one of the most difficult pairs to distinguish, even for non-native speakers (Hao, 2018; Huang & Johnson, 2011; So & Best, 2014). Because non-native speakers do not know the sandhi rule, acoustic similarity is the more likely reason for their difficulty. However, several behavioral studies have demonstrated that T2 and T3 are even more similar to each other for native speakers than for non-native speakers (Chen, Liu, & Kager, 2015, 2016; Huang & Johnson, 2011). Huang and Johnson (2011) recruited both Chinese and English speakers in two Mandarin tone discrimination experiments: one used speech stimuli (the legal Mandarin monosyllable /pa/) and the other used sine-wave stimuli. All six possible tone pairs were included. With speech stimuli, Chinese speakers generally discriminated Mandarin tones faster than English speakers; they were significantly slower than the English group only


at discriminating T2 and T3. T2-T3 also elicited the longest RTs among all tone pairs in the Chinese, but not the English, group. Further, this group difference in discriminating T2 and T3 was not observed under the non-speech condition. Chen, Liu et al. (2016) recruited Dutch and Chinese speakers in a Mandarin tone discrimination experiment. Their task was to discriminate T3 from T2 and T4 from T1 under monosyllable and disyllable conditions. The monosyllable stimuli were legal Mandarin syllables; the disyllable stimuli consisted of two legal monosyllables that did not form a real word. The results showed that Dutch speakers outperformed Chinese speakers at discriminating tone sequence 33 from sequences containing T2 (23, 32, and 22) (77% accuracy for the Chinese group and 82% for the Dutch group). Such a group difference was not found under the monosyllable condition (2 vs. 3) or for the T1-T4 pair (1 vs. 4 under the monosyllable condition; 44 vs. 41, 11, and 14 under the disyllable condition). A similar result was also reported by Chen et al. (2015). The results of Huang and Johnson (2011) and Chen et al. (2015, 2016) are surprising because acquiring a language usually leads to better performance in discriminating acoustically similar but linguistically distinctive sounds. These results indicate that acoustic similarity and Tone 3 sandhi might both account for the difficulty of discriminating T2 and T3 (Hume & Johnson, 2001). Huang and Johnson (2011) used monosyllabic stimuli, implying that the confusion between T2 and T3 occurs at an early, context-independent stage of auditory processing, while Chen, Liu et al. (2016) reported lower accuracy in native than non-native speakers at discriminating T2 and T3 only under the disyllable condition, indicating that a viable context for Tone 3 sandhi is critical. The early automatic stage of auditory processing can be examined with the mismatch negativity (MMN) paradigm, which is discussed in the next section.
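The MMN oddball paradigm discussed in the next section interleaves a frequent "standard" sound with a rare "deviant." A minimal sketch of generating such a trial list; the 15% deviant probability, the no-adjacent-deviants constraint, and the function name are illustrative assumptions, not parameters from any study reviewed here.

```python
import random

def oddball_sequence(n_trials, p_deviant=0.15, seed=0):
    """Generate an oddball trial list of 'std'/'dev' labels.

    Deviants are inserted pseudo-randomly with probability p_deviant,
    never in the first two trials and never twice in a row (a common,
    but here assumed, spacing constraint).
    """
    rng = random.Random(seed)  # fixed seed for a reproducible list
    seq = []
    for i in range(n_trials):
        if i >= 2 and seq[-1] != "dev" and rng.random() < p_deviant:
            seq.append("dev")
        else:
            seq.append("std")
    return seq

trials = oddball_sequence(400)
# Proportion of deviants; slightly below p_deviant because the
# spacing constraint forces a standard after every deviant.
print(trials.count("dev") / len(trials))
```

The MMN itself is then computed from the difference waveform, ERP(deviant) minus ERP(standard), as described in the next section.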

7.3.2 Neuroimaging and Electrophysiological Studies

MMN is an ERP component often elicited using the oddball paradigm, in which a standard sound is presented with high probability and a deviant sound with low probability (Näätänen, Paavilainen, Rinne, & Alho, 2007). MMN is found around 100–300 ms after stimulus onset in the difference waveform of the deviant minus the standard and is believed to reflect the automatic detection of sound change, namely, the difference between the memory trace of the standard and the current deviant input. Phonological rules involving phoneme change, such as place assimilation (e.g., /d/ to /b/ in "bad boy"), have been reported to modulate MMN (Mitterer & Blomert, 2003; Mitterer, Csépe, Honbolygo, & Blomert, 2006; Sun et al., 2015; Tavabi, Elling, Dobel, Pantev, & Zwitserlood, 2009): MMN elicited by a phoneme change was reduced if the change could be explained by a place assimilation rule. Previous studies in Mandarin have demonstrated that the MMN elicited by the contrast between T2 and T3 is lower in amplitude and longer in peak latency than that elicited by non-sandhi tone pairs (e.g., T1-T3) (Chandrasekaran, Gandour, & Krishnan, 2007;


Chandrasekaran, Krishnan, & Gandour, 2007; Cheng et al., 2013; Li & Chen, 2015; see also Chap. 6 in this volume). Similar results have also been reported for the magnetoencephalographic counterpart of MMN (Hsu, Lin, Hsu, & Lee, 2014). Chandrasekaran, Gandour, et al. (2007) recruited both English and Chinese speakers and included three Mandarin tone pairs (T1-T3, T2-T3, and T1-T2). The Chinese group showed larger MMN amplitudes than the English group for T1-T2 and T1-T3, indicating higher sensitivity to tone differences in native speakers. For T2-T3, no language group effect was found. Further, for the Chinese group, the MMN amplitude for T2-T3 was significantly smaller than for T1-T2 and T1-T3, while no tone pair difference was found for the English group. Similar findings were reported by Chandrasekaran, Krishnan, et al. (2007), who compared T2-T3 and T1-T3. Taking MMN amplitude as an index of the dissimilarity between tones, these results indicate that T2 and T3 were more similar to each other than either was to T1 for the Chinese, but not the English, group. T1 has a flat pitch contour, while T2 and T3 both have non-flat pitch contours. Chandrasekaran, Gandour, et al. (2007) thus suggested that native speakers are more sensitive to the distinction between flat and non-flat tones, which would explain their findings. This acoustic similarity account is the most commonly held explanation for the weaker MMN elicited by the contrast between T2 and T3 (Chandrasekaran, Gandour, et al., 2007; Chandrasekaran, Krishnan, et al., 2007; Cheng et al., 2013; Hsu et al., 2014; Yu, Shafer, & Sussman, 2017). However, although T2 and T3 both have non-flat pitch contours, they differ in direction (Fig. 7.1), and previous studies have reported that Mandarin speakers are more sensitive to pitch direction than English speakers (Gandour, 1983, 1984).
In addition, acoustic similarity can hardly explain the behavioral findings that T2 and T3 are perceptually more similar for native speakers than for non-native speakers (A. Chen et al., 2015, 2016; Huang & Johnson, 2011). No matter how similar two speech sounds in a language are along a specific acoustic dimension, it is unlikely that learning the language could increase the difficulty of distinguishing them. Therefore, the alternative Tone 3 sandhi account deserves more consideration and examination (Li & Chen, 2015). The MMN response has been proposed to reflect the discrepancy between the deviant sound and the short-term memory trace of the standard sound (Näätänen et al., 2007). If the Mandarin T3 standard activates the phonological representations of both T3 and T2, then the deviant T2 may result in less discrepancy. Yet another explanation for the reduced MMN elicited by the contrast between T2 and T3 comes from underspecification theory (Archangeli, 1988). According to this theory, some phonemes are not fully represented in memory, which is why they are often replaced or assimilated by other phonemes. Reduced MMN has been reported when an underspecified vowel served as the standard sound and has been suggested to reflect less conflict at the phonological level (Cornell, Lahiri, & Eulitz, 2011; Eulitz & Lahiri, 2004; Scharinger, Monahan, & Idsardi, 2016). Politzer-Ahles et al. (2016) recruited both native and non-native Mandarin speakers. Hypothesizing that T3 is phonologically underspecified, they predicted reduced MMN in the Mandarin group when T3 served as the standard compared to when it served as the deviant. They reasoned that the phonological representation of a standard sound lasts longer than its surface features. Therefore, when an underspecified sound serves as the standard,


its phonological representation conflicts less with the incoming deviant sound. On the other hand, when an underspecified sound serves as the deviant, its acoustic features conflict with the fully specified phonological representation of the standard sound. The predicted effect was observed in Experiment 3, which included all six possible Mandarin tone pairs. However, a closer examination of all the tone pairs containing T3 (T1-T3, T2-T3, T4-T3) showed a significant asymmetry only for the T2-T3 pair: standard T3 with deviant T2 elicited a smaller MMN than standard T2 with deviant T3. In addition, such asymmetry was also reported in non-native speakers in Experiments 1 and 2, and for tone pairs without T3 in Experiment 1 (for the pair T2-T4, a smaller MMN was found when T2 served as the standard). Therefore, the interpretation of these results is not clear. As far as we know, none of the previous imaging studies of tone perception has directly compared sandhi and non-sandhi conditions. Nevertheless, studies focusing on the lateralization of tone perception help clarify the role of the right IFG. It has been suggested that speech production and perception involve similar neural circuits (D'Ausilio et al., 2009; Galantucci, Fowler, & Turvey, 2006; Meister, Wilson, Deblieck, Wu, & Iacoboni, 2007; Scott et al., 2009). Since the right IFG has been reported to be engaged in tone production (Chang & Kuo, 2016; Chang et al., 2014; Liu et al., 2006), its role in tone perception is worth examining. The fMRI study of Li et al. (2010) adopted an auditory matching task. The participants were presented with a sequence of three legal Mandarin syllables and asked to judge whether any of them matched the following monosyllabic probe, e.g., /pau1 xuən4 mu2/-/tʂʅ1/ (a yes trial in the tone-matching task). The position of the target within the trisyllabic sequence was randomly assigned, in order to increase the processing load of brain regions involved in phonological encoding and working memory.
Taking the fixed-target-position condition as the baseline, they found higher activation in the right pIFG and right inferior parietal lobule in the tone-matching task than in the consonant- or rime-matching tasks. Right IFG activation has also been reported in a tone judgment task with visually presented Chinese characters (judging whether the reading of the character carries Tone 4), with an arrow judgment task as the baseline condition (Kwok et al., 2015). These findings show that the right IFG also plays a role in tone perception tasks. It is worth noting that the finding that the right hemisphere is more important for the processing of tone than for other phonological units (Li et al., 2010; Liu et al., 2006; Luo et al., 2006) does not necessarily contradict the argument that experience with a tone language leads to more reliance on the left hemisphere and the left-lateralization of tone processing (Zatorre & Gandour, 2008; see also Chap. 5 in this volume). Increased activation in left frontal, parietal, and insular regions has been reported in studies comparing native versus non-native speakers (Gandour et al., 2003, 2000; Hsieh, Gandour, Wong, & Hutchins, 2001; Klein, Zatorre, Milner, & Zhao, 2001; Wong, Parsons, Martinez, & Diehl, 2004) and tone versus non-speech pitch (Gandour et al., 2000; Hsieh et al., 2001; Wong et al., 2004) in auditory discrimination tasks. Here we point out that such results are not incompatible with the finding of greater reliance on the right hemisphere in the processing of tone than of other phonological units, as demonstrated in Fig. 7.2.


Fig. 7.2 Hypothetical brain responses to consonant and tone in the left and right hemispheres, with and without experience in tone languages

With respect to the aim of examining the phonological processing of tone in natural speech perception, most existing neuroimaging studies suffer from the confounds of lexical processing or task-related effects. Because all legal monosyllables in Mandarin have corresponding words/morphemes, using legal monosyllabic stimuli inevitably introduces the confound of lexical processing into the contrast between native and non-native speakers and the contrast between tone and non-speech pitch (Gandour et al., 2003, 2000; Hsieh et al., 2001; Klein et al., 2001; Kwok et al., 2015; Nan & Friederici, 2013; Wong et al., 2004). Further, all active tasks introduce task-specific effects, e.g., verbal working memory and selective attention, especially when a passive listening condition is used as the baseline (Gandour et al., 2003; Hsieh et al., 2001; Wong et al., 2004), in which case task-specific components are more likely to survive baseline subtraction. Future imaging studies need to take these issues into consideration. In brief, existing behavioral and ERP evidence implies that T2 and T3 are less distinct from each other at the pre-attentive stage of auditory processing, but more research is needed to disentangle the acoustic similarity account from the Tone 3 sandhi account. In the future, how listeners overcome the discrepancy between surface and underlying tones based on contextual information at later stages of auditory processing, so as to retrieve the right word/morpheme, needs to be investigated for a deeper understanding of Tone 3 sandhi.

7.4 General Discussion

This chapter reviews our current understanding of Tone 3 sandhi, including its implementation during speech production and, regarding tone perception, whether the acquisition of Tone 3 sandhi affects the pre-attentive auditory processing of tone.


The results of existing behavioral studies support the view that the underlying T3 is stored in the lexicon (Chien et al., 2016) and that the representations of T2 and T3 are both activated for the production of T3 sequences (Nixon et al., 2015). fMRI studies of tone production (Chang & Kuo, 2016; Chang et al., 2014; Liu et al., 2006) have demonstrated that the right pIFG is involved in the processing of tone, supporting the view that Tone 3 sandhi involves on-line substitution of neural representations. Since the right auditory cortex is known to be specialized for pitch perception, the right IFG might be recruited for tone processing through its interaction with the right auditory cortex. One way to further examine the frontal–temporal interaction in tone production is to perturb the auditory feedback, which presumably increases the load on the self-monitoring system. Larger activation in the right IFG and bilateral temporal cortices (Fu et al., 2006) and increased functional connectivity in the right temporal-frontal loop (Flagmeier et al., 2014) have been reported with pitch-shifted auditory feedback in English. In the fMRI study of Fu et al. (2006), the participants were asked to pronounce visually presented real words. Their speech was lowered in pitch by 4 semitones in the distorted condition. Compared to the undistorted condition, distorted feedback elicited higher activation in bilateral temporal cortices and the right IFG. However, perturbation involving vowel change has also been reported to increase IFG activation bilaterally (Niziolek & Guenther, 2013; Zheng et al., 2013) or in the right hemisphere (Tourville, Reilly, & Guenther, 2008). Direct comparison of different types of perturbation, e.g., consonant, vowel, non-lexical pitch, and lexical tone, might help clarify whether the right IFG is more engaged in self-monitoring during tone production.
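The 4-semitone pitch shift used in the feedback-perturbation work above corresponds to a fixed frequency ratio, since each semitone scales frequency by 2^(1/12). A quick numerical check; the 200 Hz starting value is an arbitrary illustration, not a figure reported by Fu et al. (2006).

```python
# A shift of n semitones scales frequency by 2**(n/12)
# (equal-tempered semitones).
def shift_semitones(f0_hz, n_semitones):
    return f0_hz * 2 ** (n_semitones / 12)

# Lowering a 200 Hz voice by 4 semitones, as in the perturbation study:
print(round(shift_semitones(200.0, -4), 1))  # 158.7
```

That is, the perturbed feedback returns the speaker's voice at roughly 79% of its produced fundamental frequency.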
As for tone perception, Huang and Johnson (2011) and Chen, Liu, and Kager (2015, 2016) demonstrated that native speakers were slower or less accurate than non-native speakers in discriminating T2 and T3. One explanation is that acquiring Tone 3 sandhi leads to the co-activation of T2 and T3, which is consistent with the finding of reduced MMN elicited by the T2–T3 contrast in native speakers (Chandrasekaran, Gandour, et al., 2007; Chandrasekaran, Krishnan, et al., 2007; Cheng et al., 2013; Hsu et al., 2014; Li & Chen, 2015). However, in studies comparing tone pairs, the effect of Tone 3 sandhi can hardly be disentangled from that of acoustic similarity or underspecified phonological representation, since Mandarin has only four tones and six possible tone pairs. To further investigate how a language-specific phonological rule modifies auditory processing, alternative approaches include comparing participants with different language backgrounds (Chang, Lin, & Kuo, 2019) and systematically manipulating linguistic context and inter-stimulus interval (ISI). Previous behavioral studies suggested that the influence of language experience on tone perception might be context-dependent. English speakers discriminated Mandarin tones carried by sine waves (a non-linguistic context) better than Chinese speakers did (Huang & Johnson, 2011). Chen, Liu, et al. (2016) reported that Dutch speakers outperformed Chinese speakers at discriminating T2 and T3 carried by disyllabic stimuli, which provide a viable context for Mandarin Tone 3 sandhi (33 → 23). The interaction between linguistic context and phonological rule in the MMN paradigm has been studied with segments. Sun et al. (2015) examined the MMN elicited by the /f/-to-/v/ change in French. French /f/ is a voiceless sound, while /v/ is a

7 Neural Processing of Tone Sandhi in Production …

voiced one. The change from /f/ to /v/ is legal when a voiced obstruent follows /f/. Utilizing this optional but language-specific voicing assimilation rule, Sun et al. (2015) compared the ERPs elicited by the /f/-to-/v/ change in a viable (/ofbe/ → /ovbe/) and an unviable context (/ofne/ → /ovne/). The ERP analysis was time-locked to the onset of /f/ or /v/. They found MMN and P300 only for the voicing change in the context unviable for the assimilation rule, supporting the view that the representations of /f/ and /v/ were both activated when the context was viable for the rule. These results demonstrate that linguistic context modulates the effect of a phonological rule on MMN. ISI has also been proposed to influence the effect of language experience on MMN. Yu et al. (2017) manipulated ISI and suggested that a long ISI could diminish the contribution of the short-term sensory memory trace and thus reveal the processing of long-term phonological representations. Using disyllabic stimuli that differed only in the first tone, they reported that the MMN elicited by the tone change was evident in both the Chinese and the English group under the short-ISI condition, whereas under the long-ISI condition only the Chinese group showed the MMN response. These findings support the use of ISI to examine the influence of language experience and to disentangle the acoustic/phonetic and phonological stages of auditory processing. Another interesting result from Yu et al. (2017) is that, unlike previous MMN studies using monosyllabic stimuli (Chandrasekaran, Gandour, et al., 2007; Chandrasekaran, Krishnan, et al., 2007; Cheng et al., 2013; Hsu et al., 2014; Li & Chen, 2015), the contrast between T2 and T3 did not yield reduced MMN or lower discrimination accuracy, which might result from the context being unviable for Tone 3 sandhi, i.e., the tone sequence 31. Yu et al. (2017) compared the discrimination of tone sequence 31 from sequences 21 and 11.
The Chinese group showed similar accuracies (both above 90%) and outperformed the English group under both conditions. When sequence 31 served as the standard, the MMN elicited by deviant 21 was as strong as that elicited by deviant 11, at both short and long ISIs. These results are in line with the idea that a viable linguistic context is crucial for the Tone 3 sandhi effect. The role of linguistic context in the production and perception of Tone 3 sandhi awaits more systematic investigation. Furthermore, the nature of tone sandhi depends on the exact rule in question (Chien et al., 2017; Myers & Tsay, 2003; Xu, 2004; Zhang & Lai, 2010; Zhang & Liu, 2016) and varies between languages (Chen, 2000; Tsay & Myers, 1996). This chapter has focused on Mandarin Tone 3 sandhi. Whether a general neural mechanism is shared across sandhi rules and tone languages requires further testing (Chang et al., 2019; Chien et al., 2017).
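The oddball logic behind these MMN experiments is straightforward to sketch. The snippet below generates a pseudo-random standard/deviant trial list; the 15% deviant rate and the no-adjacent-deviants constraint are common conventions in passive-oddball designs, not parameters taken from any specific study cited above, and the function name is ours.

```python
import random

def oddball_sequence(n_trials, deviant_prob=0.15, seed=0):
    """Generate a standard/deviant trial list with no two deviants in a row,
    a common constraint in passive-oddball MMN designs."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_trials):
        if seq and seq[-1] == "deviant":
            seq.append("standard")  # never present two deviants back to back
        elif rng.random() < deviant_prob:
            seq.append("deviant")
        else:
            seq.append("standard")
    return seq

trials = oddball_sequence(500)
rate = trials.count("deviant") / len(trials)
```

In an actual experiment, "standard" and "deviant" would map to the two stimuli (e.g., tone sequences 31 and 21), and the ISI between trials would be the short or long interval under manipulation.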

References

Albouy, P., Mattout, J., Bouet, R., Maby, E., Sanchez, G., Aguera, P. E., & Tillmann, B. (2013). Impaired pitch perception and memory in congenital amusia: The deficit starts in the auditory cortex. Brain, 136(5), 1639–1661. https://doi.org/10.1093/brain/awt082
Anderson, S. R. (1981). Why phonology isn’t “natural.” Linguistic Inquiry, 12(4), 493–539.
Archangeli, D. (1988). Aspects of underspecification theory. Phonology, 5(2), 183–207. https://doi.org/10.1017/S0952675700002268
Blevins, J. (2006). A theoretical synopsis of evolutionary phonology. Theoretical Linguistics, 32(2), 117–166. https://doi.org/10.1515/TL.2006.009
Chandrasekaran, B., Gandour, J. T., & Krishnan, A. (2007). Neuroplasticity in the processing of pitch dimensions: A multidimensional scaling analysis of the mismatch negativity. Restorative Neurology and Neuroscience, 25(3–4), 195–210.
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128(1), 148–156. https://doi.org/10.1016/j.brainres.2006.10.064
Chang, C. Y. (2010). Dialect differences in the production and perception of Mandarin Chinese tones. The Ohio State University.
Chang, C. H. C., & Kuo, W. J. (2016). The neural substrates underlying the implementation of phonological rule in lexical tone production: An fMRI study of the tone 3 sandhi phenomenon in Mandarin Chinese. PLoS ONE. https://doi.org/10.1371/journal.pone.0159835
Chang, C. H. C., Lee, H. J., Tzeng, O. J. L., & Kuo, W.-J. (2014). Implicit target substitution and sequencing for lexical tone production in Chinese: An fMRI study. PLoS ONE, 9(1). https://doi.org/10.1371/journal.pone.0083126
Chang, C. H. C., Lin, T. H., & Kuo, W. J. (2019). Does phonological rule of tone substitution modulate mismatch negativity? Journal of Neurolinguistics, 51, 63–75. https://doi.org/10.1016/j.jneuroling.2019.01.001
Chao, Y. R. (1948). Mandarin primer. Harvard University Press. https://doi.org/10.4159/harvard.9780674732889
Chen, A., Liu, L., & Kager, R. (2015). Cross-linguistic perception of Mandarin tone sandhi. Language Sciences, 48, 62–69. https://doi.org/10.1016/j.langsci.2014.12.002
Chen, A., Liu, L., & Kager, R. (2016). Cross-domain correlation in pitch perception, the influence of native language. Language, Cognition and Neuroscience, 31(6), 751–760. https://doi.org/10.1080/23273798.2016.1156715
Chen, M. Y. (2000). Tone sandhi: Patterns across Chinese dialects. Cambridge University Press.
Chen, N. F., Wee, D., Tong, R., Ma, B., & Li, H. (2016). Large-scale characterization of non-native Mandarin Chinese spoken by speakers of European origin: Analysis on iCALL. Speech Communication, 84, 46–56. https://doi.org/10.1016/j.specom.2016.07.005
Cheng, Y.-Y., Wu, H.-C., Tzeng, Y.-L., Yang, M.-T., Zhao, L.-L., & Lee, C.-Y. (2013). The development of mismatch responses to Mandarin lexical tones in early infancy. Developmental Neuropsychology, 38(5), 281–300. https://doi.org/10.1080/87565641.2013.799672
Chien, Y.-F., Sereno, J. A., & Zhang, J. (2016). Priming the representation of Mandarin tone 3 sandhi words. Language, Cognition and Neuroscience, 31(2), 179–189. https://doi.org/10.1080/23273798.2015.1064976
Chien, Y.-F., Sereno, J. A., & Zhang, J. (2017). What’s in a word: Observing the contribution of underlying and surface representations. Language and Speech, 60(4), 643–657. https://doi.org/10.1177/0023830917690419
Cornell, S. A., Lahiri, A., & Eulitz, C. (2011). “What you encode is not necessarily what you store”: Evidence for sparse feature representations from mismatch negativity. Brain Research, 1394, 79–89. https://doi.org/10.1016/j.brainres.2011.04.001

D’Ausilio, A., Pulvermüller, F., Salmas, P., Bufalari, I., Begliomini, C., & Fadiga, L. (2009). The motor somatotopy of speech perception. Current Biology, 19(5), 381–385. https://doi.org/10.1016/j.cub.2009.01.017
Eulitz, C., & Lahiri, A. (2004). Neurobiological evidence for abstract phonological representations in the mental lexicon during speech recognition. Journal of Cognitive Neuroscience, 16, 577–583. https://doi.org/10.1162/089892904323057308
Flagmeier, S. G., Ray, K. L., Parkinson, A. L., Li, K., Vargas, R., Price, L. R., & Robin, D. A. (2014). The neural changes in connectivity of the voice network during voice pitch perturbation. Brain and Language, 132, 7–13. https://doi.org/10.1016/j.bandl.2014.02.001
Fu, C. H. Y., Vythelingum, G. N., Brammer, M. J., Williams, S. C. R., Amaro, E., Andrew, C. M., & McGuire, P. K. (2006). An fMRI study of verbal self-monitoring: Neural correlates of auditory verbal feedback. Cerebral Cortex, 16(7), 969–977. https://doi.org/10.1093/cercor/bhj039
Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13(3), 361–377. https://doi.org/10.3758/BF03193857
Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11(2), 149–175.
Gandour, J. T. (1984). Tone dissimilarity judgments by Chinese listeners. Journal of Chinese Linguistics, 12(2), 235–261.
Gandour, J. T., Dzemidzic, M., Wong, D., Lowe, M., Tong, Y., Hsieh, L., & Lurito, J. (2003). Temporal integration of speech prosody is shaped by language experience: An fMRI study. Brain and Language, 84(3), 318–336. https://doi.org/10.1016/S0093-934X(02)00505-9
Gandour, J. T., Wong, D., Hsieh, L., Weinzapfel, B., Lancker, D. V., & Hutchins, G. D. (2000). A crosslinguistic PET study of tone perception. Journal of Cognitive Neuroscience, 12(1), 207–222. https://doi.org/10.1162/089892900561841
Golfinopoulos, E., Tourville, J. A., & Guenther, F. H. (2010). The integration of large-scale neural network modeling and functional brain imaging in speech motor control. NeuroImage, 52(3), 862–874. https://doi.org/10.1016/j.neuroimage.2009.10.023
Guenther, F. H., Ghosh, S. S., & Tourville, J. A. (2006). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, 96(3), 280–301. https://doi.org/10.1016/j.bandl.2005.06.001
Hao, Y.-C. (2018). Second language perception of Mandarin vowels and tones. Language and Speech, 61(1), 135–152. https://doi.org/10.1177/0023830917717759
Hickok, G. (2012). Computational neuroanatomy of speech production. Nature Reviews Neuroscience, 13(2), 135–145. https://doi.org/10.1038/nrn3158
Hickok, G., & Poeppel, D. (2007a). The cortical organisation for speech processing. Nature Reviews Neuroscience, 8, 393–402.
Hickok, G., & Poeppel, D. (2007b). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402. https://doi.org/10.1038/nrn2113
Hsieh, H.-I. (1970). The psychological reality of tone sandhi rules in Taiwanese. In Papers from the 6th Annual Regional Meeting of the Chicago Linguistic Society (pp. 489–503).
Hsieh, L., Gandour, J. T., Wong, D., & Hutchins, G. D. (2001). Functional heterogeneity of inferior frontal gyrus is shaped by linguistic experience. Brain and Language, 76(3), 227–252. https://doi.org/10.1006/brln.2000.2382
Hsu, C. H., Lin, S. K., Hsu, Y. Y., & Lee, C. Y. (2014). The neural generators of the mismatch responses to Mandarin lexical tones: An MEG study. Brain Research, 1582, 154–166. https://doi.org/10.1016/j.brainres.2014.07.023
Huang, T., & Johnson, K. (2011). Language specificity in speech perception: Perception of Mandarin tones by native and nonnative listeners. Phonetica, 67(4), 243–267. https://doi.org/10.1159/000327392
Hume, E., & Johnson, K. (2001). A model of the interplay of speech perception and phonology. Studies on the Interplay of Speech Perception and Phonology, 55, 1–22.

Hyde, K. L., Lerch, J. P., Zatorre, R. J., Griffiths, T. D., Evans, A. C., & Peretz, I. (2007). Cortical thickness in congenital amusia: When less is better than more. Journal of Neuroscience, 27(47), 13028–13032. https://doi.org/10.1523/JNEUROSCI.3039-07.2007
Hyde, K. L., Zatorre, R. J., & Peretz, I. (2011). Functional MRI evidence of an abnormal neural network for pitch processing in congenital amusia. Cerebral Cortex, 21(2), 292–299. https://doi.org/10.1093/cercor/bhq094
Indefrey, P., & Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. Cognition, 92(1–2), 101–144. https://doi.org/10.1016/j.cognition.2002.06.001
Jamison, H. L., Watkins, K. E., Bishop, D. V. M., & Matthews, P. M. (2006). Hemispheric specialization for processing auditory nonspeech stimuli. Cerebral Cortex, 16(9), 1266–1275. https://doi.org/10.1093/cercor/bhj068
Jiang, C., Hamm, J. P., Lim, V. K., Kirk, I. J., & Yang, Y. (2012). Impaired categorical perception of lexical tones in Mandarin-speaking congenital amusics. Memory & Cognition, 40(7), 1109–1121. https://doi.org/10.3758/s13421-012-0208-2
Kell, C. A., Morillon, B., Kouneiher, F., & Giraud, A. L. (2011). Lateralization of speech production starts in sensory cortices—A possible sensory origin of cerebral left dominance for speech. Cerebral Cortex, 21(4), 932–937. https://doi.org/10.1093/cercor/bhq167
Klein, D., Zatorre, R. J., Milner, B., & Zhao, V. (2001). A cross-linguistic PET study of tone perception in Mandarin Chinese and English speakers. NeuroImage, 13(4), 646–653. https://doi.org/10.1006/nimg.2000.0738
Kwok, V. P. Y., Wang, T., Chen, S., Yakpo, K., Zhu, L., Fox, P. T., & Tan, L.-H. (2015). Neural signatures of lexical tone reading. Human Brain Mapping, 36(1), 304–312. https://doi.org/10.1002/hbm.22629
Li, A., Xiong, Z., & Wang, X. (2006). Contrastive study on tonal patterns between accented and standard Chinese. In Proceedings of the 5th International Symposium on Chinese Spoken Language Processing (pp. 157–168).
Li, X., & Chen, Y. (2015). Representation and processing of lexical tone and tonal variants: Evidence from the mismatch negativity. PLoS ONE, 10(12), 1–24. https://doi.org/10.1371/journal.pone.0143097
Li, X., Gandour, J. T., Talavage, T., Wong, D., Hoffa, A., Lowe, M., & Dzemidzic, M. (2010). Hemispheric asymmetries in phonological processing of tones versus segmental units. NeuroReport, 21(10), 690–694. https://doi.org/10.1097/WNR.0b013e32833b0a10
Liu, F., Chan, A. H. D., Ciocca, V., Roquet, C., Peretz, I., & Wong, P. C. M. (2016). Pitch perception and production in congenital amusia: Evidence from Cantonese speakers. The Journal of the Acoustical Society of America, 140(1), 563. https://doi.org/10.1121/1.4955182
Liu, F., Jiang, C., Thompson, W. F., Xu, Y., Yang, Y., & Stewart, L. (2012). The mechanism of speech processing in congenital amusia: Evidence from Mandarin speakers. PLoS ONE, 7(2), e30374. https://doi.org/10.1371/journal.pone.0030374
Liu, L., Peng, D., Ding, G., Jin, Z., Zhang, L., Li, K., & Chen, C. (2006). Dissociation in the neural basis underlying Chinese tone and vowel production. NeuroImage, 29(2), 515–523. https://doi.org/10.1016/j.neuroimage.2005.07.046
Loui, P., Alsop, D., & Schlaug, G. (2009). Tone deafness: A new disconnection syndrome? Journal of Neuroscience, 29(33), 10215–10220. https://doi.org/10.1523/JNEUROSCI.1701-09.2009
Loui, P., Li, H. C., & Schlaug, G. (2011). White matter integrity in right hemisphere predicts pitch-related grammar learning. NeuroImage, 55(2), 500–507. https://doi.org/10.1016/j.neuroimage.2010.12.022
Luo, H., Ni, J.-T., Li, Z.-H., Li, X.-O., Zhang, D.-R., Zeng, F.-G., & Chen, L. (2006). Opposite patterns of hemisphere dominance for early auditory processing of lexical tones and consonants. Proceedings of the National Academy of Sciences, 103(51), 19558–19563. https://doi.org/10.1073/pnas.0607065104

Meister, I. G., Wilson, S. M., Deblieck, C., Wu, A. D., & Iacoboni, M. (2007). The essential role of premotor cortex in speech perception. Current Biology, 17(19), 1692–1696. https://doi.org/10.1016/j.cub.2007.08.064
Mitterer, H., & Blomert, L. (2003). Coping with phonological assimilation in speech perception: Evidence for early compensation. Perception & Psychophysics, 65(6), 956–969. https://doi.org/10.3758/BF03194826
Mitterer, H., Csépe, V., Honbolygo, F., & Blomert, L. (2006). The recognition of phonologically assimilated words does not depend on specific language experience. Cognitive Science, 30(3), 451–479. https://doi.org/10.1207/s15516709cog0000_57
Myers, J., & Tsay, J. (2003). Investigating the phonetics of Mandarin tone sandhi. Taiwan Journal of Linguistics, 1(1), 29–68. https://doi.org/10.6519/TJL.2003.1(1).2
Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology, 118(12), 2544–2590. https://doi.org/10.1016/j.clinph.2007.04.026
Nan, Y., & Friederici, A. D. (2013). Differential roles of right temporal cortex and Broca’s area in pitch processing: Evidence from music and Mandarin. Human Brain Mapping, 34(9), 2045–2054. https://doi.org/10.1002/hbm.22046
Nan, Y., Sun, Y., & Peretz, I. (2010). Congenital amusia in speakers of a tone language: Association with lexical tone agnosia. Brain, 133(9), 2635–2642. https://doi.org/10.1093/brain/awq178
Nixon, J. S., Chen, Y., & Schiller, N. O. (2015). Multi-level processing of phonetic variants in speech production and visual word processing: Evidence from Mandarin lexical tones. Language, Cognition and Neuroscience, 30(5), 491–505. https://doi.org/10.1080/23273798.2014.942326
Niziolek, C. A., & Guenther, F. H. (2013). Vowel category boundaries enhance cortical and behavioral responses to speech feedback alterations. Journal of Neuroscience, 33(29), 12090–12098. https://doi.org/10.1523/JNEUROSCI.1008-13.2013
Ohala, J. J. (1993). Coarticulation and phonology. Language and Speech, 36. https://doi.org/10.1177/002383099303600303
Peng, S.-H. (2000). Lexical versus “phonological” representations of Mandarin sandhi tones. In M. B. Broe & J. B. Pierrehumbert (Eds.), Papers in laboratory phonology 5: Acquisition and the lexicon (1st ed., pp. 152–167). Cambridge University Press.
Peretz, I. (2013). The biological foundations of music: Insights from congenital amusia. In The psychology of music (3rd ed.). Elsevier. https://doi.org/10.1016/B978-0-12-381460-9.00013-4
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as “asymmetric sampling in time.” Speech Communication, 41(1), 245–255. https://doi.org/10.1016/S0167-6393(02)00107-3
Politzer-Ahles, S., Schluter, K., Wu, K., & Almeida, D. (2016). Asymmetries in the perception of Mandarin tones: Evidence from mismatch negativity. Journal of Experimental Psychology: Human Perception and Performance, 42(10), 1547–1570. https://doi.org/10.1037/xhp0000242
Price, C. J. (2010). The anatomy of language: A review of 100 fMRI studies published in 2009. Annals of the New York Academy of Sciences, 1191(1), 62–88. https://doi.org/10.1111/j.1749-6632.2010.05444.x
Pulvermüller, F., Kiff, J., & Shtyrov, Y. (2012). Can language-action links explain language laterality? An ERP study of perceptual and articulatory learning of novel pseudowords. Cortex, 48(7), 871–881. https://doi.org/10.1016/j.cortex.2011.02.006
Scharinger, M., Monahan, P. J., & Idsardi, W. J. (2016). Linguistic category structure influences early auditory processing: Converging evidence from mismatch responses and cortical oscillations. NeuroImage, 128, 293–301. https://doi.org/10.1016/j.neuroimage.2016.01.003
Schönwiesner, M. (2005). Hemispheric asymmetry for spectral and temporal processing in the human antero-lateral auditory belt cortex. European Journal of Neuroscience, 22, 1521–1528. https://doi.org/10.1111/j.1460-9568.2005.04315.x

Scott, S. K., McGettigan, C., & Eisner, F. (2009). A little more conversation, a little less action: Candidate roles for the motor cortex in speech perception. Nature Reviews Neuroscience, 10(4), 295–302. https://doi.org/10.1038/nrn2603
Shtyrov, Y., Kujala, T., Palva, S., Ilmoniemi, R. J., & Näätänen, R. (2000). Discrimination of speech and of complex nonspeech sounds of different temporal structure in the left and right cerebral hemispheres. NeuroImage, 12(6), 657–663. https://doi.org/10.1006/nimg.2000.0646
So, C. K., & Best, C. T. (2014). Phonetic influences on English and French listeners’ assimilation of Mandarin tones to native prosodic categories. Studies in Second Language Acquisition, 36(2), 195–221. https://doi.org/10.1017/S0272263114000047
Speer, S. R., Shih, C.-L., & Slowiaczek, M. L. (1989). Prosodic structure in language understanding: Evidence from tone sandhi in Mandarin. Language and Speech, 32(4), 337–354. https://doi.org/10.1177/002383098903200403
Sun, Y., Giavazzi, M., Adda-Decker, M., Barbosa, L. S., Kouider, S., Bachoud-Lévi, A. C., & Peperkamp, S. (2015). Complex linguistic rules modulate early auditory brain responses. Brain and Language, 149, 55–65. https://doi.org/10.1016/j.bandl.2015.06.009
Tavabi, K., Elling, L., Dobel, C., Pantev, C., & Zwitserlood, P. (2009). Effects of place of articulation changes on auditory neural activity: A magnetoencephalography study. PLoS ONE, 4(2). https://doi.org/10.1371/journal.pone.0004452
Tillmann, B., Burnham, D., Nguyen, S., Grimault, N., Gosselin, N., & Peretz, I. (2011). Congenital amusia (or tone-deafness) interferes with pitch processing in tone languages. Frontiers in Psychology, 2, 120. https://doi.org/10.3389/fpsyg.2011.00120
Tourville, J. A., Reilly, K. J., & Guenther, F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. NeuroImage, 39(3), 1429–1443. https://doi.org/10.1016/j.neuroimage.2007.09.054
Tsay, J., & Myers, J. (1996). Taiwanese tone sandhi as allomorph selection. In Proceedings of the Annual Meeting of the Berkeley Linguistics Society.
Wang, J., Zhang, C., Wan, S., & Peng, G. (2017). Is congenital amusia a disconnection syndrome? A study combining tract- and network-based analysis. Frontiers in Human Neuroscience, 11, 1–11. https://doi.org/10.3389/fnhum.2017.00473
Wang, W. S.-Y., & Li, K.-P. (1967). Tone 3 in Pekinese. Journal of Speech and Hearing Research, 10(3), 629–636.
Wong, P. C. M., Parsons, L. M., Martinez, M., & Diehl, R. L. (2004). The role of the insular cortex in pitch pattern perception: The effect of linguistic contexts. Journal of Neuroscience, 24(41), 9153–9160. https://doi.org/10.1523/JNEUROSCI.2225-04.2004
Xu, Y. (2004). Understanding tone from the perspective of production and perception. Language and Linguistics, 5(4), 757–797.
Xu, Y., & Emily Wang, Q. (2001). Pitch targets and their realization: Evidence from Mandarin Chinese. Speech Communication, 33(4), 319–337. https://doi.org/10.1016/S0167-6393(00)00063-7
Xu, Y., & Xu, C. X. (2005). Phonetic realization of focus in English declarative intonation. Journal of Phonetics, 33(2), 159–197. https://doi.org/10.1016/j.wocn.2004.11.001
Yu, Y. H., Shafer, V. L., & Sussman, E. S. (2017). Neurophysiological and behavioral responses of Mandarin lexical tone processing. Frontiers in Neuroscience, 11, 95. https://doi.org/10.3389/fnins.2017.00095
Zatorre, R. J. (2001). Spectral and temporal processing in human auditory cortex. Cerebral Cortex, 11(10), 946–953. https://doi.org/10.1093/cercor/11.10.946
Zatorre, R. J., & Gandour, J. T. (2008). Neural specializations for speech and pitch: Moving beyond the dichotomies. Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493), 1087–1104. https://doi.org/10.1098/rstb.2007.2161

Zhang, C., Peng, G., Shao, J., & Wang, W. S.-Y. (2017). Neural bases of congenital amusia in tonal language speakers. Neuropsychologia, 97, 18–28. https://doi.org/10.1016/j.neuropsychologia.2017.01.033
Zhang, C., Xia, Q., & Peng, G. (2015). Mandarin third tone sandhi requires more effortful phonological encoding in speech production: Evidence from an ERP study. Journal of Neurolinguistics, 33, 149–162. https://doi.org/10.1016/j.jneuroling.2014.07.002
Zhang, J., & Lai, Y. (2010). Testing the role of phonetic knowledge in Mandarin tone sandhi. Phonology, 27(1), 153. https://doi.org/10.1017/S0952675710000060
Zhang, J., & Liu, J. (2016). The productivity of variable disyllabic tone sandhi in Tianjin Chinese. Journal of East Asian Linguistics, 25(1), 1–35. https://doi.org/10.1007/s10831-015-9135-0
Zheng, Z. Z., Vicente-Grabovetsky, A., MacDonald, E. N., Munhall, K. G., Cusack, R., & Johnsrude, I. S. (2013). Multivoxel patterns reveal functionally differentiated networks underlying auditory feedback processing of speech. Journal of Neuroscience, 33(10), 4339–4348. https://doi.org/10.1523/JNEUROSCI.6319-11.2013

Part III

Domain-General Transfer and Cross-Modal Integration

Chapter 8

The Effect of Musical Experience and Congenital Amusia on Lexical Tone Perception, Production, and Learning: A Review

Jia Hoong Ong, Shen Hui Tan, Alice H. D. Chan, and Francis C. K. Wong

Abstract  Adults who are naïve to tone languages show large variability in their ability to perceive, produce, and learn lexical tones, the building blocks of tone languages such as Mandarin and Cantonese. This review focuses on examining this variability from a musical perspective by reviewing behavioural and neuroimaging studies that compare listeners with extensive musical training, listeners with musical disorders such as amusia, and naïve/control listeners. Such a comparison allows us to determine whether there are any cross-domain transfer effects, and if so, which aspects of lexical tone perception, production, and learning are affected and why such facilitation or hindrance may occur. Understanding this will not only deepen our understanding of the commonalities and differences between language and music but also have implications for tone language learning. The review concludes with several future directions for this area of research.

J. H. Ong · S. H. Tan · A. H. D. Chan · F. C. K. Wong (B)
Linguistics and Multilingual Studies, School of Humanities, Nanyang Technological University, Singapore, Singapore
e-mail: [email protected]
J. H. Ong
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_8

8.1 Introduction

Consider some of the challenges faced by native English-speaking adults learning Mandarin for the first time. One such challenge relates to the building blocks of the language: Mandarin, a tone language, uses consonants, vowels, and pitch (so-called lexical tones) to signal a change in meaning. This is in contrast to English, a non-tone language, in which pitch does not contrast lexical meaning but instead operates over an utterance for supra-lexical purposes (e.g., contrasting a statement and a question with falling and rising pitch contours, respectively) and for emotional expression (e.g., using a flat, monotonous pitch contour to indicate sadness). Given these differences, it is understandable that native speakers of a non-tone language

(hereafter ‘non-tone language speakers’) may find it challenging to perceive, produce, and learn tone languages. Yet some seem to excel while others struggle. What could explain such variation in performance? Two factors identified in research are musical: musical experience and musical disorder, both of which will be examined in this review. As these factors belong to a different domain (music), researchers term such influence ‘cross-domain transfer’, with positive and negative transfer indicating a carried-over benefit and disadvantage, respectively. The chapter is organized as follows: first, we briefly describe lexical tones and why non-tone language speakers have difficulty with them. We then review behavioural and cognitive neuroscience studies investigating how musical experience and a musical disorder termed congenital amusia might affect the perception, production, and learning of lexical tones. Finally, we discuss possible explanations for cross-domain transfer and conclude the chapter by suggesting future directions for the field.

8.2 Lexical Tones

Many languages use pitch to signal a change in lexical meaning. For example, in Mandarin, /ma/ spoken with a high-level tone (Tone 1; 妈) means ‘mother’, whereas spoken with a falling tone (Tone 4; 骂) it means ‘to scold’. These two words are primarily differentiated by pitch, the psychological correlate of the signal’s fundamental frequency (F0). Secondary acoustic cues such as amplitude, voice quality, and duration may also be used to differentiate lexical tones (Liu & Samuel, 2004). Much as different languages have different phonological inventories, tone languages differ in their tonal inventories. Some tone languages, such as Yoruba, a West African language, use only level tones (e.g., low, mid, and high). Others use a combination of level and dynamic tones: the Mandarin tonal inventory consists of a high-level tone and three dynamic tones—rising, dipping, and falling (Maddieson, 1984). Successful learning of a tone language partly requires one to be adept at attending to the relevant pitch dimensions (e.g., pitch height, pitch direction, pitch onset, average pitch) that are important in the target language. Native speakers of a tone language (hereafter ‘tone language speakers’) do indeed show such sensitivity. For example, Cantonese speakers attend to pitch height more than Mandarin speakers do, presumably because level tones are contrastive in Cantonese but not in Mandarin (Gandour, 1984). This suggests that specific tone language experience affects one’s cue weighting in processing lexical tones (e.g., Chandrasekaran, Krishnan, & Gandour, 2007; Krishnan, Gandour, Xu, & Suresh, 2017). It follows, then, that part of the difficulty non-tone language speakers face with lexical tones could be due to the use of ineffective pitch dimensions to characterize the tonal inventory being learned (Burnham & Francis, 1997; Cabrera et al., 2015).
Non-tone language speakers appear to attend more to universal or psychoacoustic pitch dimensions such as average pitch and duration (Gandour & Harshman,

8 The Effect of Musical Experience and Congenital Amusia on Lexical …


1978), which may not sufficiently differentiate lexical tones in a tone language. Interestingly, non-tone language speakers do not have difficulty perceiving lexical tones from birth. Numerous studies have demonstrated that infants are born with the ability to discriminate both native and non-native speech sounds (Kuhl, 2004). However, this ability narrows to the sounds that are contrastive in their native language by the end of the first year of life, a phenomenon termed 'perceptual attunement'. Previous studies have found that perceptual attunement occurs first for vowels, at around 6 months of age (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992), for consonants between 10 and 12 months of age (Werker & Tees, 1984), and for lexical tones between 6 and 9 months of age (Liu & Kager, 2014; Mattock & Burnham, 2006; Mattock, Molnar, Polka, & Burnham, 2008; Werker, Yeung, & Yoshida, 2012). Thus, non-tone language speakers become poorer at detecting lexical tones (and other non-native speech sounds) as they gain more experience with their native language.

The numerous findings on perceptual attunement, while well described, do not explain how the phenomenon develops (Tsao & Liu, Chap. 9, this volume). One current proposal is that perceptual attunement is a consequence of learning native phonological categories, which is said to be achieved via distributional learning (Werker et al., 2012; Tsao & Liu, Chap. 9, this volume). Distributional learning refers to a learning mechanism by which learners keep track of the speech sounds in their linguistic environment and form phonological categories based on the most frequently occurring ones (Escudero & Williams, 2014; Maye, Werker, & Gerken, 2002; Ong, Burnham, & Escudero, 2016). As infants form phonological categories, they may learn which features (e.g., duration, pitch height, etc.) are contrastive and important in their language.
This may explain why, for example, Cantonese speakers attend to pitch height in lexical tones more so than Mandarin speakers, since there are three level tones in Cantonese and only one in Mandarin (Gandour, 1984). Similarly, English speakers have difficulty differentiating Mandarin Tone 2 (rising tone) and Tone 3 (dipping tone; Wang, Spence, Jongman, & Sereno, 1999), presumably because they did not learn to attend to the relevant pitch dimension (in this case, pitch direction).
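The distributional-learning account sketched above lends itself to a toy illustration. The sketch below is our own and not a model from the cited studies; the bin width, the token values, and the function name are all illustrative assumptions. It simply tracks token frequencies along a single acoustic dimension (say, pitch onset) and posits one category per mode of the distribution:

```python
# Toy sketch of distributional learning (illustrative only): a learner
# tallies tokens along one acoustic dimension and posits a category at
# each local peak (mode) of the resulting frequency distribution.
from collections import Counter

def category_peaks(tokens, bin_width=1.0):
    """Bin the tokens and return the local maxima of the histogram."""
    bins = Counter(round(t / bin_width) for t in tokens)
    peaks = []
    for b, count in sorted(bins.items()):
        if count > bins.get(b - 1, 0) and count >= bins.get(b + 1, 0):
            peaks.append(b * bin_width)
    return peaks

# A bimodal exposure distribution (two clusters of hypothetical
# pitch-onset values) yields two categories; a unimodal one yields one.
bimodal = [1, 1, 2, 2, 2, 3, 3, 7, 7, 8, 8, 8, 9, 9]
unimodal = [4, 5, 5, 5, 6, 5, 4, 6, 5, 5]
```

On this caricature, the bimodal exposure produces two category peaks whereas the unimodal exposure produces one; this bimodal-versus-unimodal contrast is the distributional signature manipulated in studies such as Maye et al. (2002).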

8.3 The Effect of Musical Experience

In this section, we review studies that investigated the relationship between musical experience, in the form of extensive musical training, and the perception, production, and learning of lexical tones. While the primary focus is on non-tone language speakers, in each subsection we also consider whether musical training confers an additional benefit on tone language speakers, by comparing tone language musicians with tone language speakers who have no musical training.


J. H. Ong et al.

8.3.1 Perception

Given that speech and music share pitch as a fundamental acoustic dimension, a considerable amount of literature has been devoted to examining whether and how pitch processing mechanisms may be shared between the two domains. Such investigations have largely converged in indicating a facilitating effect of musical training on the perception of linguistic pitch. At the sentential level, French musicians detected small prosodic pitch incongruities better than non-musicians, as reflected in lower error rates and shorter onset latencies of their event-related potentials in response to sentence-final pitch violations in their native language (Schön, Magne, & Besson, 2004) as well as in a non-native language (Marques, Moreno, Castro, & Besson, 2007). This musical advantage in detecting prosodic pitch violations was also seen among musically trained French children using the same experimental paradigm (Magne, Schön, & Besson, 2006). At the lexical level, English-speaking musicians were found to be faster and more accurate than non-musicians in discriminating and identifying non-native Mandarin Chinese tones (Alexander, Wong, & Bradlow, 2005) and in discriminating non-native Thai tones (Burnham, Brooker, & Reid, 2014). Such a musical advantage was also observed among non-tone language speakers when lexical tones were spoken by multiple speakers and carried limited acoustic information (Lee & Hung, 2008), and when the tones were embedded in different contexts such as speech, filtered speech, and violin analogues (Burnham et al., 2014). These findings qualify the advantage by suggesting that it is neither language-specific nor merely acoustic, given that musicians were able to generalize lexical pitch patterns across different tone languages, different speakers, and different contexts (speech, non-speech; degraded acoustic stimuli).
Neural evidence further bolsters these behavioural findings of an advantage of musical training in lexical tone perception: English-speaking musicians had more robust and faithful brainstem encoding of Mandarin Chinese tones than their non-musician counterparts (Bidelman, Gandour, & Krishnan, 2011; Wong, Skoe, Russo, Dees, & Kraus, 2007), and they also exhibited greater sensitivity to non-speech forms of Mandarin Chinese tone contours, as measured using mismatch negativity responses in a passive oddball paradigm (Chandrasekaran, Krishnan, & Gandour, 2009). Furthermore, musicians' enhancements in the brainstem pitch tracking of non-native Mandarin Chinese tones have been found to correlate with their age of onset and years of musical training (Wong et al., 2007), lending additional support to the facilitating effects of musical training on lexical tone perception. The existing literature is thus in consensus, both behaviourally and neurally, that long-term musical experience enhances sensitivity to prosodic and lexical pitch. In particular, it supports the notion that the attenuated sensitivity to non-native lexical tones that accompanies perceptual attunement can be mitigated by long-term musical experience. Whereas it is clear that musical experience benefits non-tone language speakers in perceiving lexical tones, studies examining whether the benefit extends to tone


language musicians have been equivocal. Some reported that tone language musicians showed behavioural performance similar to that of tone language non-musicians (Mok & Zuo, 2012; Tang, Xiong, Zhang, Dong, & Nan, 2016; Zhao & Kuhl, 2015b; Zheng & Samuel, 2018). Others, particularly those that measured performance using neural methods such as electroencephalography (EEG), revealed some differences, such as enhanced event-related potentials in tone language musicians (Nan et al., 2018; Tang et al., 2016; but see Maggu, Wong, Antoniou et al., 2018a, 2018b). Furthermore, tone language musicians showed increased sensitivity to within-category tones, that is, tones that are perceived to belong to the same category (e.g., Tone 2) but are nonetheless slightly different acoustically (Wu et al., 2015). Taken together, this suggests that musical experience benefits tone language speakers modestly by enhancing their sensitivity to subtle pitch differences, which may only be evident with sensitive measures.

8.3.2 Production

Compared to perception, fewer studies have investigated the influence of musical experience on lexical tone production. Of the few that did, the general consensus is that those with musical training produce lexical tones that are rated as better by native speakers than those produced by speakers without musical training (Gottfried & Xu, 2008; Kirkham, Lu, Wayland, & Kaan, 2011; Schwanhäußer, 2007). However, the source of this musician advantage is unclear. Schwanhäußer (2007) reported that inherent musical aptitude, rather than duration of musical training, predicted lexical tone production among Australian English speakers. Complementing this, musical tone ability as measured using three musical aptitude tasks correlated with lexical tone production among non-tone language non-musicians (Li & DeKeyser, 2017). It should be noted that musical aptitude and duration of musical training are themselves highly correlated, and thus there is likely a complex interaction among musical aptitude, training duration, and tone production, which merits further research.

8.3.3 Learning

There are generally two types of lexical tone learning in the literature: learning to perceive (discriminate and/or identify) lexical tone categories, and learning sound-meaning mappings (e.g., learning the associations between lexical tones and objects). Concerning the former, among non-tone language speakers there appears to be no additional musical advantage: musicians and non-musicians improve after training to the same degree (Wayland, Herrera, & Kaan, 2010; Zhao & Kuhl, 2015a). On the other hand, musicians tend to show greater learning than non-musicians in sound-meaning mapping (Cooper & Wang, 2012; Dittinger et al., 2016; Wong & Perrachione, 2007). Subsequent research has revealed that both musicality (as


measured using musical aptitude tasks and a background questionnaire) and pitch ability (as measured using pitch acuity and pitch memory tasks) predict sound-meaning mapping performance (Bowles, Chang, & Karuzis, 2016), which suggests that the musical advantage seen in sound-meaning mapping is due to both perceptual factors and general cognitive factors such as memory. As for tone language speakers, there appears to be no musical advantage in learning to identify non-native lexical tones (Cooper & Wang, 2010) or in sound-meaning mapping (Cooper & Wang, 2012; Maggu, Wong, Liu, & Wong, 2018). This implies that musical experience does not confer any additional benefit on tone language speakers in learning lexical tones.

8.4 The Effect of Congenital Amusia

Congenital amusia, commonly known as tone deafness, is a lifelong neurological disorder that affects approximately 4–5% of the population (Kalmus & Fry, 1980). It is important to note that congenital amusia does not carry the same connotation as 'tone deafness' in everyday usage: whereas a lay person might be quick to label themselves as 'tone deaf' because they cannot sing well, they are likely still able to differentiate two notes that are close in pitch (e.g., C and C#). Amusics, on the other hand, have difficulty processing musical pitch and have a larger pitch discrimination threshold; that is, they require a larger pitch change in order to detect a difference in pitch, on average on the order of 2 semitones, the difference between the notes C and D (Foxton, Dean, Gee, Peretz, & Griffiths, 2004). Thus, amusics tend to do poorly on tasks involving melodic perception, such as detecting an anomalous note in a melody, but their rhythmic perception is relatively intact (Hyde & Peretz, 2004), suggesting that their impairment is restricted to pitch. The question, then, is whether their pitch processing deficit is specific to music (domain-specific) or extends beyond it (domain-general).
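To put this roughly 2-semitone threshold in acoustic terms, the standard equal-temperament relation between a pitch interval of n semitones and a frequency ratio (a textbook conversion, not taken from the studies reviewed in this chapter) is:

```latex
f_2 = f_1 \cdot 2^{\,n/12}, \qquad
\left.\frac{f_2}{f_1}\right|_{n=2} = 2^{2/12} \approx 1.122
```

That is, an average amusic needs roughly a 12% change in F0 before a pitch difference is reliably detected; starting from A4 at 440 Hz, the pitch would have to move to about 494 Hz (approximately the note B4).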

8.4.1 Perception

One of the earliest studies investigating the domain specificity of the pitch processing deficit among amusics found that amusics were not impaired in their discrimination of linguistic pitch (Ayotte, Peretz, & Hyde, 2002). In one of the tasks, French amusics were presented with pairs of spoken sentences that were either the same in intonation (both questions or both statements) or different (a statement and a question). Importantly, the only difference between each pair was the pitch contour of the final word, which had either a falling contour for statements ('He speaks French.') or a rising contour for questions ('He speaks French?'). Amusics and controls performed equally well at discriminating the intonations. The same group of participants also performed a similar task with non-speech analogues of the same spoken sentences. The non-speech analogues were formed by stripping the linguistic


content of the sentences and replacing the overall pitch contour with discrete tones. On these non-speech analogues, amusics performed worse than controls. Together, the results suggest that the pitch deficit seen in amusics does not extend to intonation, at least not when linguistic information is present. Another study extending the findings of Ayotte et al. (2002) found that the difficulty amusics faced in the non-speech condition was not due to the use of discrete tones; the amusics were similarly impaired when gliding tones were used as non-speech analogues (Patel, Foxton, & Griffiths, 2005). However, subsequent studies found that non-tone language amusics are impaired in perceiving intonation relative to controls (Hamann, Exter, Pfeifer, & Krause-Burmester, 2012; Hutchins, Gosselin, & Peretz, 2010; Liu, Patel, Fourcin, & Stewart, 2010; Patel, Wong, Foxton, Lochy, & Peretz, 2008). British English amusics, for example, were less successful than controls at discriminating the intonations of statements and questions presented in three different contexts: gliding tone sequences, natural speech, and nonsense speech (Liu et al., 2010), suggesting a general pitch deficit in amusia. The inconsistent findings may be partly attributed to the choice of stimuli. The final pitch glides in the spoken sentences of Ayotte et al. (2002) were considerably larger than those in Liu et al. (2010), and thus the glides may have been above the amusics' pitch threshold. While this possibility was not examined directly in Ayotte et al. (2002), some evidence for it was found in other studies. For example, in Liu et al. (2010), amusics' performance in discriminating intonations was negatively correlated with their pitch direction threshold (i.e., the smallest pitch change needed to detect a change in glide direction), such that the smaller their threshold, the better their discrimination.
Moreover, using a continuum ranging from a statement to a question differing only in the pitch contour of the final word, Hutchins et al. (2010) found that amusics needed a larger pitch difference than controls to classify an utterance as a question. Thus, under laboratory conditions where the stimuli can be properly controlled and manipulated, non-tone language amusics do show a deficit in linguistic pitch perception. As with linguistic pitch at the sentential level, non-tone language amusics are also impaired in perceiving lexical tones relative to controls (Nguyen, Tillmann, Gosselin, & Peretz, 2009; Tillmann, Burnham et al., 2011; Tillmann, Rusconi et al., 2011). It should be noted that there is large variation in amusics' performance, with most performing within the range of the control group. When the amusics were further divided into those with a relatively higher pitch height threshold (i.e., the smallest pitch change needed to detect a change in pitch height) and those with a relatively lower one, Tillmann, Burnham et al. (2011) found that the overall performance of the latter was better. Furthermore, some differences across stimulus types were observed between the two groups: whereas those with lower thresholds showed no difference in performance between the verbal stimuli (Thai lexical tones) and the musical stimuli (musical analogues of Thai lexical tones), those with higher thresholds performed better on the verbal than on the musical stimuli, mirroring the results of Ayotte et al. (2002). These findings suggest that large individual differences exist among non-tone language amusics in their ability to perceive lexical tones, which may be partly attributed to their pitch threshold.


Relative to non-tone language amusics, tone language amusics tend to show an attenuated pitch processing impairment (Wong et al., 2012). In a large-scale study with over 400 participants from Hong Kong and over 150 participants from Canada, a smaller percentage of the Hong Kong participants met the definition of 'amusic', namely scoring more than two standard deviations below the mean Global Score of the Montreal Battery of Evaluation of Amusia (MBEA), with the cutoff calculated for the two groups of participants separately. This is despite the fact that the average MBEA Global Score was higher for the Hong Kong participants. Thus, all other things being equal, tone language experience may protect one from musical pitch processing impairment. Furthermore, some suggest that tone language amusics may show different neural deficits than non-tone language amusics (Zhang, Peng, Shao, & Wang, 2017). Whereas previous studies implicated a lack of activation in the right inferior frontal gyrus (IFG) among non-tone language amusics relative to their non-amusic counterparts when listening to musical sequences (Hyde, Zatorre, & Peretz, 2011), tone language amusics show activation patterns in the right IFG similar to those of tone language non-amusics (Zhang et al., 2017). However, because tone language experience and amusia have not been compared directly within a single study, these differences in activation patterns could be task- or stimulus-related: the musical sequences in Hyde et al. (2011) were longer and more melody-like than the musical pitch intervals used in Zhang et al. (2017). These findings thus suggest that tone language experience provides a slight 'buffer' against the pitch processing impairment seen among amusics, but the exact reason for this is unclear. Holding tone language experience constant, tone language amusics are disadvantaged in perceiving lexical tones relative to tone language non-amusics.
This is particularly the case when they have to discriminate between pairs of lexical tones that are minimally contrastive (i.e., differ only in their pitch contour; Jiang, Hamm, Lim, Kirk, & Yang, 2012; Liu et al., 2016; Liu, Jiang et al., 2012), whereas less of a group difference is observed in identification tasks (Jiang et al., 2012; Liu, Jiang et al., 2012). These differences could be partly attributed to task demands: whereas discrimination requires one to make an acoustic judgment, identification relies more on long-term phonetic representations. In addition, Mandarin-speaking amusics require a larger threshold to identify pitch direction for discrete pitches than for gliding pitches, whereas Mandarin-speaking non-amusics show similar thresholds for both (Liu, Xu, Patel, Francart, & Jiang, 2012). These findings suggest that tone language amusics are disadvantaged in lexical tone perception, and that this disadvantage is most evident in tasks that require fine-grained pitch processing. Neuroimaging studies have found that the disadvantage faced by tone language amusics may reside in the later stages of pitch processing. No group differences are observed during the early stages, such as during brainstem encoding of lexical tones (Liu, Maggu, Lau, & Wong, 2015) or when lexical tones are processed preattentively (Zhang & Shao, 2018). At the later stages, neural differences emerge: tone language amusics tend to show smaller event-related potentials that are said to reflect attentive processing (e.g., P3a and P3b; Zhang & Shao, 2018), and they do not show activation in the right superior temporal gyrus (where the primary auditory cortex resides) when listening to pairs of lexical tones,


unlike the control group (Zhang et al., 2017). However, it is worth noting that not all tone language amusics are impaired in lexical tone perception. Indeed, some researchers explicitly divide tone language amusics into two subgroups: those with difficulty with lexical tones and those without (Huang, Liu, Dong, & Nan, 2015; Huang, Nan, Dong, & Liu, 2015; Nan, Huang, Wang, Liu, & Dong, 2016; Nan, Sun, & Peretz, 2010). It is still unclear how and why these subgroups differ, and whether tone language amusics with a deficit in lexical tone perception may also suffer from additional impairments that have not yet been identified.

8.4.2 Production

Whereas amusics have been reported to sing poorly compared to controls (Ayotte et al., 2002), little is known about whether non-tone language amusics may be similarly impaired in their production of linguistic pitch. We are aware of only two studies that investigated this, with conflicting findings (Hutchins & Peretz, 2012; Liu et al., 2010). The production task was similar in both studies: participants heard a sentence and had to imitate it as best they could. The types of sentences, however, were different. In Hutchins and Peretz (2012), participants had to imitate neutral sentences and shifted versions of the same sentences in which one of the syllables had its pitch shifted, much like a narrow focus (e.g., 'He took the green bag.' vs. 'He took the GREEN bag.'). In Liu et al. (2010), participants had to imitate the same sentences as statements and as questions (e.g., 'He took the green bag.' vs. 'He took the green bag?'). The amusics in Hutchins and Peretz (2012) produced the shifted sentences as well as the control group did, whereas those in Liu et al. (2010) were overall less accurate in producing the correct glides (rising for questions and falling for statements). The reason for the inconsistent findings is unclear; it may relate to differences in the strategies used to perceive and produce narrow-focus sentences versus statements/questions. However, one should be cautious in interpreting these findings until more data are available. Tone language amusics, by contrast, do not seem to have difficulty producing intelligible lexical tones relative to tone language non-amusics, as rated by native speakers (Liu et al., 2016; Yang, Feng, Huang, Zhang, & Nan, 2014). However, some subtle differences have been observed via acoustic analysis of the production data.
For example, Mandarin-speaking amusics were worse than controls in Mandarin speech imitation in terms of absolute and relative pitch matching (Liu et al., 2013). Nonetheless, Mandarin-speaking amusics and non-amusics were able to produce non-native tones to the same degree when analysed acoustically (Wang & Peng, 2014). These findings suggest that tone language amusics have relatively intact lexical tone production but there may be some subtle impairments relative to non-amusics, which may be modulated by the phonological status of the tones being produced.


8.4.3 Learning

We are not aware of any studies that have directly investigated whether non-tone language amusics can learn linguistic pitch or lexical tones. One study investigated whether Chinese amusics (a mix of Mandarin and Cantonese amusics) could improve their pitch direction threshold after 10 sessions of training (Liu, Jiang, Francart, Chan, & Wong, 2017). At post-test, trained amusics showed significant improvement in their pitch direction threshold compared to untrained amusics, and by the end of training the trained amusics had a pitch direction threshold similar to that of non-amusics. However, this improvement in pitch direction threshold did not translate into improvement on the MBEA melodic tests. Nonetheless, the encouraging findings from this study suggest that amusics can improve on certain aspects of pitch processing following auditory training.

8.5 Explanations for Cross-Domain Transfer

In this section, we review several explanations for the positive and negative cross-domain transfer seen in the studies above. Note, however, that some of the explanations apply only to one type of transfer. Note also that these explanations are not necessarily mutually exclusive, and it is likely that the cross-domain transfer observed in the studies reviewed above is due to a combination of them (Moreno & Bidelman, 2014). One explanation relates to how sensitive listeners are to pitch. That is, the positive transfer between musical experience and lexical tones may be due to musicians having a 'sharper' ear, given their extensive experience perceiving subtle pitch (and other acoustic) differences, leading them to better encode the auditory signal. The 'sharper' ear may result from neural plasticity induced by extensive musical training (Herholz & Zatorre, 2012). Specifically, it has been suggested that the improved auditory processing in musicians may be due to strengthening of the corticofugal pathway, a top-down feedback connection from the cortex to the brainstem (Kraus & Chandrasekaran, 2010; Strait & Kraus, 2011). In other words, musicians learn to guide their attention to relevant and meaningful features in the acoustic signal, leading to better encoding of the signal and a more elaborate percept (Besson, Chobert, & Marie, 2011; Musacchia, Sams, Skoe, & Kraus, 2007). Another proposal for the robust encoding comes from a learning perspective. Similar to the previous proposal, this account argues that cross-domain transfer may result from listeners' ability to guide their attention to meaningful acoustic cues such as pitch as a consequence of learning the basic units of a system, such as phonological categories in speech or pitch categories in music (Ong, Burnham, Stevens, & Escudero, 2016).
That is, as a result of learning to differentiate subtle pitch categories, musicians may become more aware of pitch in general.


Such top-down guiding of attention is complemented by studies that have shown context-dependent performance in perceiving pitch. Studies have shown that the same pitch information may be processed differently depending on what the information means to the speaker. For example, processing of lexical tones appears to be lateralized to the left hemisphere among native speakers, but not among non-native speakers regardless of tone language experience (Van Lancker & Fromkin, 1973; Wang, Behne, Jongman, & Sereno, 2004). Furthermore, non-tone language speakers show differential discrimination performance on the same pitch contour presented either as Thai lexical tones or as violin notes, with performance generally better for non-speech stimuli (Burnham et al., 2014). Thus, there appears to be top-down linguistic influence in perceiving lexical tones, depending on how meaningful the pitch information is to the speaker. Another explanation for cross-domain transfer of pitch lies in the possibility of shared neural resources and mechanisms for pitch processing. Due to the commonalities of acoustic cues shared between music and speech (e.g., pitch, duration, etc.), speakers may draw on similar brain areas and neural resources to process the signal (Besson et al., 2011). Behavioural data suggests that intonation and melodic contour processing are correlated among English speakers, suggesting a shared underlying processing mechanism of pitch (Perrachione, Fedorenko, Vinke, Gibson, & Dilley, 2013). The positive transfer of pitch seen among musicians, then, may be due to more efficient employment of such resources to process pitch. Similarly, amusics experience negative transfer of pitch presumably because they do not engage these resources efficiently. Finally, another possibility for the observed cross-domain transfer of pitch is due to cognitive improvement (in the case of positive transfer) or impairment (in the case of negative transfer). 
Concerning cognitive improvement, various studies have shown that those with extensive musical training show enhanced cognitive functions such as auditory memory and attention (Strait & Kraus, 2011), executive functions (Degé, Kubicek, & Schwarzer, 2011; Schroeder, Marian, Shook, & Bartolotti, 2016), and verbal working memory (Talamini, Carretti, & Grassi, 2016). These domain-general cognitive enhancements may thus spill over to speech processing, including linguistic pitch and lexical tones: for example, in a discrimination task, musicians may be better able to remember the stimuli and consequently better able to form judgments. Conversely, amusics appear to have worse pitch memory than controls (Gosselin, Jolicœur, & Peretz, 2009; Jiang, Lim, Wang, & Hamm, 2013; Tillmann, Lévêque, Fornoni, Albouy, & Caclin, 2016), which may partly explain their poorer performance on pitch tasks. Interestingly, amusics showed pitch memory comparable to that of the control group when mild electrical stimulation, in the form of transcranial alternating current stimulation (tACS), was applied to a brain region implicated in pitch memory, the right dorsolateral prefrontal cortex (Schaal, Pfeifer, Krause, & Pollok, 2015). So far, the discussion has been on how extensive musical training may lead to a positive transfer from music to speech. To explain why musical training may drive such transfer, Patel (2011, 2012, 2014) proposed the OPERA hypothesis, in which he argues that the sensory and cognitive processing involved in music-to-speech


transfer is enhanced in musicians when five criteria are met: (i) there is Overlap in the anatomical structures subserving language and music; (ii) Precision is emphasized in musical training to a higher degree than in language; (iii) the Emotion elicited by the musical training is positive; (iv) there is Repetition of the musical activity; and (v) Attention is focused throughout the musical activity. While some preliminary studies support aspects of the hypothesis (Patel, 2014), none, to our knowledge, has systematically manipulated each criterion to confirm it.

8.6 Future Directions

While the studies reviewed above have certainly increased our understanding of how musical factors influence lexical tone perception, production, and learning, there remain open questions for future research, some of which are briefly outlined in this section. From the sections above, it is evident that certain areas are understudied (e.g., musical experience and lexical tone production, amusia and lexical tone learning). Further research in these areas will help shed light on the extent of the influence of musical factors on lexical tones, which, as revealed above, is not a blanket improvement or impairment. Similarly, while the OPERA hypothesis provides a comprehensive proposal for why musical training leads to positive transfer, it remains to be seen whether the hypothesis is supported and, if so, whether all the criteria listed carry the same importance. So far in this review, we have been vague about how musicians are defined, partly because there is no consensus in the literature. The definition of a musician is a complex issue, as it is unclear which factors should be considered: the age at which formal musical training started? The duration of formal musical training? What if multiple instruments were learned over a single period of time? These issues are also faced in other fields, such as research on bilinguals (Surrain & Luk, 2017). Until there is consensus on which dimensions should be considered, future studies should include both self-report (e.g., a questionnaire on musical background) and objective measures (e.g., performance on standardized music tests) to be more comprehensive in their definition of musicianship. Furthermore, future studies should consider using an individual differences approach to investigate their research questions.
For example, instead of placing participants into arbitrary groups, each participant should have their ‘musicianship’ measured (using a combination of self-report and objective scores) and that score should be used to predict lexical tone performance. Similarly, there is no standard definition of amusia in the literature. The current practice is to use the Montreal Battery of Evaluation of Amusia (MBEA; Peretz, Champod, & Hyde, 2003), which consists of subtests evaluating one’s musical abilities involving pitch and rhythm. Amusics, then, are defined by a cutoff score (e.g., scoring more than 2 standard deviations below the group mean) or defined as those
that performed statistically worse than the control group on the MBEA. Particularly with the use of a cutoff score, some critics have argued that this method leads to an overdiagnosis of amusia (Henry & McAuley, 2010). Furthermore, studies are inconsistent about which score the cutoff should be based on: whereas some researchers use the global MBEA score (i.e., the composite score across all subtests), others use only the pitch-based subtest scores. To mitigate this issue, a protocol to standardize the administration of the MBEA was recently proposed (Vuvan et al., 2018). Moreover, it has been argued that the MBEA should be used as a screening tool, to be followed up by other examinations, rather than as a diagnostic tool. Thus, future studies should be mindful of, and detail exactly, how amusics are defined, and should supplement the MBEA cutoff score with other measures, such as a questionnaire on musical history or neural measures. These issues matter because the current practice of defining amusia may not be sensitive enough to differentiate the heterogeneous pitch-processing deficits seen among amusics, as reviewed in the previous section.

It should also be noted that while musicians and amusics appear to be on different ends of the spectrum in terms of cross-domain transfer, this does not imply that the underlying causes of the transfer (positive for musicians and negative for amusics) are the same. Indeed, the underlying causes are likely different for the two groups, since there are discrepancies in how their perception, production, and learning of lexical tones are affected (e.g., whereas musicians show enhanced production of linguistic pitch relative to non-musicians, amusics appear to have spared linguistic pitch production).
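To make the cutoff practice discussed above concrete, the sketch below classifies a participant as amusic when an MBEA global score falls more than 2 standard deviations below a control-group mean. The scores are invented for illustration; they are not MBEA norms.

```python
# Sketch of the cutoff-based definition of amusia discussed above:
# a participant counts as "amusic" if their MBEA global score falls more
# than 2 standard deviations below the control-group mean.
# All scores here are hypothetical.
from statistics import mean, stdev

control_scores = [27, 28, 26, 29, 27, 28, 30, 27]  # hypothetical MBEA global scores
cutoff = mean(control_scores) - 2 * stdev(control_scores)

def classify(score, cutoff):
    """Label a participant relative to the 2-SD cutoff."""
    return "amusic" if score < cutoff else "non-amusic"

print(classify(21, cutoff))  # well below the cutoff -> amusic
print(classify(27, cutoff))  # within the control range -> non-amusic
```

Note how sensitive this rule is to the control sample: a tighter control distribution raises the cutoff and, as the critics cited above argue, can inflate the apparent prevalence of amusia.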
The issue is further complicated when we consider how tone language experience may modulate these two musical factors in perceiving, producing, and learning lexical tones. Future research should systematically investigate the underlying cause(s) of such transfer using both behavioural and neural measures and with the complete set of groups (e.g., non-tone language non-musicians, non-tone language musicians, tone language non-musicians, and tone language musicians). Because tone language background and amusia status have rarely been compared directly, it is unclear whether the source of negative transfer is the same for tone language and non-tone language amusics. Since non-tone language amusics' performance with lexical tones appears to depend on their pitch acuity, their source of transfer may be more perceptual/sensory. On the other hand, tone language amusics' performance with lexical tones appears comparable to that of tone language non-amusics at the early stages of pitch processing, suggesting that their source of negative transfer may be more cognitive. Further studies are needed here, as the answer would have implications for interventions for speakers with different language backgrounds.
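The "complete set of groups" suggested above amounts to a 2 × 2 between-subjects design crossing language background with musician status; a trivial sketch (the group labels are ours, for illustration):

```python
# Sketch of the full factorial design suggested above: crossing
# tone-language background with musician status yields four groups.
from itertools import product

language = ["non-tone language", "tone language"]
musicianship = ["non-musician", "musician"]

groups = [f"{lang} {mus}" for lang, mus in product(language, musicianship)]
for g in groups:
    print(g)
```

Only with all four cells can the effects of musical experience and tone-language experience, and their interaction, be separated.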
8.7 Conclusion

In this chapter, we reviewed studies investigating the influence of two musical factors (musical experience and musical disorder) on lexical tone perception, production, and learning. Musical experience in the form of musical training provides non-tone language speakers with an advantage in perceiving, producing, and, to some extent, learning lexical tones, but it provides no additional benefit beyond that conferred by tone language experience for tone language speakers. Congenital amusia, a musical disorder, appears to negatively affect one's perception of lexical tones but not one's production. Preliminary evidence shows that amusics may benefit from auditory training to improve their pitch perception, suggesting that intervention may be possible. Various perceptual and cognitive explanations for cross-domain transfer were also reviewed, and future studies were suggested to address the outstanding questions and gaps in the field. From a broader perspective, understanding the relationship between musical factors and lexical tones will help reveal the commonalities and differences between language and music, with potential implications for education, neuroscience, and clinical science.

Acknowledgements The research was supported by Ministry of Education Tier 1 (159/14) and Tier 2 (MOE2015-T2-1-120) grants awarded to A.H.D. Chan, and a Ministry of Education Tier 1 grant (RG72/17) awarded to F.C.K. Wong.

References

Alexander, J. A., Wong, P. C. M., & Bradlow, A. R. (2005). Lexical tone perception in musicians and non-musicians. Interspeech 2005 (pp. 397–400). Lisbon: ISCA Archive.
Ayotte, J., Peretz, I., & Hyde, K. (2002). Congenital amusia: A group study of adults afflicted with a music-specific disorder. Brain, 125, 238–251. Retrieved from https://www.ncbi.nlm.nih.gov/pubmed/11844725
Besson, M., Chobert, J., & Marie, C. (2011). Transfer of training between music and speech: Common processing, attention, and memory. Frontiers in Psychology, 2(94). https://doi.org/10.3389/fpsyg.2011.00094
Bidelman, G. M., Gandour, J. T., & Krishnan, A. (2011). Cross-domain effects of music and language experience on the representation of pitch in the human auditory brainstem. Journal of Cognitive Neuroscience, 23(2), 425–434. https://doi.org/10.1162/jocn.2009.21362
Bowles, A. R., Chang, C. B., & Karuzis, V. P. (2016). Pitch ability as an aptitude for tone learning. Language Learning, 66(4), 774–808. https://doi.org/10.1111/lang.12159
Burnham, D., Brooker, R., & Reid, A. (2014). The effects of absolute pitch ability and musical training on lexical tone perception. Psychology of Music, 1–17. https://doi.org/10.1177/0305735614546359
Burnham, D., & Francis, E. (1997). The role of linguistic experience in the perception of Thai tones. In A. S. Abramson (Ed.), Southeast Asian linguistic studies in honour of Vichin Panupong (pp. 29–47). Chulalongkorn University Press.
Cabrera, L., Tsao, F.-M., Liu, H.-M., Li, L.-Y., Hu, Y.-H., Lorenzi, C., & Bertoncini, J. (2015). The perception of speech modulation cues in lexical tones is guided by early language-specific experience. Frontiers in Psychology, 6(1290), 1–14. https://doi.org/10.3389/fpsyg.2015.01290
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128(1), 148–156. https://doi.org/10.1016/j.brainres.2006.10.064
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2009). Relative influence of musical and linguistic experience on early cortical processing of pitch contours. Brain and Language, 108(1), 1–9. https://doi.org/10.1016/j.bandl.2008.02.001
Cooper, A., & Wang, Y. (2010). The role of musical experience in Cantonese lexical tone perception by native speakers of Thai. In Speech Prosody 2010. Chicago, IL: ISCA Archive.
Cooper, A., & Wang, Y. (2012). The influence of linguistic and musical experience on Cantonese word learning. The Journal of the Acoustical Society of America, 131, 4756.
Degé, F., Kubicek, C., & Schwarzer, G. (2011). Music lessons and intelligence: A relation mediated by executive function. Music Perception, 29(2), 195–201.
Dittinger, E., Barbaroux, M., D'Imperio, M., Jäncke, L., Elmer, S., & Besson, M. (2016). Professional music training and novel word learning: From faster semantic encoding to longer-lasting word representations. Journal of Cognitive Neuroscience, 28(10), 1584–1602. https://doi.org/10.1162/jocn_a_00997
Escudero, P., & Williams, D. (2014). Distributional learning has immediate and long-lasting effects. Cognition, 133(2), 408–413. https://doi.org/10.1016/j.cognition.2014.07.002
Foxton, J. M., Dean, J. L., Gee, R., Peretz, I., & Griffiths, T. D. (2004). Characterization of deficits in pitch perception underlying "tone deafness." Brain, 127(4), 801–810. https://doi.org/10.1093/brain/awh105
Gandour, J. T. (1984). Tone dissimilarity judgments by Chinese listeners. Journal of Chinese Linguistics, 12(2), 235–261.
Gandour, J. T., & Harshman, R. A. (1978). Crosslanguage differences in tone perception: A multidimensional scaling investigation. Language and Speech, 21(1), 1–33. https://doi.org/10.1177/002383097802100101
Gosselin, N., Jolicœur, P., & Peretz, I. (2009). Impaired memory for pitch in congenital amusia. Annals of the New York Academy of Sciences, 1169, 270–272.
Gottfried, T. L., & Xu, Y. (2008). Effect of musical experience on Mandarin tone and vowel discrimination and imitation. The Journal of the Acoustical Society of America, 123(5), 3887. https://doi.org/10.1121/1.2935823
Hamann, S., Exter, M., Pfeifer, J., & Krause-Burmester, M. (2012). Perceiving differences in linguistic and non-linguistic pitch: A pilot study with German congenital amusics. In F. Cambouropoulos, C. Tsougras, P. Mavromatis, & K. Pastiadis (Eds.), 12th International Conference on Music Perception and Cognition and 8th Triennial Conference of the European Society for the Cognitive Sciences of Music (pp. 398–405). Thessaloniki, Greece: Aristotle University of Thessaloniki. Retrieved from https://pure.uva.nl/ws/files/2083099/140539_Hamann_et_al._2012_.pdf
Henry, M. J., & McAuley, J. D. (2010). On the prevalence of congenital amusia. Music Perception, 27(5), 413–418.
Herholz, S. C., & Zatorre, R. J. (2012). Musical training as a framework for brain plasticity: Behavior, function, and structure. Neuron, 76(3), 486–502. https://doi.org/10.1016/j.neuron.2012.10.011
Huang, W.-T., Liu, C., Dong, Q., & Nan, Y. (2015). Categorical perception of lexical tones in Mandarin-speaking congenital amusics. Frontiers in Psychology, 6(829), 1–9.
Huang, W.-T., Nan, Y., Dong, Q., & Liu, C. (2015). Just-noticeable difference of tone pitch contour change for Mandarin congenital amusics. The Journal of the Acoustical Society of America, 138(1), EL99–EL104. https://doi.org/10.1121/1.4923268
Hutchins, S., Gosselin, N., & Peretz, I. (2010). Identification of changes along a continuum of speech intonation is impaired in congenital amusia. Frontiers in Psychology, 1(236), 1–8. https://doi.org/10.3389/fpsyg.2010.00236
Hutchins, S., & Peretz, I. (2012). Amusics can imitate what they cannot discriminate. Brain and Language, 123(3), 234–239. https://doi.org/10.1016/j.bandl.2012.09.011
Hyde, K. L., & Peretz, I. (2004). Brains that are out of tune but in time. Psychological Science, 15(5), 356–360.
Hyde, K. L., Zatorre, R. J., & Peretz, I. (2011). Functional MRI evidence of an abnormal neural network for pitch processing in congenital amusia. Cerebral Cortex, 21(2), 292–299. https://doi.org/10.1093/cercor/bhq094
Jiang, C., Hamm, J. P., Lim, V. K., Kirk, I. J., & Yang, Y. (2012). Impaired categorical perception of lexical tones in Mandarin-speaking congenital amusics. Memory and Cognition, 40(7), 1109–1121. https://doi.org/10.3758/s13421-012-0208-2
Jiang, C., Lim, V. K., Wang, H., & Hamm, J. P. (2013). Difficulties with pitch discrimination influences pitch memory performance: Evidence from congenital amusia. PLoS ONE, 8(10), 1–14. https://doi.org/10.1371/journal.pone.0079216
Kalmus, H., & Fry, D. B. (1980). On tune deafness (dysmelodia): Frequency, development, genetics and musical background. Annals of Human Genetics, 43(4), 369–382.
Kirkham, J., Lu, S., Wayland, R., & Kaan, E. (2011). Comparison of vocalists and instrumentalists on lexical tone perception and production tasks. In 17th International Congress of Phonetic Sciences (ICPhS XVII) (pp. 1098–1101).
Kraus, N., & Chandrasekaran, B. (2010). Music training for the development of auditory skills. Nature Reviews Neuroscience, 11(8), 599–605.
Krishnan, A., Gandour, J. T., Xu, Y., & Suresh, C. H. (2017). Language-dependent changes in pitch-relevant neural activity in the auditory cortex reflect differential weighting of temporal attributes of pitch contours. Journal of Neurolinguistics, 41, 38–49. https://doi.org/10.1016/j.jneuroling.2016.09.005
Kuhl, P. K. (2004). Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience, 5(11), 831–843. https://doi.org/10.1038/nrn1533
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608.
Lee, C.-Y., & Hung, T.-H. (2008). Identification of Mandarin tones by English-speaking musicians and nonmusicians. The Journal of the Acoustical Society of America, 124(5), 3235–3248. https://doi.org/10.1121/1.2990713
Li, M., & Dekeyser, R. (2017). Perception practice, production practice, and musical ability in L2 Mandarin tone-word learning. Studies in Second Language Acquisition, 39(4), 593–620. https://doi.org/10.1017/S0272263116000358
Liu, F., Chan, A. H. D., Ciocca, V., Roquet, C., Peretz, I., & Wong, P. C. M. (2016). Pitch perception and production in congenital amusia: Evidence from Cantonese speakers. The Journal of the Acoustical Society of America, 140(1), 563–575. https://doi.org/10.1121/1.4955182
Liu, F., Jiang, C., Francart, T., Chan, A. H. D., & Wong, P. C. M. (2017). Perceptual learning of pitch direction in congenital amusia. Music Perception, 34(3), 335–351. https://doi.org/10.1525/mp.2017.34.3.335
Liu, F., Jiang, C., Pfordresher, P. Q., Mantell, J. T., Xu, Y., Yang, Y., & Stewart, L. (2013). Individuals with congenital amusia imitate pitches more accurately in singing than in speaking: Implications for music and language processing. Attention, Perception & Psychophysics. https://doi.org/10.3758/s13414-013-0506-1
Liu, F., Jiang, C., Thompson, W. F., Xu, Y., Yang, Y., & Stewart, L. (2012). The mechanism of speech processing in congenital amusia: Evidence from Mandarin speakers. PLoS ONE, 7(2), e30374.
Liu, F., Maggu, A. R., Lau, J. C. Y., & Wong, P. C. M. (2015). Brainstem encoding of speech and musical stimuli in congenital amusia: Evidence from Cantonese speakers. Frontiers in Human Neuroscience, 8(January), 1029. https://doi.org/10.3389/fnhum.2014.01029
Liu, F., Patel, A. D., Fourcin, A., & Stewart, L. (2010). Intonation processing in congenital amusia: Discrimination, identification and imitation. Brain, 133(6), 1682–1693. https://doi.org/10.1093/brain/awq089
Liu, F., Xu, Y., Patel, A. D., Francart, T., & Jiang, C. (2012). Differential recognition of pitch patterns in discrete and gliding stimuli in congenital amusia: Evidence from Mandarin speakers. Brain and Cognition, 79(3), 209–215. https://doi.org/10.1016/j.bandc.2012.03.008
Liu, L., & Kager, R. (2014). Perception of tones by infants learning a non-tone language. Cognition, 133(2), 385–394. https://doi.org/10.1016/j.cognition.2014.06.004
Liu, S., & Samuel, A. G. (2004). Perception of Mandarin lexical tones when F0 information is neutralized. Language and Speech, 47(2), 109–138. https://doi.org/10.1177/00238309040470020101
Maddieson, I. (1984). Patterns of sounds. New York: Cambridge University Press.
Maggu, A. R., Wong, P. C. M., Antoniou, M., Bones, O., Liu, H., & Wong, F. C. K. (2018). Effects of combination of linguistic and musical pitch experience on subcortical pitch encoding. Journal of Neurolinguistics, 47, 145–155. https://doi.org/10.1016/j.jneuroling.2018.05.003
Maggu, A. R., Wong, P. C. M., Liu, H., & Wong, F. C. K. (2018). Experience-dependent influence of music and language on lexical pitch learning is not additive. In B. Yegnanarayana, C. Chandra Sekhar, S. Narayanan, S. Umesh, S. R. M. Prasanna, H. A. Murthy, & P. K. Ghosh (Eds.), Interspeech 2018 (pp. 3791–3794). Hyderabad, India: International Speech Communication Association (ISCA). https://doi.org/10.21437/Interspeech.2018-2104
Magne, C., Schön, D., & Besson, M. (2006). Musician children detect pitch violations in both music and language better than nonmusician children: Behavioral and electrophysiological approaches. Journal of Cognitive Neuroscience, 18(2), 199–211. https://doi.org/10.1162/089892906775783660
Marques, C., Moreno, S., Castro, S. L., & Besson, M. (2007). Musicians detect pitch violation in a foreign language better than nonmusicians: Behavioral and electrophysiological evidence. Journal of Cognitive Neuroscience, 19(9), 1453–1463. https://doi.org/10.1162/jocn.2007.19.9.1453
Mattock, K., & Burnham, D. (2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265.
Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106(3), 1367–1381. https://doi.org/10.1016/j.cognition.2007.07.002
Maye, J., Werker, J. F., & Gerken, L. (2002). Infant sensitivity to distributional information can affect phonetic discrimination. Cognition, 82(3), B101–B111.
Mok, P. P. K., & Zuo, D. (2012). The separation between music and speech: Evidence from the perception of Cantonese tones. The Journal of the Acoustical Society of America, 132(4), 2711–2720. https://doi.org/10.1121/1.4747010
Moreno, S., & Bidelman, G. M. (2014). Examining neural plasticity and cognitive benefit through the unique lens of musical training. Hearing Research, 308, 84–97. https://doi.org/10.1016/j.heares.2013.09.012
Musacchia, G., Sams, M., Skoe, E., & Kraus, N. (2007). Musicians have enhanced subcortical auditory and audiovisual processing of speech and music. Proceedings of the National Academy of Sciences of the United States of America, 104(40), 15894–15898.
Nan, Y., Huang, W. T., Wang, W. J., Liu, C., & Dong, Q. (2016). Subgroup differences in the lexical tone mismatch negativity (MMN) among Mandarin speakers with congenital amusia. Biological Psychology, 113, 59–67. https://doi.org/10.1016/j.biopsycho.2015.11.010
Nan, Y., Liu, L., Geiser, E., Shu, H., Gong, C. C., Dong, Q., & Desimone, R. (2018). Piano training enhances the neural processing of pitch and improves speech perception in Mandarin-speaking children. Proceedings of the National Academy of Sciences, 115(28), E6630–E6639. https://doi.org/10.1073/pnas.1808412115
Nan, Y., Sun, Y., & Peretz, I. (2010). Congenital amusia in speakers of a tone language: Association with lexical tone agnosia. Brain, 133(9), 2635–2642. https://doi.org/10.1093/brain/awq178
Nguyen, S., Tillmann, B., Gosselin, N., & Peretz, I. (2009). Tonal language processing in congenital amusia. Annals of the New York Academy of Sciences, 1169, 490–493. https://doi.org/10.1111/j.1749-6632.2009.04855.x
Ong, J. H., Burnham, D., & Escudero, P. (2015). Distributional learning of lexical tones: A comparison of attended vs. unattended listening. PLoS ONE, 10(7), e0133446. https://doi.org/10.1371/journal.pone.0133446
Ong, J. H., Burnham, D., Stevens, C. J., & Escudero, P. (2016). Naïve learners show cross-domain transfer after distributional learning: The case of lexical and musical pitch. Frontiers in Psychology, 7(1189), 1–10. https://doi.org/10.3389/fpsyg.2016.01189
Patel, A. D. (2011). Why would musical training benefit the neural encoding of speech? The OPERA hypothesis. Frontiers in Psychology, 2. https://doi.org/10.3389/fpsyg.2011.00142
Patel, A. D. (2012). The OPERA hypothesis: Assumptions and clarifications. Annals of the New York Academy of Sciences, 1252, 124–128. https://doi.org/10.1111/j.1749-6632.2011.06426.x
Patel, A. D. (2014). Can nonlinguistic musical training change the way the brain processes speech? The expanded OPERA hypothesis. Hearing Research, 308, 98–108. https://doi.org/10.1016/j.heares.2013.08.011
Patel, A. D., Foxton, J. M., & Griffiths, T. D. (2005). Musically tone-deaf individuals have difficulty discriminating intonation contours extracted from speech. Brain and Cognition, 59, 310–313. https://doi.org/10.1016/j.bandc.2004.10.003
Patel, A. D., Wong, M., Foxton, J., Lochy, A., & Peretz, I. (2008). Speech intonation perception deficits in musical tone deafness (congenital amusia). Music Perception, 25(4), 357–368.
Peretz, I., Champod, A. S., & Hyde, K. (2003). Varieties of musical disorder: The Montreal Battery of Evaluation of Amusia. Annals of the New York Academy of Sciences, 999, 58–75. https://doi.org/10.1196/annals.1284.006
Perrachione, T. K., Fedorenko, E. G., Vinke, L., Gibson, E., & Dilley, L. C. (2013). Evidence for shared cognitive processing of pitch in music and language. PLoS ONE, 8(8), e73372. https://doi.org/10.1371/journal.pone.0073372
Schaal, N. K., Pfeifer, J., Krause, V., & Pollok, B. (2015). From amusic to musical? Improving pitch memory in congenital amusia with transcranial alternating current stimulation. Behavioural Brain Research, 294, 141–148. https://doi.org/10.1016/j.bbr.2015.08.003
Schön, D., Magne, C., & Besson, M. (2004). The music of speech: Music training facilitates pitch processing in both music and language. Psychophysiology, 41(3), 341–349. https://doi.org/10.1111/1469-8986.00172.x
Schroeder, S. R., Marian, V., Shook, A., & Bartolotti, J. (2016). Bilingualism and musicianship enhance cognitive control. Neural Plasticity. https://doi.org/10.1155/2016/4058620
Schwanhäußer, B. (2007). Lexical tone perception and production: The role of language and musical background. University of Western Sydney.
Strait, D. L., & Kraus, N. (2011). Playing music for a smarter ear: Cognitive, perceptual and neurobiological evidence. Music Perception, 29(2), 133–146. https://doi.org/10.1525/MP.2011.29.2.133
Surrain, S., & Luk, G. (2017). Describing bilinguals: A systematic review of labels and descriptions used in the literature between 2005–2015. Bilingualism, 1–15. https://doi.org/10.1017/S1366728917000682
Talamini, F., Carretti, B., & Grassi, M. (2016). The working memory of musicians and nonmusicians. Music Perception, 34(2), 183–191. https://doi.org/10.1525/MP.2016.34.2.183
Tang, W., Xiong, W., Zhang, Y.-X., Dong, Q., & Nan, Y. (2016). Musical experience facilitates lexical tone processing among Mandarin speakers: Behavioral and neural evidence. Neuropsychologia. https://doi.org/10.1016/j.neuropsychologia.2016.08.003
Tillmann, B., Burnham, D., Nguyen, S., Grimault, N., Gosselin, N., & Peretz, I. (2011). Congenital amusia (or tone-deafness) interferes with pitch processing in tone languages. Frontiers in Psychology, 2. https://doi.org/10.3389/fpsyg.2011.00120
Tillmann, B., Lévêque, Y., Fornoni, L., Albouy, P., & Caclin, A. (2016). Impaired short-term memory for pitch in congenital amusia. Brain Research, 1640, 251–263. https://doi.org/10.1016/j.brainres.2015.10.035
Tillmann, B., Rusconi, E., Traube, C., Butterworth, B., Umiltà, C., & Peretz, I. (2011). Fine-grained pitch processing of music and speech in congenital amusia. The Journal of the Acoustical Society of America, 130(6), 4089–4096. https://doi.org/10.1121/1.3658447
Tsao, F.-M., & Liu, H.-M. (2020). Lexical tonal perception development in infancy. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech learning, perception, and production: Multidisciplinary approaches in Chinese language research (Chapter 9). The Springer series on Chinese Language Learning Sciences.
Van Lancker, D., & Fromkin, V. A. (1973). Hemispheric specialization for pitch and "tone": Evidence from Thai. Journal of Phonetics, 1, 101–109.
Vuvan, D. T., Paquette, S., Mignault Goulet, G., Royal, I., Felezeu, M., & Peretz, I. (2018). The Montreal protocol for identification of amusia. Behavior Research Methods, 50(2), 662–672. https://doi.org/10.3758/s13428-017-0892-8
Wang, X., & Peng, G. (2014). Phonological processing in Mandarin speakers with congenital amusia. The Journal of the Acoustical Society of America, 136(6), 3360–3370. https://doi.org/10.1121/1.4900559
Wang, Y., Behne, D. M., Jongman, A., & Sereno, J. A. (2004). The role of linguistic experience in the hemispheric processing of lexical tone. Applied Psycholinguistics, 25(3), 449–466.
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. The Journal of the Acoustical Society of America, 106, 3649.
Wayland, R. P., Herrera, E., & Kaan, E. (2010). Effects of musical experience and training on pitch contour perception. Journal of Phonetics, 38(4), 654–662. https://doi.org/10.1016/j.wocn.2010.10.001
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
Werker, J. F., Yeung, H. H., & Yoshida, K. A. (2012). How do infants become experts at native-speech perception? Current Directions in Psychological Science, 21(4), 221–226. https://doi.org/10.1177/0963721412449459
Wong, P. C. M., Ciocca, V., Chan, A. H. D., Ha, L. Y. Y., Tan, L. H., & Peretz, I. (2012). Effects of culture on musical pitch perception. PLoS ONE, 7(4), 1–8. https://doi.org/10.1371/journal.pone.0033424
Wong, P. C. M., & Perrachione, T. K. (2007). Learning pitch patterns in lexical identification by native English-speaking adults. Applied Psycholinguistics, 28(4), 565.
Wong, P. C. M., Skoe, E., Russo, N. M., Dees, T. M., & Kraus, N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns. Nature Neuroscience, 10(4), 420–422.
Wu, H., Ma, X., Zhang, L., Liu, Y., Zhang, Y., & Shu, H. (2015). Musical experience modulates categorical perception of lexical tones in native Chinese speakers. Frontiers in Psychology, 6(April), 1–7. https://doi.org/10.3389/fpsyg.2015.00436
Yang, W. X., Feng, J., Huang, W. T., Zhang, C. X., & Nan, Y. (2014). Perceptual pitch deficits coexist with pitch production difficulties in music but not Mandarin speech. Frontiers in Psychology, 4(1024), 1–10. https://doi.org/10.3389/fpsyg.2013.01024
Zhang, C., Peng, G., Shao, J., & Wang, W. S. (2017). Neural bases of congenital amusia in tonal language speakers. Neuropsychologia, 97, 18–28. https://doi.org/10.1016/j.neuropsychologia.2017.01.033
Zhang, C., & Shao, J. (2018). Normal pre-attentive and impaired attentive processing of lexical tones in Cantonese-speaking congenital amusics. Scientific Reports, 8(8420). https://doi.org/10.1038/s41598-018-26368-7
Zhao, T. C., & Kuhl, P. K. (2015a). Effect of musical experience on learning lexical tone categories. The Journal of the Acoustical Society of America, 137(3), 1452–1463. https://doi.org/10.1121/1.4913457
Zhao, T. C., & Kuhl, P. K. (2015b). Higher-level linguistic categories dominate lower-level acoustics in lexical tone processing. The Journal of the Acoustical Society of America, 138(2), EL133–EL137. https://doi.org/10.1121/1.4927632
Zheng, Y., & Samuel, A. G. (2018). The effects of ethnicity, musicianship, and tone language experience on pitch perception. Quarterly Journal of Experimental Psychology, 71(12), 2627–2642. https://doi.org/10.1177/1747021818757435

Chapter 9

Multi-Modal Perception of Tone

Yue Wang, Joan A. Sereno, and Allard Jongman

Abstract This chapter surveys the role of visual cues in Chinese lexical tone production and perception, addressing the extent to which visual information conveys linguistically relevant cues that signal tonal category distinctions or is merely attention-grabbing in general. Specifically, the survey summarizes research findings on which visual facial cues are relevant for tone production, whether these cues are adopted in native and non-native audio-visual tone perception, and whether visual hand gestures also affect tone perception. Production findings demonstrate that head, jaw, eyebrow, and lip movements are aligned with the specific spatial and temporal pitch movement trajectories of different tones, suggesting linguistically meaningful associations of these visual cues with tone articulation. Perception findings consistently show that specific facial and hand gestures corresponding to the pitch movements of individual tones do benefit tone intelligibility, and that these benefits can be augmented by linguistic experience. Together, these findings suggest language-specific mechanisms in cross-modal tone production and perception.

Y. Wang (B) Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada e-mail: [email protected]
J. A. Sereno · A. Jongman Department of Linguistics, University of Kansas, Lawrence, KS, USA
© Springer Nature Singapore Pte Ltd. 2020 H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_9

9.1 Introduction

Our understanding of the extent to which visual facial cues can aid speech communication is largely based on the perception of consonants and vowels. Studies dating back to at least the 1950s have demonstrated that segmental perception benefits from visual cues, especially when auditory distinctiveness decreases (e.g., Sumby & Pollack, 1954). Specifically, research has established that visual cues provided by speakers' facial movements, particularly those resulting from vocal tract configurations, such as lip opening, rounding, and spreading, benefit segmental speech perception (Kim & Davis, 2014; Perkell, Zandipour, Matthies, & Lane, 2002; Traunmüller & Öhrström, 2007). In contrast, studies on the role of visual facial cues in the perception of prosody, including lexical tone in Chinese, did not appear until the early 2000s, and the findings have been inconclusive.

Many languages, including most Chinese languages (e.g., Cantonese, Mandarin), employ tones to convey lexical meaning, similar to the linguistic function of segmental phonemes. However, unlike phonemes, lexical tones are acoustically manifested primarily as changes in fundamental frequency (F0, perceived as pitch) as well as duration and amplitude, which are triggered by glottal and sub-glottal activities independent of vocal tract configurations (Fromkin, 1978; Howie, 1976; Lehiste, 1970; Yip, 2002). As such, although facial and even manual gestural movements have been shown to facilitate tone perception (e.g., Burnham et al., 2006; Chen & Massaro, 2008; Morrett & Chang, 2015), it is unclear whether such movements are linguistically meaningful cues that signal tonal category distinctions or general "attention-grabbing" cues.

To address these issues, this chapter provides a survey of how visual cues in Chinese tone production coordinate with acoustic tonal features and integrate with auditory cues in tone perception. In particular, the survey summarizes research findings on (1) visual facial cues (head/jaw, eyebrow, and lip movements) identified as relevant for tone production, (2) the perceptual correlates of visual facial cues in native and non-native audio-visual tone perception, and (3) the role of visual hand gestures in audio-gestural tone perception. Bringing these findings together, the chapter concludes with a discussion of the extent to which cross-modal integration of sensory-motor information in tone production and perception reflects specific linguistically motivated cues to tonal distinctions or more generic attentional cues.

9.2 Identifying Facial Cues in Tone Production

There is evidence that movements of the head, jaw, neck, and eyebrows, as well as the lips, are associated with specific tonal or general prosodic production (Attina et al., 2010; Burnham, Ciocca, & Stokes, 2001a; Chen & Massaro, 2008; Kim, Cvejic, & Davis, 2014; Swerts & Krahmer, 2010; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Yehia, Kuratate, & Vatikiotis-Bateson, 2002). Some of these movements (e.g., of the neck and chin) are believed to be physiologically motivated, due to movements of the laryngeal muscles that control the vocal folds when pitch is varied (Burnham et al., 2015; Yehia et al., 2002). Attempts have also been made to relate certain facial movements (e.g., of the head, eyebrows, and lips), in terms of spatial and temporal changes in distance, direction, speed, and timing, to acoustic features of tonal changes in height, contour, and duration (Attina et al., 2010; Garg, Hamarneh, Jongman, Sereno, & Wang, 2019).

9 Multi-Modal Perception of Tone

161

9.2.1 Head and Jaw

Research has demonstrated that head movements reflect acoustic correlates in terms of F0 changes. First, the magnitude of head motion appears to be aligned with the amount of F0 variation. For instance, based on computer-vision analysis, Garg et al. (2019) found that Mandarin high-level tone (Tone 1), compared to the other tones, involves minimal head movements and low movement velocity, indicating the “level” (i.e., minimal F0 variation) nature of Tone 1. Likewise, Burnham et al. (2006) showed that head movements (e.g., nodding, tilting, rotation toward the back), as computed from the principal component analysis on kinematic sensor data, were correlated with F0 changes in Cantonese tones. These results are consistent with previous studies on prosody showing that head movements are larger (and occur more frequently) in prosodic constituents with a larger amount of variance in F0 (Munhall et al., 2004; Yehia et al., 2002), for example, in sentences with strong focus (Kim et al., 2014; Swerts & Krahmer, 2010), in stressed syllables (Scarborough, Keating, Mattys, Cho, & Alwan, 2009), and in interrogative intonation (Srinivasan & Massaro, 2003). Furthermore, it has been shown that vertical head and jaw movements are compatible with tone contour direction. Garg et al. (2019) demonstrated that upward and downward head movements follow the rising, dipping, and falling tone trajectories for Mandarin mid-high-rising tone (Tone 2), low-dipping tone (Tone 3), and high-falling tone (Tone 4), respectively. Moreover, the time taken for the movements to reach the maximum displacement is also aligned with these trajectories. Similarly, kinematic data show back-and-forth head movements to be correlated with F0 modulation of contour tones in general (Tones 2–4 in Mandarin; Attina et al., 2010), and a lowered jaw position correlated with the production of a low tone (Tone 3) in low vowel contexts (Shaw, Chen, Proctor, Derrick, & Dakhoul, 2014).
These patterns suggest a positive correlation between head/jaw movements and changes in F0 in the production of tonal variations. It has been speculated that head and jaw lowering or raising can be triggered by a reduction or increase in the tension of the vocal folds (movements of the cricothyroid muscle and ligaments) associated with low- or high-pitched tones, respectively (Moisik, Lin, & Esling, 2014; Smith & Burnham, 2012; Yehia et al., 2002). However, additional quantitative data are needed to further identify the articulatory and physiological relevance of head/jaw movements in characterizing individual tonal categories and how they are associated with F0 variations.
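The kind of kinematic–acoustic alignment analysis described above can be sketched in a few lines. The traces below are synthetic stand-ins for illustration (not data from any of the cited studies), and the displacement scaling and noise level are arbitrary assumptions:

```python
import numpy as np

# Synthetic illustration: a rising F0 trajectory, as in Mandarin Tone 2,
# paired with an upward vertical head-displacement trace plus measurement
# noise. The 0.02 scaling and 0.2 noise SD are invented for the sketch.
t = np.linspace(0.0, 0.4, 200)            # a 400-ms tone
f0 = 180.0 + 120.0 * t / 0.4              # F0 rising from 180 to 300 Hz
rng = np.random.default_rng(0)
head_y = 0.02 * (f0 - f0.mean()) + rng.normal(0.0, 0.2, t.size)

# Pearson correlation between head displacement and F0 -- the kind of
# kinematic-acoustic alignment measure reported in production studies.
r = np.corrcoef(head_y, f0)[0, 1]
print(f"head-F0 correlation: r = {r:.2f}")
```

With real data, the head-displacement trace would come from motion capture or computer-vision tracking and the F0 trace from a pitch tracker, time-aligned before correlating.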

9.2.2 Eyebrows

Eyebrow movements are also found to be associated with prosodic articulation (Kim & Davis, 2014; Swerts & Krahmer, 2010; Munhall et al., 2004; Yehia et al., 2002), although little research has focused on tone. Garg et al. (2019) showed that, similar to head movements, the spatial and temporal changes in eyebrow motion also follow the

162

Y. Wang et al.

trajectories of tone height and contour in Mandarin. Specifically, the magnitude of eyebrow displacement, as well as its movement velocity, is smaller for the level tone (Tone 1) as compared to the contour tones. For the contour tones (Tones 2, 3, and 4), eyebrow movements are aligned with the direction and timing of the rising, dipping, and falling trajectories of these tones. It should be noted that these measurements of eyebrow movements have been corrected for head motion; thus, the observed eyebrow movement patterns in tone production are not a byproduct of head movements but are independent of them. Although little work has examined tone directly, research on prosodic and non-speech pitch contrasts lends some support to the patterns observed in Garg et al. (2019). Data from kinematic measures reveal larger vertical eyebrow displacement and higher peak velocity of eyebrow movements for focused (Kim et al., 2014), accented (Flecha-Garcia, 2010; Swerts & Krahmer, 2010), and stressed (Scarborough et al., 2009) words in a sentence. These results indicate that eyebrow movements may be coordinated with F0 for prosodic contrasts, although in these studies the specific connection to F0 changes (in terms of height and direction) is not straightforward or invariably evident (Ishi, Haas, Wilbers, Ishiguro, & Hagita, 2007; Reid et al., 2015). Although they did not specifically focus on prosody, Huron and Shanahan (2013) did report a causal relationship between vertical eyebrow displacement and F0 height through manipulation of eyebrow movements. By instructing the speakers to raise or lower their eyebrows to different degrees during reading, the authors found higher eyebrow placement to be associated with a higher vocal pitch. These results reveal similar patterns of eyebrow and head movements. However, unlike the case for head motion, eyebrow movements cannot be interpreted in relation to laryngeal activities, and thus pitch.
Instead, eyebrow movements in distance, direction, speed, and timing may be spatially and temporally equated with acoustic features in terms of pitch height, contour, and duration, since pitch has been claimed to be audio-spatial in representation (Connell, Cai, & Holler, 2013; Hannah et al., 2017).

9.2.3 Lips

Lip movements typically signal segmental rather than prosodic contrasts, since the articulation of prosody does not rely on vocal tract configurations. Nonetheless, there has been evidence that lip movements (e.g., lip opening, lowering, inter-lip distance) may be spatially and temporally aligned with prosodic changes such as stress (Dohen & Loevenbruck, 2005; Dohen, Loevenbruck, & Hill, 2006; Scarborough et al., 2009). For Mandarin tone production, Attina et al. (2010) reported a general correlation between lip closing and F0 irrespective of tones, as well as unique patterns for individual tones (Tones 1 and 2 only). In particular, Tone 1 was characterized by lip raising (as well as jaw advancement), suggesting a potential link between these movements and the height or the lack of contour of this high-level tone; in contrast, Tone 2 production was mainly distinguished by lip protrusion, claimed to be related to the rising contour. In addition, for the high-falling tone (Tone 4) in Mandarin, temporal and spatial events coordinate to signal downward movement (Garg et al., 2019). Specifically, relative to the other tones, Tone 4 exhibited the longest time for the velocity of lip closing to reach maximum value and was also accompanied by the longest time for the head and the eyebrows to reach maximum lowering, which suggests that the lowering movement occurred in the later part of the tone production, corresponding to the falling F0 trajectory of this tone. Although these studies show that certain tonal information may be carried by lip configurations, as is the case for the head and eyebrows, further research is needed to pinpoint the specific movements characterizing different tone categories, and to examine if and how they correspond to changes in tone height and contour.

Taken together, these results suggest that specific movements of the head, eyebrows, and lips are correlated with tonal articulation and are likely coordinated with the spatial and temporal dynamics of the production of different tones. However, evidence from tone perception research is needed to determine if these facial tonal cues are indeed used to facilitate perception of categorical tonal distinctions and the extent to which perception is based on the linguistic relevance of these cues.

9.3 Audio-Visual Tone Perception

We will first discuss the use of visual (mostly facial) cues by native perceivers and then turn our attention to non-native perceivers (including learners) whose native language is tonal or non-tonal, in order to identify the linguistically relevant cues to visual tone perception.

9.3.1 Native Perceivers

Pioneering research by Burnham and colleagues first established the presence of visual facial cues for tones. Burnham et al. (2001a) tested the identification of the six tones of Cantonese. Cantonese perceivers were presented with Cantonese words in three modes: Audio-Visual (AV), in which they both saw and heard a video clip of the speaker; Audio-Only (AO), in which they heard the speaker and saw a still picture of her face; and Video-Only (VO), in which perceivers saw the video clip without any sound. Overall, there was no evidence of visual augmentation: Perceivers were found to be equally accurate in the AV mode (mean accuracy: 82.6%) and the AO mode (82.2%). Moreover, performance in the VO mode (18.6%) was at chance level. While these results suggest that visual information does not augment auditory tone perception, more detailed analyses revealed that perception in the VO mode was better than chance under certain conditions. Specifically, visual information was helpful for perceivers without phonetic training, but not those with phonetic training; for tone carried on monophthongs, but not diphthongs; for tones spoken in a carrier phrase, but not in isolation; and for contour tones, but not level tones. Thus, under certain circumstances, visual information did play a role. While perceivers’ tone identification was not very accurate, it was significantly better than chance, indicating that there is helpful visual information in tone articulation.

Mixdorff, Hu, and Burnham (2005) replicated the basic findings of Burnham et al. (2001a) for Mandarin. They presented Mandarin Chinese perceivers with the four tones of Mandarin in AV and AO modes. Accuracy was very high (near 100%) and no difference between the AV and AO modes was observed. This absence of visual augmentation was also reported for Thai with an AX discrimination task in which the five tones were presented pair-wise in AV, AO, and VO modes (Burnham et al., 2015), and performance in the VO condition was significantly better than chance. The findings from these studies indicate that visual cues to tone are present (performance in VO mode is better than chance) but native tonal perceivers do not additionally benefit from visual information over that provided by the auditory signal (performance in AV mode is not better than in AO mode). However, similar to Sumby and Pollack’s (1954) observation for segmental distinctions, visual information may become more prominent as auditory information becomes degraded and more difficult to access. The following sections examine whether a visual benefit for tonal distinctions can be observed under auditorily challenging conditions such as the presence of background noise, hearing impairment, or when presented with non-native input.
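The recurring “better than chance” comparisons in these studies can be made concrete with a one-sided test against the guessing rate. The sketch below uses the normal approximation to the binomial; the chance level (1/6 for six Cantonese tones) and the 18.6% VO accuracy come from Burnham et al. (2001a), but the trial count is a hypothetical value chosen purely for illustration:

```python
import math

def above_chance_z(p_hat, p_chance, n_trials):
    """One-sided z-test (normal approximation to the binomial) for whether
    observed accuracy p_hat exceeds the chance level p_chance."""
    se = math.sqrt(p_chance * (1.0 - p_chance) / n_trials)
    return (p_hat - p_chance) / se

# Six Cantonese tones -> chance = 1/6; VO accuracy reported as 18.6%.
# n_trials = 600 is a hypothetical trial count, not the study's actual n.
z = above_chance_z(0.186, 1.0 / 6.0, 600)
print(f"z = {z:.2f}")   # below the one-sided 5% criterion of 1.645
```

At this hypothetical trial count z falls short of significance, consistent with the reported at-chance VO performance; published analyses would apply an exact binomial test to the actual trial counts.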

9.3.1.1 Perception in Noise

Given that auditory tone perception has been found to be less accurate in noise (e.g., Lee & Wiener, Chap. 1 of this volume), it is conceivable that perception may benefit from complementary visual information in such adverse auditory conditions. Indeed, while Mixdorff et al. (2005) did not report any difference between the identification of Mandarin tones in AV and AO modes, an advantage for the AV mode over the AO mode became apparent when the same stimuli were presented in babble noise at different signal-to-noise ratios (SNR). Specifically, as the SNR decreased, the relative gain of the AV mode over the AO mode increased from 1.3% at −3 dB to 15.3% at −12 dB. Similar results were reported for Thai (Burnham et al., 2015), with an AV advantage only found when the stimuli were presented in multi-talker Thai babble noise at an SNR of −8 dB. Overall, these results suggest that the visual benefit stems from the early integration of acoustic and visual cues rather than additional information in the video signal per se.
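The noise manipulations in these studies can be reproduced with a standard mixing procedure: scale the noise so that the speech-to-noise power ratio hits a target SNR before adding it to the speech. The signals below are synthetic placeholders, not the actual stimuli:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db` (in dB), then add it to the speech -- the standard way to
    construct stimuli such as 'babble noise at -12 dB SNR'."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# Illustrative one-second signals at 16 kHz, not real recordings.
rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 220 * np.linspace(0.0, 1.0, 16000))
babble = rng.normal(0.0, 1.0, 16000)

mixed = mix_at_snr(speech, babble, snr_db=-12.0)
```

In a real experiment the babble track would be multi-talker speech rather than Gaussian noise, but the scaling arithmetic is the same.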

9.3.1.2 (Simulated) Hearing Impairment

The influence of facial information has been well documented for segmental perception in populations with hearing loss (Campbell, Dodd, & Burnham, 1998; Grant, Walden, & Seitz, 1998; Schorr, Fox, van Wassenhove, & Knudsen, 2005). Research suggests that hearing-impaired perceivers may show a greater reliance on visual information than normal-hearing perceivers (Desai, Stickney, & Zeng, 2008; Rouger, Lagleyre, Fraysse, Deneve, Deguine, & Barone, 2007). However, it remains to be seen if this holds true for the perception of tone as well. Smith and Burnham (2012) took a first step toward addressing this question by using simulated cochlear implant audio. That is, the audio signal was processed in a way to make it similar to that perceived by users of a cochlear implant (CI). Since CIs are poor at providing clear pitch information, CI users may rely more on visual cues to tone than normal-hearing perceivers. Mandarin stimuli were presented in an AX discrimination task in five conditions: AV, AO, VO, CI-simulated AV, and CI-simulated AO. Mandarin perceivers performed significantly better than chance in the VO condition but showed no advantage for the AV over the AO condition. However, when the acoustic signal was degraded to resemble CI speech, perceivers did significantly better in the CI-simulated AV than in the CI-simulated AO condition. These data also suggest that an impoverished audio signal encourages the use of visual information.
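CI-simulated audio of the kind used by Smith and Burnham (2012) is typically produced by noise vocoding. The minimal sketch below (band edges, envelope window, and the brick-wall FFT filter are illustrative assumptions, not the study’s parameters) shows why such processing degrades tonal cues: only the slowly varying band envelopes survive, while the F0 fine structure is replaced by noise:

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude FFT brick-wall bandpass filter (illustrative only)."""
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=x.size)

def noise_vocode(x, fs, band_edges, env_ms=20.0):
    """Minimal noise vocoder: split the signal into bands, extract each
    band's slowly varying amplitude envelope, and use it to modulate
    band-limited noise. Pitch fine structure, the main cue to lexical
    tone, is discarded in the process."""
    rng = np.random.default_rng(0)
    n_win = int(fs * env_ms / 1000.0)
    win = np.ones(n_win) / n_win                     # moving-average smoother
    out = np.zeros_like(x)
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = bandpass_fft(x, fs, lo, hi)
        env = np.convolve(np.abs(band), win, mode="same")
        carrier = bandpass_fft(rng.normal(0.0, 1.0, x.size), fs, lo, hi)
        out += env * carrier
    return out

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * (200 + 100 * t) * t)       # toy rising-pitch signal
vocoded = noise_vocode(tone, fs, band_edges=[100, 400, 1000, 2400, 6000])
```

Real CI simulations use proper filter banks and channel counts matched to implant processors; the point of the sketch is only the envelope-on-noise principle.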

9.3.1.3 Directed Attention

Chen and Massaro (2008) investigated whether perceivers could be trained to pay attention to specific visual cues to Mandarin tones. Mandarin perceivers’ tone identification was tested before and after training. This study focused exclusively on the VO mode. During training, participants were instructed to pay attention to mouth, head/chin movements, and especially activities of the neck. They were also allowed to use a sheet that summarized the main visual correlates (cues relating to activity of the neck and chin) of each of the four tones. Before training, perceivers’ accuracy was significantly above chance at 33%. After training, their accuracy was significantly better, at 48%. Thus, perceivers’ awareness of and use of visual cues to tone can be improved through specific training.

9.3.1.4 Individual Tones

So far, we have observed overall gains in performance due to the presence of visual cues with and without accompanying auditory cues. However, just as certain consonants (most notably those with more anterior articulations) benefit more from visual information than others (e.g., Jongman, Wang, & Kim, 2003), not all tones benefit equally from the presence of the speaker’s face. In their study of Cantonese, Burnham et al. (2001a) reported better-than-chance performance in the VO mode only for the dynamic tones, i.e., those tones whose pitch contour exhibited movement. In Mandarin, the visual gain (AV better than AO) observed by Mixdorff et al. (2005) in babble noise occurred only for the contour tones (Tones 3 and 4). Chen and Massaro (2008) also reported better-than-chance performance in the VO mode for Tones 2 and 3 before training; after training, performance on Tone 1 was above chance as well. In their discrimination task with all possible pairings of the Mandarin tones, Smith and Burnham (2012) found that level–contour contrasts (Tone 1–Tone 3 was most discriminable) were better discriminated than contour–contour contrasts (Tone 2–Tone 3 was least discriminable) and that the discrimination rankings were the same for the AV and AO conditions when stimuli were presented without noise. In CI speech, where F0 is not available, pairings involving Tone 3 were better discriminated, and this advantage was more pronounced in the AV condition. In the VO mode, Tone 2–Tone 3 was most easily discriminated while pairings involving Tone 4 were poorly discriminated. Finally, Burnham et al. (2015) reported for Thai that the dynamic Rising–Falling contrast was most discriminable in the VO mode and that static–dynamic pairs were better discriminated when they included the rising tone rather than the falling tone. Visual augmentation in noise was also greatest for the Rising–Falling contrast.

Taken together, greater visual benefits are found for more dynamic tones or tone pairs that are more contrastive in contour shape. These results are consistent with findings in production that head movements are greater for tones with a larger amount of variance in F0 (Garg et al., 2019; Munhall et al., 2004; Yehia et al., 2002).

9.3.2 Non-native Perceivers

Burnham, Lau, Tam, and Schoknecht (2001b) followed up on the initial study by Burnham et al. (2001a) by exploring the perception of the same Cantonese stimuli by English (non-native, non-tonal) and Thai (non-native, tonal) perceivers. An AX discrimination task was used which again included the AV, AO, and VO modes. Stimuli were presented in the clear and in babble noise at an SNR of −0.6 dB. For both the English and Thai groups, performance in the VO mode was significantly better than chance. While the Thai perceivers showed no significant difference between AV and AO modes in the clear condition, visual augmentation was significant in the babble noise condition. English perceivers, in contrast to the Thai perceivers, did significantly better in the AO as compared to the AV mode. In the babble noise condition, English perceivers also did significantly better than chance for all three modes but there was no advantage for the AV over the AO condition. In sum, when presented with visual information only, perceivers of both a tonal and a non-tonal language were able to use this information to distinguish the tones of a language they did not know. In addition, for speech embedded in noise, tonal Thai perceivers showed increased accuracy when visual information was added to the auditory signal but English perceivers did not. Even though non-native English perceivers can pick up visual cues to tone as evidenced by their better-than-chance performance in the VO mode, they do not always seem capable of integrating this information with the auditory information.

The literature on visual augmentation in tone perception indicates that, for non-native perceivers, there appears to be a universal advantage of the AV mode over the AO mode when stimuli are presented in noise. In their comprehensive study of Thai tone perception by native perceivers of Mandarin, Cantonese, Swedish, and English, Burnham et al. (2015) found that tone perception was consistently better in the AV than AO mode for all language groups. The one exception is the finding that English perceivers were equally accurate in AV and AO modes for speech presented in noise, which was attributed to a floor effect (Burnham et al., 2001b). Interestingly, naïve Dutch perceivers’ identification of Mandarin tones was found to be better in the AV than AO mode even for stimuli presented in the clear (Han, Goudbeek, Mos, & Swerts, 2019). Overall, a visual benefit obtains for non-native perceivers whose native language is a tone language, as well as non-native perceivers without any prior exposure to a tone language. Moreover, there is a language-specific aspect to the processing of visual cues to tone in that non-native perceivers whose native language is non-tonal benefit more from visual information than tonal perceivers. For example, English perceivers outperformed Mandarin perceivers in their discrimination of Mandarin tones in the VO mode (Smith & Burnham, 2012). They were also better than perceivers of other tone languages (Mandarin, Cantonese, Swedish) in discriminating Thai tones in VO (Burnham et al., 2015). In a comparison of congruent and incongruent AV information, English perceivers relied more on facial information while Mandarin perceivers relied almost exclusively on auditory information (Hannah et al., 2017). In sum, these studies indicate that facial cues for tone are more likely used by non-native perceivers who find themselves in a challenging non-native phonetic situation. However, non-natives’ superior performance in the VO mode does not necessarily transfer to the AV mode; the English perceivers in Burnham et al. (2015) were poorer at AV integration than the non-native tone perceivers.
Taken together, non-native visual tone perception appears to involve language-specific aspects as a function of perceivers’ linguistic experience, just as is the case in non-native auditory perception (e.g., Ingvalson & Wong, Chap. 2; Lee & Wiener, Chap. 1 of this volume). Further research explores these aspects by focusing on visual tone perception in different speech styles and by tracing learning trajectories through visual tone perception training.

9.3.2.1 Clear Speech

Research has shown that acoustic cues to segmental contrasts (e.g., vowels or fricatives) tend to be exaggerated in clear, hyperarticulated speech (Ferguson & Kewley-Port, 2007; Maniwa, Jongman, & Wade, 2009; Leung, Wang, Jongman, & Sereno, 2016). These clear speech tokens are in turn more intelligible than their casual speech counterparts (e.g., Ferguson & Kewley-Port, 2002; Maniwa, Jongman, & Wade, 2008). Additionally, visual cues are also more pronounced in clear speech (e.g., Kim, Sironic, & Davis, 2011; Tang et al., 2015). Kim et al. (2011) established that the greater articulatory movement observed in clear speech produced in noise contributed to greater intelligibility in the AV mode. Much less is known, however, about the perception of clearly produced tones, particularly about whether hyperarticulated visual cues can enhance perception in a linguistically challenging non-native setting.

In one of the few studies, Han et al. (2019) did not find an overall significant difference between the perception of clearly and casually produced Mandarin tones by Dutch perceivers. Inspection of the four speakers used in this study revealed that perceivers did perform significantly better on the clearly produced tokens of two speakers, but significantly worse on those of one of the other two. In addition, analysis of individual tones showed that perceivers identified Tones 2 and 4 more quickly in clear than in casual productions, probably because contour tones are hyperarticulated to a greater degree (cf. Kim & Davis, 2001). The lack of any effect of speech style on Tones 1 and 3 may be because Tone 1 involves minimal hyperarticulation, while Tone 3 is already the easiest to distinguish in natural style (Chen & Massaro, 2008; Mixdorff et al., 2005). These results indicate that for non-native perceivers, there may be some visual cues associated with clear speech production of tone. Moreover, these results are consistent with the articulatory findings of greater and more dynamic facial movements for contour than level tones (Garg et al., 2019), indicating that these visual cues are linguistically relevant and can aid non-native perception when enhanced.

9.3.2.2 Perceptual Training

Research has shown that non-native listeners with little or no experience with a tone language can improve their tone perception with a relatively brief training program (Wang, Spence, Jongman, & Sereno, 1999). Beginning American learners of Mandarin improved their accuracy in tone identification by 21% after eight sessions of high-variability training in which they identified which tone of a given pair they had heard and received feedback (Wang et al., 1999). Listeners were trained and tested in the AO mode only.

Very few training studies have considered augmenting AO tone training with visual information. As mentioned earlier, Chen and Massaro (2008) included a training phase during which participants were instructed to pay attention to mouth, head/chin, and neck movements. Trainees received one 45-min training session during which they were first shown a 15-min video that illustrated various articulatory correlates of each tone, followed by about 10–20 practice trials with feedback. Training and testing took place in the VO mode only. Results showed that training yielded a 15% increase in tone identification, from 33% accuracy before training to 48% after training.

Most recently, Kasisopa, El-Khoury Antonios, Jongman, Sereno, and Burnham (2018) conducted a systematic comparison of Mandarin tone training in the AO and AV modes. Specifically, participants received either AO or AV training and were then tested in both the AO and AV modes. Eight groups of 6- and 8-year-old monolingual or bilingual children with either a tone or non-tone language background participated: Thai monolingual, English monolingual, English-Thai bilingual, and English-Arabic bilingual. While the effect of training was minimal for 6-year-olds, 8-year-olds did show improvement as a result of training. In particular, results showed that tone language experience, either monolingual or bilingual, is a strong predictor of learning unfamiliar tones. Eight-year-old monolingual children improved with AV training but not with AO training, whereas 8-year-old bilingual children improved with AO training and to a lesser extent with AV training. These findings provide developmental data supporting linguistically motivated AV tone perception in that visual tone perception can be improved as a function of linguistic experience.

9.4 Gestural Tone Perception

Previous research has claimed that pitch is audio-spatial in representation (Connell et al., 2013). Findings from visual cues in tone production indicate that facial gestures such as eyebrow movements may be spatially equated with tonal variations. This association may encourage the perception of tone to be tied to visual-spatial features. Indeed, similar to facial-spatial gestures, upward and downward hand gestures have been shown to affect pitch perception in the direction of the gesture (Connell et al., 2013). Capturing pitch in gestures takes its inspiration from illustrative aids in music perception. For example, to create audio-spatial connections, music teachers use hand gesture levels and diagrams of melodic contours to enhance pitch perception (Apfelstadt, 1988; Welch, 1985), and singers can be trained to improve their pitch perception accuracy using such gestures (Liao, 2008; Liao & Davidson, 2016). In a linguistic context, gestures have been shown to affect the perception of prosodic information in a non-native language. For example, Kelly, Bailey, and Hirata (2017) found that upward or downward hand movements congruous with the direction of the intonational pitch contour (rising or falling, respectively) could facilitate perception of intonation in a non-native language, while incongruous gesture-pitch matching was disruptive, suggesting a direct link between hand gestures and pitch perception.

Hand gestures have also been shown to facilitate lexical tone perception and learning. Morett and Chang (2015) trained English perceivers to learn Mandarin tone words either with or without viewing hand gestures tracing tone contours. The results showed greater post-training improvements for the group who received training with gesture compared to the no-gesture training group.
However, this improvement only held true for the word-meaning association task with a limited number of training words, whereas for tone identification, gestural training did not show any advantage. As such, it is not clear whether the facilitative effects of gesture could be attributed to effective cross-modal pitch-gesture associations or are the result of memorization using verbal labeling strategies (cf. Connell et al., 2013).

Hannah et al. (2017) attempted to address whether a pitch-gesture association can be established in a linguistically meaningful manner by manipulating auditory and gestural input and through comparing native and non-native perceptual patterns. Specifically, native Mandarin perceivers and native English perceivers identified Mandarin tones embedded in noise with either congruent or incongruent auditory-gestural inputs, where the hand movements tracing the tones were in the same (congruent) or different (incongruent) direction and shape as the tone contours. Native Mandarin results showed exclusive reliance on auditory information in the congruent condition, whereas in the incongruent conditions, identification was partially based on gestures, demonstrating the use of gestures as valid cues in Mandarin tone identification. The English perceivers’ performance improved significantly in the congruent auditory-gestural condition compared to the condition without gestural information. Moreover, they relied more on gestural than auditory information in the incongruent condition. These results reveal positive effects of gestural input (tracing tone contours) on both native and non-native tone perception, indicating that cross-modal (visual-spatial) resources can be recruited to aid linguistic perception. These patterns are consistent with the finding that visual presentation of schematic representations of pitch contours enhances auditory tone perception (Liu et al., 2011; see also Ingvalson & Wong, Chap. 2 of this volume). The fact that perceivers can establish a cross-modal link reflects a linguistically meaningful association between auditory and visual-spatial events in tone perception. Moreover, the different audio-gestural weighting patterns exhibited in native versus non-native perception further reveal the contribution of language-specific factors in multi-modal tone perception.
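The contrasting native and non-native weighting patterns can be caricatured with a toy linear cue-combination model; the probabilities and weights below are invented for illustration and are not estimates from Hannah et al. (2017):

```python
def integrate_cues(p_auditory, p_gesture, w_auditory):
    """Toy linear cue combination: the evidence for a given tone as a
    weighted average of the auditory-only and gesture-only evidence.
    w_auditory = 1.0 caricatures the native Mandarin pattern (exclusive
    reliance on audition); lower weights caricature English perceivers'
    greater reliance on gesture."""
    return w_auditory * p_auditory + (1.0 - w_auditory) * p_gesture

# Hypothetical incongruent trial: audition favors Tone 2 (evidence 0.8),
# while the hand gesture traces a different contour (evidence 0.1 for Tone 2).
native = integrate_cues(0.8, 0.1, w_auditory=1.0)     # follows audition: 0.8
nonnative = integrate_cues(0.8, 0.1, w_auditory=0.3)  # pulled toward gesture: 0.31
```

Formal models of audio-visual integration (e.g., fuzzy-logical or Bayesian cue weighting) are more elaborate, but the weighted-average intuition is the same.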

9.5 Concluding Remarks

In tone production, although head and eyebrow motion may be attention-drawing overall, the alignment of head, eyebrow, and lip movements with the specific spatial and temporal pitch trajectories of different tones suggests linguistically meaningful associations of these visual cues with tone articulation. Consistently, the degree of visual benefit in tone perception corresponds to the extent of contour movement dynamicity and shape contrastivity of individual tones. Moreover, these benefits can be particularly augmented in non-native perception, when the tonal signal is enhanced in clear speech and with the additional aid of hand gestures, and as perceivers gain additional experience in tone learning. These data suggest language-specific mechanisms and the influence of language experience in cross-modal tone production and perception, above and beyond a general language-universal system.

However, due to the scarcity of research in this area, there remain several underexplored research directions for a more thorough understanding of the role of visual cues in multi-modal tone production and perception. First, it is unclear how some visual cues (e.g., lips) are used in characterizing individual tones. Moreover, further evidence from tone perception research is needed to determine how visual tonal cues are used to facilitate perception of categorical tonal distinctions, and the extent to which perception is based on the linguistic relevance of these cues. Understanding the visual correlates of tone production and perception will not only advance research on cross-modal integration of sensory-motor information in speech processing, but will also have important applications for the development of effective tools for tone language acquisition and learning, as well as audio-visual tonal speech synthesis, including visual aids for impaired conditions and noisy environments.

References

Apfelstadt, H. (1988). What makes children sing well? Applications of Research in Music Education, 7, 27–32.
Attina, V., Gibert, G., Vatikiotis-Bateson, E., & Burnham, D. (2010). Production of Mandarin lexical tones: Auditory and visual components. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP) 2010, Hakone.
Burnham, D., Ciocca, V., & Stokes, S. (2001a). Auditory–visual perception of lexical tone. In P. Dalsgaard, B. Lindberg, H. Benner, & Z. H. Tan (Eds.), Proceedings of the 7th Conference on Speech Communication and Technology, EUROSPEECH 2001, Scandinavia (pp. 395–398).
Burnham, D., Lau, S., Tam, H., & Schoknecht, C. (2001b). Visual discrimination of Cantonese tone by tonal but non-Cantonese speakers, and by nontonal language speakers. In D. Massaro, J. Light, & K. Geraci (Eds.), Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP) 2001, Adelaide, SA (pp. 155–160).
Burnham, D., Reynolds, J., Vatikiotis-Bateson, E., Yehia, H., & Ciocca, V. (2006). The perception and production of phones and tones: The role of rigid and non-rigid face and head motion. In Proceedings of the International Seminar on Speech Production 2006, Ubatuba.
Burnham, D., Kasisopa, B., Reid, A., Luksaneeyanawin, S., Lacerda, F., Attina, V., et al. (2015). Universality and language-specific experience in the perception of lexical tone and pitch. Applied Psycholinguistics, 36, 1459–1491.
Campbell, R., Dodd, B., & Burnham, D. (1998). Hearing by Eye II: Advances in the Psychology of Speechreading and Audio-visual Speech. Hove, UK: Psychology Press.
Chen, T. H., & Massaro, D. W. (2008). Seeing pitch: Visual information for lexical tones of Mandarin-Chinese. Journal of the Acoustical Society of America, 123, 2356–2366.
Connell, L., Cai, Z. G., & Holler, J. (2013). Do you see what I’m singing? Visuospatial movement biases pitch perception. Brain and Cognition, 81, 124–130.
Desai, S., Stickney, G., & Zeng, F. G. (2008). Auditory-visual speech perception in normal-hearing and cochlear-implant listeners. Journal of the Acoustical Society of America, 123, 428–440.
Dohen, M., & Loevenbruck, H. (2005). Audiovisual production and perception of contrastive focus in French: A multispeaker study. Interspeech, 2005, 2413–2416.
Dohen, M., Loevenbruck, H., & Hill, H. (2006). Visual correlates of prosodic contrastive focus in French: Description and inter-speaker variability. In R. Hoffmann & H. Mixdorff (Eds.), Speech Prosody 2006 (pp. 221–224).
Ferguson, S. H., & Kewley-Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America, 112, 259–271.
Ferguson, S. H., & Kewley-Port, D. (2007). Talker differences in clear and conversational speech: Acoustic characteristics of vowels. Journal of Speech, Language, and Hearing Research, 50, 1241–1255.
Flecha-Garcia, M. L. (2010). Eyebrow raises in dialogue and their relation to discourse structure, utterance function and pitch accents in English. Speech Communication, 52, 542–554.
Fromkin, V. (1978). Tone: A linguistic survey. New York, NY: Academic Press.
Garg, S., Hamarneh, G., Jongman, A., Sereno, J. A., & Wang, Y. (2019). Computer-vision analysis reveals facial movements made during Mandarin tone production align with pitch trajectories. Speech Communication, 113, 47–62.

172

Y. Wang et al.

Grant, K. W., Walden, B. E., & Seitz, P. F. (1998). Auditory-visual speech recognition by hearingimpaired subjects: Consonant recognition, sentence recognition, and auditory-visual integration. Journal of the Acoustical Society of America, 103, 2677–2690. Han, Y., Goudbeek, M., Mos, M., & Swerts, M. (2019). Effects of modality and speaking style on Mandarin tone identification by non-native listeners. Phonetica, 76, 263–286. https://doi.org/10. 1159/000489174. Hannah, B., Wang, Y., Jongman, A., Sereno, J. A., Cao, J., & Nie, Y. (2017). Cross-modal association between auditory and visuospatial information in Mandarin tone perception in noise by native and non-native perceivers. Frontiers in Psychology, 8, 2051. Howie, J. M. (1976). Acoustical studies of Mandarin vowels and tones. Cambridge: Cambridge University Press. Huron, D., & Shanahan, D. (2013). Eyebrow movements and vocal pitch height: Evidence consistent with an ethological signal. Journal of the Acoustical Society of America, 133, 2947–2952. Ishi, C. T., Haas, J., Wilbers, F. P., Ishiguro, H., & Hagita, N. (2007). Analysis of head motions and speech, and head motion control in an android. Paper presented at the International Conference on Intelligent Robots and Systems, San Diego, CA. Jongman, A., Wang, Y., & Kim, B. (2003). Contribution of semantic and facial information to perception of non-sibilant fricatives. Journal of Speech, Language & Hearing Research, 46, 1367–1377. Kasisopa, B., El-Khoury Antonios, L., Jongman, A., Sereno, J. A., & Burnham, D. (2018). Training children to perceive non-native lexical tones: Tone language background, bilingualism, and auditory-visual information. Frontiers in Psychology, 9, 1508. https://doi.org/10.3389/fpsyg. 2018.01508. Kelly, S., Bailey, A., & Hirata, Y. (2017). Metaphoric gestures facilitate perception of intonation more than length in auditory judgments of non-native phonemic contrasts. Collabra: Psychology 3(7). https://doi.org/10.1525/collabra.76. 
Kim, J., & Davis, C. (2001). Visible speech cues and auditory detection of spoken sentences: An effect of degree of correlation between acoustic and visual properties. In International Conference on Auditory-visual Speech Processing (AVSP) 2001, Aalborg. Kim, J., & Davis, C. (2014). Comparing the consistency and distinctiveness of speech produced in quiet and in noise. Computer, Speech and Language, 28, 598–606. Kim, J., Cvejic, E., & Davis, C. (2014). Tracking eyebrows and head gestures associated with spoken prosody. Speech Communication, 57, 317–330. Kim, J., Sironic, A., & Davis, C. (2011). Hearing speech in noise: Seeing a loud talker is better. Perception, 40, 853–862. Lehiste, I. (1970). Suprasegmentals. Cambridge, MA: MIT. Leung, K., Jongman, A., Wang, Y., & Sereno, J. A. (2016). Acoustic characteristics of clearly spoken english tense and lax vowels. Journal of the Acoustical Society of America, 140, 45–58. Liao, M. Y. (2008). The effects of gesture use on young children’s pitch accuracy for singing tonal patterns. International Journal of Music Education, 26, 197–2113. Liao, M. Y., & Davidson, J. W. (2016). The use of gesture techniques in children’s singing. International Journal of Music Education, 25, 82–94. Liu, Y., Wang, M., Perfetti, C. A., Brubaker, B., Wu, S., & MacWhinney, B. (2011). Learning a tonal language by attending to the tone: An in vivo experiment. Language Learning, 61, 1119–1141. Maniwa, K., Jongman, A., & Wade, T. (2008). Perception of clear fricatives by normal-hearing and simulated hearing-impaired listeners. Journal of the Acoustical Society of America, 123, 1114–1125. Maniwa, K., Jongman, A., & Wade, T. (2009). Acoustic characteristics of clearly spoken english fricatives. Journal of the Acoustical Society of America, 125, 3962–3973. Mixdorff, H., Hu, Y., & Burnham, D. (2005). Visual cues in Mandarin tone perception. In Proceedings of the 9th European Conference on Speech Communication and Technology, ISCA, Bonn, Germany, pp. 
405–408.

9 Multi-Modal Perception of Tone

173

Morett, L. M., & Chang, L.-Y. (2015). Emphasizing sound and meaning: Pitch gestures enhance Mandarin lexical tone acquisition. Language and Cognitive Neuroscience, 30, 347–353. Moisik, S. R., Lin, H., & Esling, J. H. (2014). A study of laryngeal gestures in Mandarin citation tones using simultaneous laryngoscopy and laryngeal ultrasound (SLLUS). Journal of the International Phonetic Association, 44, 21–58. Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15, 133–137. Perkell, J. S., Zandipour, M., Matthies, M. L., & Lane, H. (2002). Economy of effort in different speaking conditions. I. A preliminary study of intersubject differences and modeling issues. Journal of the Acoustical Society of America, 112, 1627–1641. Reid, A., Burnham, D., Kasisopa, B., Reilly, R., Attina, V., Rattanasone, N. X., & Best, C. T. (2015). Perceptual assimilation of lexical tone: The roles of language experience and visual information. Attention, Perception and Psychophysics, 77, 571–591. Rouger, J., Lagleyre, S., Fraysse, B., Deneve, S., Deguine, O., & Barone, P. (2007). Evidence that cochlear-implanted deaf patients are better multisensory integrators. Proceedings of the National Academy of Sciences, 104, 7295–7300. Scarbourough, R., Keating, P., Mattys, S. L., Cho, T., & Alwan, A. (2009). Optical phonetics and visual perception of lexical and phrasal stress in English. Language and Speech, 51, 135–175. Schorr, E. A., Fox, N. A., van Wassenhove, V., & Knudsen, E. I. (2005). Auditory-visual fusion in speech perception in children with cochlear implants. Proceedings of the National Academy of Sciences, 102, 18748–18750. Shaw, J. A., Chen, W. R., Proctor, M. I., Derrick, D., & Dakhoul, E. (2014). On the inter-dependence of tonal and vocalic production goals in Chinese. 
Paper presented at the International Seminar on Speech Production (ISSP), Cologne, Germany. Smith, D., & Burnham, D. (2012). Facilitation of Mandarin tone perception by visual speech in clear and degraded audio: Implications for cochlear implants. Journal of the Acoustical Society of America, 131, 1480–1489. Srinivasan, R. J., & Massaro, D. W. (2003). Perceiving prosody from the face and voice: Distinguishing statements from echoic questions in english. Language and Speech, 46, 1–22. Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215. Swerts, M., & Krahmer, E. (2010). Visual prosody of newsreaders: Effects of information structure, emotional content and intended audience on facial expressions. Journal of Phonetics, 38, 197–206. Tang, L., Hannah, B., Jongman, Sereno, Wang, Y., & Hamarneh, G. (2015). Examining visible articulatory features in clear and plain speech. Speech Communication, 75, 1–13. Traunmüller & Öhrström. (2007). Audiovisual perception of openness and lip rounding in front vowels. Journal of Phonetics, 35, 244–258. Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. Journal of the Acoustical Society of America, 106, 3649–3658. Welch, G. F. (1985). A schema theory of how children learn to sing in tune. Psychology of Music, 13, 3–18. Yehia, H. C., Kuratate, T., & Vatikiotis-Bateson, E. (2002). Linking facial animation, head motion and speech acoustics. Journal of Phonetics, 30, 555–568. Yip, M. J. W. (2002). Tone (pp. 1–14). New York, NY: Cambridge University Press.

Part IV

Development from Infancy Through Childhood

Chapter 10

Lexical-Tonal Perception Development in Infancy

Feng-Ming Tsao and Huei-Mei Liu

Abstract The innate capacities and developmental mechanisms involved in infants’ acquisition of their native language are basic topics in speech perception development. Developmental trends in infants’ perception of phonetic segments have been well documented over the past decades; however, studies on the development of “lexical tones,” a phonetic unit unique to tonal languages, have only begun to emerge in the last decade. This chapter reviews studies on tonal perception development in infants learning a tonal language (e.g., Mandarin and Cantonese) or a nontonal language (e.g., English and Dutch). These studies have demonstrated that infants learning a nontonal language are able to discriminate tonal contrasts at 4–6 months of age but cannot easily distinguish the same tonal contrasts at 9–12 months. Conversely, infants exposed to a tonal language exhibit superior ability to discriminate tonal contrasts at approximately 12 months of age. The trajectory of lexical-tone learning is similar to that by which infants learn phonetic segments. Developmental factors in tonal perception include experience listening to the native language, the acoustic salience of lexical tones, statistical learning, musical tone exposure, and referential word learning.

F.-M. Tsao
Department of Psychology, National Taiwan University, Taipei, Taiwan
e-mail: [email protected]

H.-M. Liu (B)
Department of Special Education, National Taiwan Normal University, Taipei, Taiwan
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_10

10.1 Introduction

The ability to distinguish phonetic differences between speech sounds is essential for infants to learn words from their native language. At the beginning of their development, infants exhibit a universal capacity to distinguish between the phonetic segments, namely consonants and vowels, of native and foreign languages (Eimas,
Siqueland, Jusczyk, & Vigorito, 1971; Streeter, 1976). Increased sensitivity to consonants and vowels of the native language has been consistently observed in infants aged between 6 and 12 months (Kuhl et al., 2006; Narayan, Werker, & Beddor, 2010; Tsao, Liu, & Kuhl, 2006). A trend toward reduced sensitivity to phonetic categories of foreign languages also emerges during this period (Kuhl et al., 2008; Werker & Tees, 1984; Werker, Yeung, & Yoshida, 2012). However, infants’ discrimination of vowel and consonant contrasts may involve directional asymmetry. For example, in an assessment of the abilities of 5–6-month-old English- and French-learning infants to distinguish between [b] and [v], a perceptual asymmetry was observed: infants more easily discriminated the stop [b] from the fricative [v] than the fricative [v] from the stop [b] (Nam & Polka, 2016). The developmental reorganization (or perceptual narrowing) of speech perception is not limited to spoken languages; researchers observed that 4-month-old hearing infants could distinguish the handshapes of American Sign Language but 14-month-old hearing infants could not (Palmer, Fais, Golinkoff, & Werker, 2012).

Similar to the functions of consonants and vowels, lexical tones distinguish the lexical meanings of syllables in tonal languages. The most well-known example of a tonal language is Mandarin Chinese, which is also the language with the largest number of first-language learners worldwide (Lewis, Simons, & Fennig, 2015). The developmental trends of infants in distinguishing phonetic segments in both native and foreign languages have been well documented (Werker et al., 2012), but only a few studies, emerging within the past decade, have explored whether infants learning nontonal languages exhibit developmental changes in their abilities to distinguish tonal contrasts (Cabrera et al., 2015; Liu & Kager, 2014, 2017b; Mattock & Burnham, 2006; Yeung, Chen, & Werker, 2013).
Additional studies are required to explore the developmental trends of lexical-tone perception in tonal-language learners. The mechanisms behind the rapid changes in infants’ discrimination of lexical tones during the first year of life have not yet been identified (Singh & Fu, 2016). The following paragraphs review studies that have explored the developmental trends of lexical-tone perception in infants learning tonal or nontonal languages. The subsequent discussion also presents developmental factors that are exclusively relevant to infants’ learning of lexical tones, such as the acoustic salience of tone contrasts, as well as factors generalized from other domains, such as statistical learning, perception of musical tones, and word learning.


10.2 Developmental Trends of Lexical-Tonal Perception in Infancy

10.2.1 Infants in a Monolingual Environment Perceiving Nonnative Tones

Mattock and colleagues (Mattock & Burnham, 2006) were the first to utilize the conditioned head-turn procedure to explore developmental trends in nonnative tonal perception. The researchers assessed 6- and 9-month-old infants learning either English (i.e., a nontonal language) or Mandarin (i.e., a tonal language) for their discrimination of contrasts between nonnative lexical tones (i.e., the lexical tones of Thai); Thai contour-contour (i.e., rising- vs. falling-pitch contour) and contour-level (i.e., rising vs. low-level contour) tonal contrasts were presented to the infants. The acoustic difference between the contour-level tones was smaller than that between the contour-contour tones. The English-learning 9-month-old infants failed to discriminate either set of Thai tone pairs, but the 6-month-olds of the same language background perceived the tonal differences in both pairs. For the Mandarin-learning infants, accuracy in perceiving the Thai tonal contrasts was similar between the 6- and 9-month-olds. The results reported by Mattock and Burnham (2006) suggested that perceptual reorganization in the distinction of tonal contrasts occurs between 6 and 9 months of age in English-learning infants, but that Mandarin-learning infants between 6 and 9 months of age maintain their sensitivity to tonal contrasts of a foreign language. The decreasing sensitivity to differences in nonnative lexical tones exhibited by nontonal-language learners before 9 months of age is not limited to English-learning infants; a similar trend was observed in infants learning languages of other rhythmic classifications.
In another study, not only infants learning a stress-timed language (e.g., English) but also 9-month-old infants learning a syllable-timed language (e.g., French) were unable to distinguish a Thai tonal contrast (rising vs. low level) in a preferential-looking paradigm, whereas 6-month-olds of the same language background were able to discriminate the same nonnative tones (Mattock, Molnar, Polka, & Burnham, 2008). The results of the studies by Mattock and colleagues on nontonal-language-learning infants demonstrated that infants undergo a considerable change in their perceptual sensitivity to lexical tones between the ages of 6 and 9 months.

Pitch (or fundamental frequency, F0) is the main acoustic parameter for lexical tones and is also an acoustic feature of vowels in syllables (see Fon, Chap. 2 of this volume). The ability to discriminate between nonnative vowels has been observed in infants between 4 and 6 months of age (Polka & Werker, 1994). Additionally, 6-month-old infants demonstrate a perceptual magnet effect for vowels; their perceptual discrimination is poorer for vowels similar to prototypical vowels of their native language than for vowels near category boundaries (Kuhl, Williams, Lacerda, Stevens, & Lindblom, 1992). Pitch is an acoustic feature of both vowels


and lexical tones; just as listening to a native language shapes vowel perception in infancy, infants’ ability to perceive pitch in syllables might likewise be shaped by language experience. Researchers have therefore questioned whether perceptual reorganization proceeds similarly for vowels and lexical tones. Yeung and colleagues used the familiarization-looking procedure to examine whether 4- and 9-month-old English-learning infants were able to discriminate between the high-rising (Tone 25) and mid-level (Tone 33) tones of Cantonese. The results indicated that 4-month-old infants looked longer at alternating tone trials (e.g., Tone 25 + Tone 33) than at nonalternating tone trials (e.g., Tone 25 + Tone 25) in the testing phase, reflecting an ability to discriminate the tones, but 9-month-old infants did not distinguish the tonal difference (Yeung et al., 2013). These results showed that English-learning infants’ perceptual sensitivity to tonal contrasts declined between 4 and 9 months of age (Yeung et al., 2013). Extending beyond pitch contours typical of a specific tonal language, Cabrera et al. (2015) adopted a low-flat versus low-rising pitch contour on [ba] to explore French-learning infants’ development of the ability to discriminate between nonnative lexical tones; the tone pair was not designed to mimic the contours of lexical tones in any specific language. The results indicated that 6-month-old French-learning infants were able to distinguish the tonal contrast, but 10-month-olds with the same language background did not (Cabrera et al., 2015). Thus, the cited studies have all indicated that infants exhibit a developmental trend of decreasing perceptual sensitivity to nonnative tone contrasts during the second half of the first year of life.
Shi and colleagues adopted the habituation paradigm to examine the developmental trends of infants’ ability to discriminate between nonnative lexical tones; specifically, French-learning infants between 4 and 11 months of age were studied for their discrimination of contrasts between Mandarin tones (Shi, Santos, Gao, & Li, 2017b). The contrasting tones varied in acoustic salience, with Tone 1 (level) and Tone 4 (falling) representing the most acoustically dissimilar pair and Tone 2 (rising) and Tone 3 (dipping) representing the pair with the greatest acoustic similarity. When French-learning infants were presented with the acoustically similar pair, Tone 2 versus 3, the infants’ looking-time difference between same trials (habituated stimuli; e.g., Tone 2) and different trials (new stimuli; e.g., Tone 3) was inversely correlated with age between 4 and 11 months. However, when presented with Tones 1 and 4, French-learning infants of 4, 8, and 11 months of age were able to distinguish the contrast, demonstrating a trend of maintaining sensitivity to nonnative lexical tones. In addition to revealing trends in infants’ perception of contrasts in nonnative tones, Shi et al. (2017b) also identified a trend that the ability to distinguish between acoustically dissimilar nonnative tones might be maintained up to an infant’s first birthday. Furthermore, when the acoustically salient contrast of Mandarin Tone 2 (rising) and Tone 4 (falling) was used in a habituation paradigm to explore whether English-learning 19-month-old infants were able to distinguish tone contrasts in nonnative languages, English-learning infants looked longer during the change trials (i.e., new stimuli) than during the nonchange trials (i.e., habituated stimuli)


in the test phase, suggesting that infants learning nontonal languages maintain the ability to distinguish acoustically salient tone contrasts after their first birthday (Hay, Graf Estes, Wang, & Saffran, 2015).

The contrast between Mandarin Tones 1 and 4 was also used in a study by Liu and Kager (2014). The researchers employed the habituation paradigm to explore the developmental trends exhibited by Dutch-learning 5–18-month-old infants in their discrimination of nonnative lexical tones. Two sets of Tone 1 versus 4 contrasts were implemented in the study: one set with a relatively large pitch difference and another set with a smaller pitch difference. When the contrasting tones were presented with the larger pitch difference, differences in looking time between habituated tone stimuli (i.e., Tone 1) and new tone stimuli (i.e., Tone 4) were statistically significant for Dutch-learning infants at all the studied ages (5–6, 8–9, 11–12, 14–15, and 17–18 months). This finding indicated that perceptual sensitivity to nonnative lexical tones was maintained in this age range. However, when the Dutch-learning infants were tested with the tone contrasts presented with the smaller pitch difference, infants in the youngest (i.e., 5–6 months) and oldest (i.e., 17–18 months) groups were able to distinguish the same tone contrast, but infants aged approximately 1 year (8–9, 11–12, and 14–15 months) failed to distinguish between Tones 1 and 4. This result suggested that sensitivity to nonnative tones decreases between 5 and 9 months of age but that sensitivity to differences in nonnative lexical tones increases between 14 and 18 months of age. A U-shaped developmental curve for infants from 6 to 18 months of age with regard to their ability to discriminate between nonnative lexical tones was also reported in a study of German-learning infants (Götz, Yeung, Krasotkina, Schwarzer, & Höhle, 2018).
These nontonal-language learners of 6 and 9 months of age were tested with the habituation paradigm to determine whether they distinguished the contrast between the Cantonese high-rising (Tone 25) and mid-level (Tone 33) tones. In the test phase of the tone discrimination procedure, the difference in listening time to the habituated tone (i.e., Tone 25) and the novel tone (i.e., Tone 33) was statistically significant for the 6-month-old infants, but not for the 9-month-old infants. Thus, the trend of decreasing sensitivity to nonnative lexical tones was observed in German-learning infants between 6 and 9 months of age. When German-learning infants were presented with contrasting Cantonese tones in a familiarization paradigm, only 18-month-olds were able to distinguish the contrast in the nonnative tones; the 6- and 9-month-olds failed to detect tone differences (Götz et al., 2018). In summary, the combined results of studies that used various methods to assess German-learning infants’ tone discrimination performance revealed that the sensitivity of these nontonal-language learners to nonnative lexical tones increased after their first birthday. The results of Götz et al. (2018) and Liu and Kager (2014) demonstrated that nontonal-language learners aged between 6 and 18 months exhibit a U-shaped developmental trajectory in their abilities to distinguish between tones.


trends in their discrimination of nonnative lexical tones. In one study, French-learning infants were tested with the contrast between Mandarin Tones 2 (rising) and 3 (dipping), which are the Mandarin tones with the most similar acoustic features. Under the habituation paradigm, the difference in looking time between habituated (i.e., Tone 2) and different (i.e., Tone 3) stimuli in the test phase was significant among infants aged 4, 8, and 11 months; however, further data analysis revealed a trend of decreasing sensitivity to contrasts between acoustically similar nonnative tones (Shi et al., 2017b). The contrast between Mandarin Tones 2 and 3 was also adopted by Chen and Kager (2016) in their habituation-paradigm study of Dutch-learning 4–12-month-old infants’ developmental trends in discriminating nonnative tones. The difference in looking time between habituated and different stimuli was significant among Dutch-learning infants aged 6 and 12 months but not among those aged 4 months. Thus, Dutch-learning infants failed to discriminate between Tones 2 and 3 at 4 months; with further development, however, they distinguished this acoustically similar tone pair, demonstrating that perceptual sensitivity to nonnative tones increases in infants between 4 and 6 months of age. In summary, the same phonetic discrimination procedure (i.e., the habituation paradigm) was applied with the contrast between Tones 2 and 3 to explore trends in the ability to distinguish nonnative lexical tones in 4–12-month-old French-learning infants (Shi et al., 2017b) and Dutch-learning infants (Chen & Kager, 2016), yet the results of these studies were inconsistent.

A trend of increasing sensitivity to nonnative lexical tones has been reported for 7- and 11-month-old English-learning infants tested with the contrast between Mandarin Tones 1 (level) and 3 (dipping) (Tsao, 2017).
The pitch contours of Tones 1 and 3 differ markedly; thus, this tone pair presents an acoustically salient contrast. In a study conducted with the conditioned head-turn procedure, Tsao (2017) reported that 11-month-old English-learning infants exhibited greater ability to distinguish contrasts between nonnative tones than did 7-month-olds with the same language background. Furthermore, 7-month-old English-learning infants performed similarly to Mandarin-learning infants of the same age in distinguishing between Mandarin Tones 1 and 3, implying that the effect of listening to a nontonal language on discriminating lexical tones was apparent near the first birthday. A recent study that assessed English-learning infants’ tone discrimination performance also reported a trend of increasing perceptual sensitivity to contrasts in nonnative tones during the second half of the first year of life and indicated that this increase in perception of contrasts between nonnative tones varied according to the tones’ acoustic salience (Singh et al., 2018). Specifically, English-learning infants were tested at 6, 9, and 12 months of age under the stimulus-alternating paradigm, a modified version of the habituation paradigm. Both acoustically salient (i.e., Tone 1 vs. 3) and similar (i.e., Tone 2 vs. 3) pairs of Mandarin tones were presented within the same testing session (Singh et al., 2018). Singh et al. (2018) discovered that the English-learning infants could not detect differences in either tone pair at 6 months, but they were able to distinguish the acoustically salient contrast at 9 months. At 12 months of age, the English-learning infants distinguished the contrasts in both Mandarin tone pairs. In summary, the trend reported by Singh et al. (2018) reflected increasing sensitivity to nonnative lexical


tones, and this developmental trajectory was similar to that reported by Tsao (2017). However, Singh et al. (2018) did not observe the trend of perceptual decline in 6-to-9-month-old infants’ discrimination of nonnative tones that had been reported in previous studies (Liu & Kager, 2014; Mattock & Burnham, 2006).

In brief, most studies have reported a trend of perceptual decline in nontonal-language learners between 6 and 12 months of age with regard to discriminating nonnative lexical tones, especially for acoustically similar tone pairs (Liu & Kager, 2014; Mattock & Burnham, 2006; Mattock et al., 2008; Shi et al., 2017b; Yeung et al., 2013). Researchers have also reported that 4–12-month-old nontonal-language-learning infants exhibit trends of increasing (Chen & Kager, 2016; Singh et al., 2018; Tsao, 2017) and maintaining (Shi et al., 2017b) perceptual sensitivity to contrasts between nonnative tones. Between 14 and 18 months of age, studies have reported a trend of increasing sensitivity, with nontonal-language-learning infants again perceiving tonal differences that they were unable to distinguish at younger ages; thus, nontonal-language-learning infants appear to show a U-shaped trend of tonal perception between 6 and 18 months of age (Götz et al., 2018; Liu & Kager, 2014). In further studies, researchers may consider using alternative tone discrimination paradigms (e.g., the conditioned head-turn procedure) to determine whether discrimination of acoustically similar tone contrasts increases or decreases over the course of early infancy and to assess the role of acoustic salience in the development of nonnative tonal perception.

10.2.2 Native Lexical Tone Learning in Infants with Monolingual Backgrounds

Lexical tones are critical phonetic units of words in tonal languages, and tonal-language-learning infants must develop accurate perceptual representations of lexical tones to learn words in their native languages. However, most studies of tonal perception development in early infancy have assessed tone discrimination performance in nontonal-language-learning infants. Only a few studies have examined the changes that tonal-language learners younger than 1 year of age exhibit in perceptual sensitivity to lexical tones. Among the findings of these studies, a trend was identified that infants may develop preferential listening for native lexical tones as early as 4 months of age. For example, Mandarin-learning infants’ discrimination between the contrasting rising (Tone 25) and level (Tone 33) tones of Cantonese was similar at 4 and 9 months of age (Yeung et al., 2013). When 4-month-old Cantonese- and Mandarin-learning infants were tested under the preferential-looking paradigm, infants looked longer at the alternating trials than at the nonalternating trials in the testing phase, and the looking-time pattern differed between the two language groups. For Mandarin-learning 4-month-old infants, the looking-time difference between alternating and nonalternating trials was only significant when nonalternating trials


featured Cantonese Tone 33 rather than Tone 25. For Cantonese-learning 4-month-old infants, the looking-time difference between alternating and nonalternating trials was significant when the nonalternating trials featured either Cantonese Tone 25 or Tone 33. The direction of the tonal difference between alternating and nonalternating trials thus affected the tone discrimination performance only of Mandarin-learning 4-month-old infants; no such directional effect was observed in Cantonese-learning 4-month-old infants’ discrimination of their native lexical tones, indicating that Cantonese-learning infants discriminated between the lexical tones of Cantonese more accurately than did Mandarin-learning infants at 4 months of age.

Researchers have reported that nontonal-language learners exhibit perceptual narrowing of nonnative lexical tones after 9 months of age (Cabrera et al., 2015; Mattock & Burnham, 2006; Mattock et al., 2008; Yeung et al., 2013). By contrast, this trend of decreasing sensitivity to nonnative lexical tones has not been observed in tonal-language-learning infants. When Mandarin-learning 6- and 9-month-old infants were tested with contrasting Thai tones (contour vs. contour and level vs. contour) in the conditioned head-turn procedure, the infants in both age groups performed equally well in discriminating the contrasts (Mattock & Burnham, 2006). Additionally, Mandarin-learning 9-month-olds were able to distinguish between Cantonese lexical tones (Yeung et al., 2013). In summary, studies examining tonal-language-learning infants’ perception of contrasts between nonnative lexical tones have revealed that experience listening to native lexical tones may not only enhance infants’ sensitivity to the acoustic features of lexical tones in their native language but also stabilize their ability to discriminate contrasting tones of nonnative languages.
Extending the finding that perceptual sensitivity to nonnative lexical tones remains similar in tonal-language learners between 4 and 9 months of age, researchers have examined whether increasing exposure to the lexical tones of infants’ native language increases their sensitivity to contrasts between native tones. Tests using the familiarization paradigm revealed that Cantonese-learning 4- and 9-month-old infants performed equally well in distinguishing between the level and rising tones of their native language (Yeung et al., 2013). The divergent trends of decreasing sensitivity to nonnative phonemes and increasing sensitivity to native phonemes have been consistently reported in studies exploring developmental trends of consonant and vowel perception in infants between 6 and 12 months of age (Kuhl, Ramirez, Bosseler, Lin, & Imada, 2014; Yeung et al., 2013). As addressed in the previous section, a decline in the perception of contrasts between nonnative lexical tones was observed in nontonal-language-learning infants older than 9 months. If the divergent trends of tonal perception are similar to those of phonemic perception development, infants should exhibit increased sensitivity to the lexical tones of their native language after 9 months of age. In addition to listening experience with lexical tones, the acoustic salience of lexical tones affects infants’ lexical-tonal perception between 9 and 12 months of age. Dutch-learning infants aged between 9 and 12 months exhibited a perceptual decline in their discrimination of Mandarin tones when the pitch contour difference between the tested lexical tones was of relatively low salience (Liu & Kager, 2014).

10 Lexical-Tonal Perception Development in Infancy


Studies of Mandarin-learning 7- and 11-month-old infants’ perception of the lexical tones of their native language have revealed that exposure to lexical tones and the acoustic salience of tone pairs determine changes in the development of tonal perception (Tsao, 2017). In a conditioned head-turn procedure, Mandarin-learning 10–12-month-old infants exhibited greater ability to distinguish between Tones 1 and 3, which exhibit a salient contrast, than did 6–8-month-old infants of the same language background (Tsao, 2017). Conversely, the Mandarin-learning infants of the two age groups performed similarly in discriminating between the acoustically similar pair, Tones 2 (rising tone) and 3 (dipping tone), as well as between the acoustically distinct Tone 2 (rising tone) and Tone 4 (falling tone) (Tsao, 2008). Increases in the ability to distinguish between acoustically salient native tones were observed in infants aged approximately 1 year, but 7- and 11-month-olds exhibited similar perceptual sensitivity to native tones that did not differ substantially in pitch height or contour (Tsao, 2017). Increasing perceptual sensitivity to native lexical tones has also been observed in infants in the second half of the first year of life. In a recent study, the stimulus-alternating paradigm was used to assess the tone discrimination performance of Mandarin-learning infants at 6 and 9 months of age for both acoustically distinct and acoustically similar tone pairs (i.e., Tone 1 vs. 3 and Tone 2 vs. 3; Singh et al., 2018). At 6 months of age, the Mandarin-learning infants distinguished the contrast between Tones 1 and 3, but not that between Tones 2 and 3. At 9 months, the Mandarin-learning infants discriminated between tones in both the acoustically salient and the acoustically similar pairs.
In summary, between 6 and 9 months of age, Mandarin-learning infants exhibited considerable changes in their ability to discriminate between acoustically similar native tones, whereas their ability to distinguish between tones with an acoustically salient contrast remained stable (Singh et al., 2018). Tsao (2017) reported increases in the ability to distinguish between acoustically distinct tones (i.e., Tone 1 vs. 3 of Mandarin) in Mandarin-learning infants aged approximately 10–12 months, and Singh et al. (2018) reported a substantial change in the discrimination of acoustically similar tones (i.e., Tone 2 vs. 3) in Mandarin-learning infants aged approximately 9 months. Both of these studies clearly revealed a trend of increasing perceptual sensitivity to native lexical tones in infants between 6 and 12 months of age. Fine-tuning to the pitch contours of native lexical tones might be essential to enhance tonal-language learners’ performance in detecting tonal differences within their native language. In an assessment in which the pitch height and duration of lexical tones were held constant and pitch contour was the only acoustic parameter for discriminating between tonal stimuli, Mandarin-learning 10–12-month-old infants performed better than Mandarin-learning 6–8-month-olds in distinguishing between tones with pitch contours similar to those of Mandarin Tones 1 and 3, but the age effect was not significant when the pitch contours of the tones (level vs. mid-falling) were not similar to those of Mandarin tones (Tsao, 2017). This study suggests that fine-tuning occurs in Mandarin-learning infants’ perception of the major acoustic cue of native lexical tones, namely pitch contour, at approximately 10–12 months of age, and that this trend contributes to greater accuracy in their discrimination between native lexical tones.


In addition to influencing the development of the ability to distinguish contrasts between native tones in infants between 6 and 12 months of age, the acoustic salience of tone contrasts also affects native tonal perception development in infants aged approximately 12–13 months. A recent study used the alternative discrimination paradigm, under which infants were examined in a habituation phase (e.g., with Tone 3), first test phase (e.g., with Tone 1), refamiliarization phase (e.g., with Tone 3), and second test phase (e.g., with Tone 2). The results indicated that the Mandarin-learning 12–13-month-olds distinguished between both similar tones (i.e., Tone 2 vs. 3) and distinct tones (i.e., Tone 1 vs. 3; Singh, Poh, & Fu, 2016). Acoustic salience played a role in tone discrimination: the effect size of the looking-time difference between tones was larger for the distinct-tone pair than for the similar-tone pair (Singh et al., 2016). Tsao (2008) utilized the conditioned head-turn procedure to explore whether Mandarin-learning 12-month-old infants performed differently in discriminating between tones with varying degrees of acoustic salience. The study demonstrated that the infants distinguished between Tones 1 and 3, the pair with the most salient contrast, more accurately than between Tones 2 and 3 or Tones 2 and 4. However, the advantage of a greater difference in pitch contour for tone discrimination was limited to one presentation direction of Tone 1 versus 3 in the head-turn procedure. In this procedure, infants heard change and nonchange tone trials. The nonchange trials presented speech stimuli (e.g., Tone 1) that were also used as background stimuli throughout the discrimination procedure. The change trials presented speech stimuli (e.g., Tone 3) that differed from the background stimuli.
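Performance in the conditioned head-turn procedure described above is typically summarized by comparing head turns on change trials (hits) with head turns on nonchange trials (false alarms), for example with a sensitivity index such as d′. The following sketch uses hypothetical trial counts, not data from the cited studies:

```python
from statistics import NormalDist

def d_prime(hits, change_trials, false_alarms, nonchange_trials):
    """Sensitivity index from head-turn responses. Rates are nudged away
    from 0 and 1 so the z-transform stays finite on perfect scores."""
    hit_rate = (hits + 0.5) / (change_trials + 1)
    fa_rate = (false_alarms + 0.5) / (nonchange_trials + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical infant: turns on 22 of 30 change trials and 6 of 30
# nonchange trials; equal hit and false-alarm rates give d' of 0.
print(round(d_prime(22, 30, 6, 30), 2))
print(d_prime(15, 30, 15, 30))
```

A d′ near 0 indicates chance-level responding; larger values indicate that the infant turned selectively on change trials.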
Tsao (2008) observed an asymmetrical effect on tone discrimination: infants perceived the difference between Tones 1 and 3 more readily when Tone 1 changed to Tone 3 than when the tones were switched in the other direction in change and nonchange trials. Directional asymmetry associated with the presentation of speech stimuli in phonetic discrimination is not unique to infants’ discrimination between lexical tones. Studies on phonetic perception development have also reported directional asymmetry in 6–12-month-old infants’ perception of consonant and vowel contrasts (Nam & Polka, 2016; Polka & Bohn, 2003; Polka & Werker, 1994). In the [u]-[y] vowel contrast, infants more readily perceived the contrast in the change from [y] to [u] than in the change from [u] to [y], suggesting that some vowels (e.g., corner vowels) serve as references for vowel differentiation and that both listening experience with a native language and perceptual biases shape the development of vowel perception (Polka & Bohn, 2011). Because few studies have demonstrated directional asymmetry in tone discrimination, the debate regarding whether perceptual asymmetry in differentiation between lexical tones indicates the perceptual organization of lexical tones remains open. The conditioned head-turn procedure has been used to examine whether tonal-language learners could discriminate between the most acoustically similar tone pair, Tone 2 versus 3 (Tsao, 2008, 2017). A further study adopted an alternative procedure, the habituation paradigm, to examine Mandarin-learning 4–13-month-olds’ developmental changes in discriminating between the most acoustically similar tones (Shi, Gao, Achim, & Li, 2017a). Shi et al. (2017a)


revealed that this group of tonal-language learners was able to distinguish these acoustically similar tones, and no significant trend of increased ability to discriminate between native tones was observed among the infants aged between 4 and 13 months. The consistent results of studies that used different tone discrimination procedures (Shi et al., 2017a; Singh et al., 2018; Tsao, 2017) suggest that greater acoustic differences between native lexical tones facilitate the development of tonal perception in infants younger than 1 year.

10.2.3 Bilingual Infants Learning Two Nontonal-Languages or One Tonal Language and One Nontonal-Language

Monolingual and bilingual infants develop similar abilities to perceive the consonants and vowels of their native languages by the end of infancy; however, emerging studies have revealed different growth patterns between these two groups of infants with regard to their development of language-specific phonetic perception. At 6 months of age, both Spanish monolingual infants and Catalan-Spanish bilingual infants were able to discriminate between the Spanish [o] and [u] vowels (Sebastian-Galles & Bosch, 2009). However, at 8 months of age, only the monolingual infants, not the bilingual infants, detected the phonetic difference in this vowel pair. The Catalan-Spanish bilingual infants regained their ability to discriminate between the Spanish vowels at approximately 12 months of age. Thus, Sebastian-Galles and Bosch (2009) revealed a U-shaped developmental trajectory for bilingual infants in native vowel perception development, a pattern that was not observed in monolingual infants. This finding provokes further inquiry regarding whether the dual-language exposure of bilingual infants induces developmental patterns of tonal perception that differ from those of monolingual infants. Regarding listening experience with lexical tones, a study explored tonal perception changes in two groups of bilingual infants (Liu & Kager, 2017b). One group was learning two nontonal-languages, such as Dutch and English; thus, this group of infants did not hear any lexical tones in their homes. Infants in the other group were being raised in homes with speakers of one tonal language and one nontonal-language, such as Mandarin and English. These bilingual infants were not only exposed to lexical tones in their homes but also listened to words and sentences from a nontonal-language.
Liu and Kager (2017b) used the habituation paradigm to explore the developmental trajectory of lexical-tonal perception in the bilingual infants who had listened to two nontonal-languages (e.g., Dutch and English, Dutch and German). As in monolingual infants learning one nontonal-language, a decline was evident in the bilingual infants’ ability to discriminate between nonnative tones (i.e., Mandarin Tone 1 vs. 4) at approximately 8–9 months of age. In addition, the U-shaped trajectory for perception of the difference between nonnative lexical tones of low acoustic salience that has been observed among monolingual infants between 6 and 18 months of age was also observed among the bilingual infants. However, the age


of regaining tone discrimination performance was approximately 11–12 months in bilingual infants, whereas monolingual infants failed to discriminate between the same nonnative tones until 17–18 months of age (Liu & Kager, 2017b). In other words, the bilingual infants learning two nontonal-languages developed the ability to detect differences in tones that were not lexical in their native languages 6 months earlier than monolingual infants did, suggesting that exposure to more than one language enhanced tonal perception development in bilingual infants younger than 1 year. Further research has been conducted to determine whether this tonal perception advantage is also evident in bilingual infants raised in families speaking both a tonal language (e.g., Mandarin) and a nontonal-language (e.g., English). English–Mandarin bilingual infants were tested in an alternative discrimination procedure. The infants were examined in a habituation phase, first test phase, refamiliarization phase, and second test phase at 6, 9, and 12 months of age with regard to their ability to discriminate between Mandarin tone pairs (Singh et al., 2018). When presented with tones of low acoustic salience (i.e., Tone 2 vs. 3), the English–Mandarin bilingual infants failed to distinguish the tone contrast from 6 to 12 months of age. When the English–Mandarin bilingual infants were tested with the most saliently contrasting pair (i.e., Tone 1 vs. 3), the bilingual learners still could not detect the tonal difference at any of the three ages. By contrast, Singh et al. (2018) reported that English-learning monolingual infants failed to discriminate between the two nonnative tones at 6 months of age but discriminated between both nonnative tone pairs at 12–13 months of age.
Compared with the improvement in tone discrimination performance in English-learning infants between 6 and 12 months of age, the tone discrimination performance of English–Mandarin bilingual infants did not change substantially over the same age range. Therefore, English–Mandarin bilingual infants are slower in developing the ability to discriminate among lexical tones than are English monolingual infants. Studies have consistently revealed that bilingual infants are similar to monolingual nontonal-language-learning infants in their ability to distinguish between lexical tones at 6 months of age, but that the developmental trends of tone discrimination diverge between monolingual and bilingual infants after the first birthday (Liu & Kager, 2017b; Singh et al., 2018). Additionally, studies have agreed that divergent trajectories in bilingual infants’ perception of lexical tones emerge between 6 and 12 months of age. Development is more rapid in bilingual infants learning two nontonal-languages than in monolingual nontonal-language learners (Liu & Kager, 2017b) but slower in bilingual infants learning one tonal language and one nontonal-language (Singh et al., 2018). The available studies adopted similar tone discrimination procedures (i.e., the habituation paradigm) and contrasting Mandarin tone pairs; thus, methodological differences do not account for this complex picture of bilingual tonal perception. Exposure to one tonal language and one nontonal-language may demand additional perceptual processing as infants determine whether the pitch of a syllable is lexical in the tonal language or nonlexical in the nontonal-language. The cognitive cost of this “code-switching” in bilingual infants learning one tonal language and one nontonal-language may hinder the infants’ ability to quickly learn the lexical


tones of one of their ambient languages. Few studies have explored the developmental trajectories of lexical-tonal perception in bilingual infants who have listened to both tonal and nontonal-languages; additional studies examining tone discrimination performance in this group of bilingual infants are thus essential not only to verify their rate of development in distinguishing between tones, but also to determine whether simultaneous exposure to two languages advances the development of tonal perception.

10.3 Factors Associated with Development of Tonal Perception in Infancy

Many factors contribute to the fine-tuning of lexical-tonal perception that occurs in early infancy. The developmental trends for discriminating native and nonnative lexical tones discussed in the previous sections clearly indicate that listening to a tonal language shapes infants’ perceptual organization of lexical tones, and that exposure to a nontonal-language either increases or decreases infants’ recognition of contrasts among lexical tones during the second half of the first year of life. In addition to the factor of listening to a native language, the following paragraphs review lexical-tone-specific factors, such as acoustic salience, and general factors, such as musical training, statistical learning mechanisms, and word learning processes, that contribute to the development of lexical-tonal perception early in life.

10.3.1 Acoustic Salience of Lexical-Tone Contrasts

As discussed in the previous sections, studies have revealed that the acoustic salience of tone contrasts moderates the development of infants’ tone discrimination abilities. For example, with regard to relative pitch contour difference, acoustically distinct lexical tones (e.g., Mandarin Tone 1 vs. 3) were easier than acoustically similar tones (e.g., Mandarin Tone 2 vs. 3) for 12-month-old tonal-language-learning infants to discriminate (Tsao, 2017). Greater pitch contour difference also corresponded with increased ability to discriminate between tones among nontonal-language learners, such as Dutch- or French-learning infants (Liu & Kager, 2014; Shi et al., 2017b). The cited studies have demonstrated that the acoustic salience of tones is a tone-specific factor associated with the rate of tonal perception development, and that acoustic salience is also a language-general factor because nontonal-language learners’ ability to discriminate between tones also varied according to the degree of pitch contour similarity. For native tonal perception development, the acoustic salience of tone contrasts joins experience listening to language-specific lexical tones as a factor that propels the development of tonal perception. Older Mandarin-learning infants perform better


than younger Mandarin-learning infants in distinguishing between acoustically salient tone contrasts (Tsao, 2017). For nonnative tonal perception, the acoustic salience of tones may exhibit different effects on the developmental pace of tonal perception in nontonal-language-learning infants. Dutch-learning infants demonstrated an increasing ability to perceive the contrast between Mandarin Tones 2 and 3 between 6 and 12 months of age (Chen & Kager, 2016), but the ability of French-learning infants to perceive the same contrast decreased during the same period (Shi et al., 2017b).
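The notion of acoustic salience invoked here can be made concrete by measuring the distance between stylized pitch contours. The sketch below uses schematic five-point contours loosely based on Chao tone numerals (illustrative values, not measured F0) and a root-mean-square distance as a crude salience proxy:

```python
import math

# Schematic Mandarin tone contours as five-point pitch tracks on a 1-5
# scale, loosely following Chao tone letters: T1=55, T2=35, T3=214, T4=51.
CONTOURS = {
    "T1": [5, 5, 5, 5, 5],
    "T2": [3, 3.5, 4, 4.5, 5],
    "T3": [2, 1.5, 1, 2.5, 4],
    "T4": [5, 4, 3, 2, 1],
}

def contour_distance(a, b):
    """Root-mean-square distance between two pitch tracks: a crude proxy
    for the acoustic salience of a tone contrast."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

for p, q in [("T1", "T3"), ("T2", "T3"), ("T2", "T4")]:
    print(p, "vs", q, round(contour_distance(CONTOURS[p], CONTOURS[q]), 2))
```

Under these schematic contours, the Tone 1 versus 3 distance exceeds the Tone 2 versus 3 distance, mirroring the discrimination asymmetries reported above.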

10.3.2 Statistical Learning Mechanism

Regarding speech perception in early infancy, conditional statistical learning and distributional statistical learning are mechanisms through which preverbal infants learn novel words and phonetic categories through exposure to a native language (Thiessen & Erickson, 2012). Conditional statistical learning refers to the probability that specific sound sequences will occur, such as the sound sequence [k]-[u]-[p] corresponding to the word cup. In other words, conditional probabilities are the basis by which infants determine the combination probabilities of specific sound sequences. In conditional statistical learning, statistical information enables infants to segment words from continuous speech streams and thereby acquire new words (Saffran, Aslin, & Newport, 1996). In contrast to the probabilities of speech sequences involved in conditional statistical learning, the frequency with which a specific event occurs in a given population of events is the basis of distributional statistical learning. Infants are sensitive to the distributional probabilities (e.g., unimodal or bimodal distributions) of individual speech sounds on an acoustic continuum, such as the major temporal dimension through which infants distinguish the voicing features of stop consonants (Maye, Weiss, & Aslin, 2008). Infants aged 8 months were able to discriminate between consonants after listening to a bimodal distribution of consonants, but similar effects were not observed after infants were exposed to a unimodal distribution of consonants (Maye et al., 2008). In addition to enhancing the perception of consonants, distributional statistical learning also enables nontonal-language learners to distinguish tonal contrasts. Figure 10.1 illustrates the relative frequencies at which lexical-tone stimuli were presented in a bimodal learning condition.
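Conditional statistical learning is commonly operationalized as the transitional probability between adjacent syllables, P(B|A) = frequency(AB) / frequency(A); dips in this probability signal likely word boundaries (Saffran et al., 1996). A minimal sketch over a made-up syllable stream:

```python
from collections import Counter

def transitional_probs(syllables):
    """P(next | current) for each adjacent syllable pair in the stream."""
    unigrams = Counter(syllables[:-1])
    bigrams = Counter(zip(syllables[:-1], syllables[1:]))
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

# Hypothetical stream: the nonce word pa-bi-ku recurs amid filler syllables.
stream = "pa bi ku ti pa bi ku go pa bi ku la".split()
tps = transitional_probs(stream)
print(tps[("pa", "bi")])  # within-word transition: 1.0
print(tps[("ku", "ti")])  # across a word boundary: only 1/3
```

The high transitional probability inside pa-bi-ku and the low probability at its edges are exactly the statistical cue that the segmentation account attributes to infants.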
As previously discussed, Dutch-learning 11–12-month-old infants could not easily distinguish between Mandarin Tones 1 and 4, whose acoustic difference is of relatively low salience (Liu & Kager, 2014). After this group of Dutch-learning infants listened for approximately 180 s to an eight-step tone continuum that varied in pitch contour from Mandarin Tone 1 (level tone) to Tone 4 (falling tone) in the bimodal condition, they were able to distinguish between Tones 1 and 4 (Liu & Kager, 2014). Conversely, listening to the same tone continuum in a unimodal distribution did not enable the Dutch-learning infants to discriminate between the same Mandarin tones. In addition, Dutch-learning 14-month-old infants did not appear to benefit from the distributional learning mechanism with regard to their perception of nonnative lexical tones (Liu & Kager, 2017c). The


Fig. 10.1 Relative frequencies of tone stimuli presented in a bimodal distribution

cited studies have clearly revealed that statistical learning represents one of the most critical learning mechanisms through which preverbal infants learn nonnative lexical tones; however, the effectiveness of statistical learning for nonnative tonal perception decreases as learners increase in age. This change in the mechanism’s effect may correspond with the developmental trend whereby, after 6 months of age, older nontonal-language learners experience greater difficulty than younger nontonal-language learners in discriminating between nonnative tones (Liu & Kager, 2014; Yeung et al., 2013).
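The bimodal versus unimodal familiarization manipulation can be sketched as two frequency assignments over an eight-step continuum, in the spirit of Fig. 10.1. The token counts below are illustrative, not the counts used in the published studies:

```python
# Token counts per continuum step (1..8): an illustrative sketch of the
# bimodal vs. unimodal familiarization conditions, not the published counts.
bimodal = {1: 1, 2: 4, 3: 2, 4: 1, 5: 1, 6: 2, 7: 4, 8: 1}
unimodal = {1: 1, 2: 1, 3: 2, 4: 4, 5: 4, 6: 2, 7: 1, 8: 1}

def playlist(freqs):
    """Expand per-step token counts into a flat familiarization playlist."""
    return [step for step, n in sorted(freqs.items()) for _ in range(n)]

# Both conditions present the same number of tokens; only the shape differs.
# The bimodal condition peaks near the endpoints (steps 2 and 7), whereas
# the unimodal condition peaks at the center of the continuum (steps 4-5).
assert len(playlist(bimodal)) == len(playlist(unimodal)) == 16
```

Only the distribution of tokens along the continuum distinguishes the two conditions; on distributional learning accounts, the two peaks in the bimodal condition support the formation of two tone categories, whereas the single central peak does not.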

10.3.3 Musical Tone Exposure

Pitch contour and height are the major acoustic parameters of lexical tones. These acoustic cues are also essential for the differentiation of musical tones. Therefore, listening to musical tones that resemble the pitch contours of lexical tones may enhance the auditory processing of musical pitch and transfer to the recognition of pitch contour differences between lexical tones. Music-to-lexical-tone transfer has been reported for nontonal-language-speaking adults (Ong, Burnham, Stevens, & Escudero, 2016; see also Ong, Tan, Chan & Wong, Chap. 8 of this volume). In addition to musical exposure enhancing lexical-tonal perception, learning a tonal language also increases children’s ability to discriminate among musical tones. When both Mandarin-learning and English-learning 3–5-year-old children were presented with various musical timbres (e.g., C4-G4 on trumpet vs. C4-G4 on vibraphone), the two groups of children performed similarly in detecting timbre differences. However, Mandarin-learning children outperformed English-learning children in distinguishing between variations in musical pitch (e.g., C4-G4 vs. G4-C4) (Creel, Weng, Fu, Heyman, & Lee, 2018).


Recent studies have provided further support for the positive transfer of musical exposure to lexical-tonal perception in preverbal infants. When 4- and 12-month-old Dutch-learning infants were tested with contrasting nonnative tones (i.e., Mandarin Tone 2 vs. 3) and contrasting musical melodies (i.e., a rising 3-note melody, D4-E4-F4, vs. a falling 3-note melody, D4-C4-F4), the 12-month-old infants differentiated between both the nonnative tones and the musical melodies, but the younger infants of the same language background perceived neither the contrast in lexical tones nor that in musical tones, suggesting a connection between the perception of lexical tones and musical tones in nontonal-language learners (Chen, Stevens, & Kager, 2017). Although tones with subtle pitch differences (Mandarin Tone 1 vs. 4) were difficult for nontonal-language learners to distinguish at 9 months of age, the nontonal-language-learning infants distinguished musical tones with the same pitch contours as the Mandarin tones (Liu & Kager, 2017a). These studies indicate that nontonal-language-learning infants are more sensitive to the pitch differences of musical tones than to those of lexical tones. Future studies could expose infants to a series of musical tones to assess whether increasing infants’ ability to distinguish between the pitches of musical tones also enhances their ability to discriminate between the tones of a foreign language. Short-term musical exposure effectively enhances nontonal-language-speaking adults’ performance in perceiving lexical tones (Ong et al., 2016). This line of musical exposure research is essential to determining the strength of the music-to-lexical-tone transfer effect for infants’ learning of lexical tones.

10.3.4 Learning New Words and Lexical-Tonal Perception

In addition to listening to lexical tones, the word-referent mapping context through which infants learn words of phonetically minimal pairs, such as [pea] versus [tea], might enhance the development of lexical-tonal perception in preverbal infants. In a word learning task (Object X-Tone X; Object Y-Tone Y), English-learning 9-month-old infants were presented with two visually distinct objects on a monitor and listened to two novel words that differed only in Cantonese lexical tone (Tone 33 vs. 25) (Yeung, Chen, & Werker, 2014). Unlike the infants exposed only to the novel word-referent mapping condition in the training phase of the word learning task, infants in a referential-labeling group underwent a pretraining phase in which they watched video clips depicting correct word-object referential mapping (i.e., a visual object “car” was paired with the word “car”). The nonnative lexical-tone discrimination performance of both English-learning infant groups was then tested under the habituation paradigm. At 9 months of age, only the referential-labeling-group infants, who also exhibited a larger vocabulary size than infants in the other group, were able to discriminate the nonnative lexical tones. Thus, word-referent mapping in the native language may have helped the infants perceive phonetic differences among words and enhanced English-learning 9-month-old infants’ ability to discriminate


nonnative lexical tones. Chapter 11 by Singh in this volume further addresses the acquisition of novel words and the development of tonal perception in children.

10.4 Conclusion and Future Directions

Studies assessing lexical-tone discrimination abilities in infants learning tonal- or nontonal-languages have documented a general trend of lexical-tonal perception in early infancy. Both tonal-language learners and nontonal-language learners perform similarly in discriminating between tones before 6 months of age. Experience listening to a tonal language enhances tonal-language learners’ ability to distinguish native tones between 10 and 12 months of age. The ability of nontonal-language learners to distinguish tonal differences decreases after 9 months of age, suggesting a trend of perceptual reorganization. A general trend of divergent development between tonal- and nontonal-language learners is thus evident after 10 months of age. Additionally, the acoustic salience of tone pairs corresponds with different rates of development for learning lexical tones among both tonal- and nontonal-language learners. Emerging data have indicated that infants learning two nontonal-languages exhibit a trend of tonal perception development similar to that observed in monolingual infants, but the bilingual infants regain the ability to discriminate between contrasting tones at an earlier age than do the monolingual infants (Liu & Kager, 2017b). Exciting new directions for the further exploration of infants’ development of tonal perception are presented below.

10.4.1 Developmental Trends in Tonal Perception

Compared with the number of studies on infant discrimination among consonants and vowels in native and nonnative languages, few studies on the development of lexical-tonal perception in infants have been conducted. Most studies of tone discrimination published in the past decade have reported similar trends in the perception of native and nonnative tone contrasts for infants between 6 and 12 months of age. However, some studies that used the same contrasting tones (e.g., Mandarin Tone 2 vs. 3) have yielded inconsistent findings regarding increases or decreases in infants’ ability to discriminate between nonnative lexical tones (Chen & Kager, 2016; Shi et al., 2017b). These inconsistencies indicate that the developmental trajectory of monolingual infants and the transfer effect of learning a nontonal-language on tonal perception have not yet been fully defined and may be further analyzed in future studies.


10.4.2 Models of Tonal Perception Development

As described, studies have identified the general trends of infants’ perception of native and nonnative tones and examined factors associated with these developmental trends. Two models of tonal perception development have been proposed. These models account for the effects of language experience on the perceptual organization of lexical tones. First, according to the Perceptual Assimilation Model (PAM), a listener’s experience with the phonological and phonetic properties of native language prosodic systems (e.g., tone, pitch accent, and intonation) affects their perception of nonnative tones (Reid et al., 2015; So & Best, 2010). Adults’ ability to discriminate between tones is very good in two-category (TC) assimilation, wherein two lexical tones are perceptually assimilated into two phonetic categories; very poor in the single-category (SC) pattern, wherein two lexical tones are judged as one phonetic category; and intermediate in the category goodness (CG) difference pattern. The PAM predicts a decline in the ability to perceive nonnative lexical tones during infancy when a tone contrast forms an SC pattern, and suggests that nontonal-language learners might maintain or increase their sensitivity to nonnative tone contrasts in the CG pattern. An alternative model to explain tonal perception development in infancy is the Processing Rich Information from Multidimensional Interactive Representations (PRIMIR) framework (Curtin & Werker, 2018). The PRIMIR assumes that speech input to infants involves multiple types of information and that infants’ initial biases (e.g., preferences for speech, native language rhythm, and infant-directed speech (IDS)) combine with learning mechanisms (e.g., statistical learning) and the requirements of specific language tasks (e.g., phonetic discrimination and word learning tasks) to facilitate the organization of the information and the extraction of speech signals from the ambient language.
Regarding tonal perception development, PRIMIR assumes that the acoustic correlates of lexical tones (pitch) provide valuable information for infants; for example, the pitch of speech sounds can be perceived both as a tone that defines the lexical meaning of a syllable and as a signal of emotional valence. Supporting evidence for PRIMIR includes the divergent developmental trends for tone discrimination in tonal- and nontonal-language learners, the effect of acoustic salience on these developmental trends, and 18-month-old infants' ability to distinguish between lexical tones but failure to use these tonal distinctions to guide word learning (Curtin & Werker, 2018). Both the PAM (Best, 1994; Best, McRoberts, & Goodell, 2001; Best, McRoberts, & Sithole, 1988) and PRIMIR (Curtin, Byers-Heinlein, & Werker, 2011; Werker & Curtin, 2005) were originally developed to explain the perceptual development of consonants and vowels in preverbal infants. The aforementioned findings on tonal perception development in infants who have experienced simultaneous dual-language exposure raise new possibilities for further development of the PRIMIR model. For bilingual infants exposed to two nontonal languages, the pitch of the speech input is the acoustic cue through which sentence types, affect, and speaker gender are identified. Considering these multiple functions of pitch, researchers may further develop PRIMIR based on the extent to which listening to two ambient languages shifts bilingual infants' use of dynamic filters (i.e., initial biases and the requirements of specific language tasks) in their representational planes (i.e., the general perceptual, phoneme, and word-form planes) (Curtin & Werker, 2018).

In the past decade, studies assessing the ability of tonal- and nontonal-language-learning infants to discriminate between contrasting lexical tones have accumulated sufficient data to reveal the general trends of tonal perception development for infants between 6 and 12 months of age. In addition to listening to a tonal language, several factors can enhance infants' ability to distinguish lexical tones, including higher acoustic salience of tones, bimodal statistical learning, exposure to musical tones, referential word-learning conditions, and simultaneous learning of two nontonal languages. Several topics and methods remain open to further investigation, such as the use of alternative methods to examine factors that influence the pace of infants' development in lexical tone perception and the establishment of comprehensive frameworks to explain tonal perception development in monolingual as well as bilingual infants.
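The "acoustic salience" factor noted above can be made concrete with the conventional Chao tone letters for the four Mandarin tones (Tone 1 = 55, Tone 2 = 35, Tone 3 = 214, Tone 4 = 51), which place each tone's pitch contour on a 1–5 scale. The sketch below is illustrative only: the three-point sampling of each contour (including the interpolated midpoint for Tones 2 and 4) is a simplification, not a claim about actual F0 trajectories. It shows that Tone 2 lies acoustically closer to Tone 3 than Tone 1 does, consistent with the greater confusability of the Tone 2/Tone 3 pair reported in the literature.

```python
import math

# Stylized Mandarin tone contours sampled at three points on the 1-5
# Chao pitch scale (T1 = 55, T2 = 35, T3 = 214, T4 = 51). The three-point
# sampling and interpolated midpoints are simplifications for illustration.
CONTOURS = {
    "T1": [5, 5, 5],  # high level
    "T2": [3, 4, 5],  # rising
    "T3": [2, 1, 4],  # falling-rising (dipping)
    "T4": [5, 3, 1],  # falling
}

def contour_distance(a, b):
    """Euclidean distance between two sampled pitch contours."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(CONTOURS[a], CONTOURS[b])))

# The acoustically subtle pair (T2/T3) is closer than the salient pair (T1/T3).
assert contour_distance("T2", "T3") < contour_distance("T1", "T3")
```

Under this toy metric, contrasts with larger contour distances correspond to the "higher acoustic salience" conditions under which infants discriminate tones more readily.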

References

Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In J. C. Goodman & H. C. Nusbaum (Eds.), The development of speech perception: The transition from speech sounds to spoken words (pp. 167–224). Cambridge, MA: The MIT Press.
Best, C. T., McRoberts, G. W., & Goodell, E. (2001). Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener's native phonological system. Journal of the Acoustical Society of America, 109(2), 775–794.
Best, C. T., McRoberts, G. W., & Sithole, N. M. (1988). Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. Journal of Experimental Psychology: Human Perception and Performance, 14(3), 345–360.
Cabrera, L., Tsao, F. M., Liu, H. M., Li, L. Y., Hu, Y. H., Lorenzi, C., & Bertoncini, J. (2015). The perception of speech modulation cues in lexical tones is guided by early language-specific experience. Frontiers in Psychology, 6.
Chen, A., & Kager, R. (2016). Discrimination of lexical tones in the first year of life. Infant and Child Development, 25(5), 426–439.
Chen, A., Stevens, C. J., & Kager, R. (2017). Pitch perception in the first year of life, a comparison of lexical tones and musical pitch. Frontiers in Psychology, 8.
Creel, S. C., Weng, M., Fu, G., Heyman, G. D., & Lee, K. (2018). Speaking a tone language enhances musical pitch perception in 3–5-year-olds. Developmental Science, 21. https://doi.org/10.1111/desc.12503
Curtin, S., Byers-Heinlein, K., & Werker, J. F. (2011). Bilingual beginnings as a lens for theory development: PRIMIR in focus. Journal of Phonetics, 39(4), 492–504.
Curtin, S., & Werker, J. F. (2018). PRIMIR on tone. Frontiers in Psychology, 9.
Eimas, P. D., Siqueland, E. R., Jusczyk, P., & Vigorito, J. (1971). Speech perception in infants. Science, 171, 303–306.
Götz, A., Yeung, H. H., Krasotkina, A., Schwarzer, G., & Höhle, B. (2018). Perceptual reorganization of lexical tones: Effects of age and experimental procedure. Frontiers in Psychology, 9.


Hay, J. F., Graf Estes, K., Wang, T., & Saffran, J. R. (2015). From flexibility to constraint: The contrastive use of lexical tone in early word learning. Child Development, 86(1), 10–22. https://doi.org/10.1111/cdev.12269
Kuhl, P. K., Conboy, B. T., Coffey-Corina, S., Padden, D., Rivera-Gaxiola, M., & Nelson, T. (2008). Phonetic learning as a pathway to language: New data and native language magnet theory expanded (NLM-e). Philosophical Transactions of the Royal Society B: Biological Sciences, 363(1493), 979–1000.
Kuhl, P. K., Ramirez, R. R., Bosseler, A., Lin, J. F. L., & Imada, T. (2014). Infants' brain responses to speech suggest analysis by synthesis. Proceedings of the National Academy of Sciences of the United States of America, 111(31), 11238–11245.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., & Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2), F13–F21.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., & Lindblom, B. (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255(5044), 606–608.
Lewis, M. P., Simons, G. F., & Fennig, C. D. (2015). Ethnologue: Languages of the world (18th ed.). Dallas, TX: Summer Institute of Linguistics International.
Liu, L. Q., & Kager, R. (2014). Perception of tones by infants learning a non-tone language. Cognition, 133(2), 385–394.
Liu, L. Q., & Kager, R. (2017a). Enhanced music sensitivity in 9-month-old bilingual infants. Cognitive Processing, 18(1), 55–65.
Liu, L. Q., & Kager, R. (2017b). Perception of tones by bilingual infants learning non-tone languages. Bilingualism: Language and Cognition, 20(3), 561–575.
Liu, L. Q., & Kager, R. (2017c). Statistical learning of speech sounds is most robust during the period of perceptual attunement. Journal of Experimental Child Psychology, 164, 192–208.
Mattock, K., & Burnham, D. (2006). Chinese and English infants' tone perception: Evidence for perceptual reorganization. Infancy, 10(3), 241–265.
Mattock, K., Molnar, M., Polka, L., & Burnham, D. (2008). The developmental course of lexical tone perception in the first year of life. Cognition, 106(3), 1367–1381.
Maye, J., Weiss, D. J., & Aslin, R. N. (2008). Statistical phonetic learning in infants: Facilitation and feature generalization. Developmental Science, 11(1), 122–134. https://doi.org/10.1111/j.1467-7687.2007.00653.x
Nam, Y. J., & Polka, L. (2016). The phonetic landscape in infant consonant perception is an uneven terrain. Cognition, 155, 57–66.
Narayan, C. R., Werker, J. F., & Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: Evidence from nasal place discrimination. Developmental Science, 13(3), 407–420.
Ong, J. H., Burnham, D., Stevens, C. J., & Escudero, P. (2016). Naive learners show cross-domain transfer after distributional learning: The case of lexical and musical pitch. Frontiers in Psychology, 7.
Palmer, S. B., Fais, L., Golinkoff, R. M., & Werker, J. F. (2012). Perceptual narrowing of linguistic sign occurs in the 1st year of life. Child Development, 83(2), 543–553. https://doi.org/10.1111/j.1467-8624.2011.01715.x
Polka, L., & Bohn, O. S. (2003). Asymmetries in vowel perception. Speech Communication, 41(1), 221–231.
Polka, L., & Bohn, O. S. (2011). Natural Referent Vowel (NRV) framework: An emerging view of early phonetic development. Journal of Phonetics, 39(4), 467–478.
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20(2), 421–435.
Reid, A., Burnham, D., Kasisopa, B., Reilly, R., Attina, V., Rattanasone, N. X., & Best, C. T. (2015). Perceptual assimilation of lexical tone: The roles of language experience and visual information. Attention, Perception, & Psychophysics, 77(2), 571–591.


Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926–1928.
Sebastián-Gallés, N., & Bosch, L. (2009). Developmental shift in the discrimination of vowel contrasts in bilingual infants: Is the distributional account all there is to it? Developmental Science, 12(6), 874–887.
Shi, R. S., Gao, J., Achim, A., & Li, A. J. (2017). Perception and representation of lexical tones in native Mandarin-learning infants and toddlers. Frontiers in Psychology, 8.
Shi, R. S., Santos, E., Gao, J., & Li, A. J. (2017). Perception of similar and dissimilar lexical tones by non-tone-learning infants. Infancy, 22(6), 790–800.
Singh, L., & Fu, C. S. (2016). A new view of language development: The acquisition of lexical tone. Child Development, 87(3), 834–854. https://doi.org/10.1111/cdev.12512
Singh, L., Fu, C. S. L., Seet, X. H., Tong, A. P. Y., Wang, J. L., & Best, C. T. (2018). Developmental change in tone perception in Mandarin monolingual, English monolingual, and Mandarin-English bilingual infants: Divergences between monolingual and bilingual learners. Journal of Experimental Child Psychology, 173, 59–77.
Singh, L., Poh, F., & Fu, C. S. (2016). Limits on monolingualism? A comparison of monolingual and bilingual infants' abilities to integrate lexical tone in novel word learning. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.00667
So, C. K., & Best, C. T. (2010). Cross-language perception of non-native tonal contrasts: Effects of native phonological and phonetic influences. Language and Speech, 53, 273–293.
Streeter, L. A. (1976). Language perception of 2-month-old infants shows effects of both innate mechanisms and experience. Nature, 259, 39–41.
Thiessen, E. D., & Erickson, L. C. (2012). Discovering words in fluent speech: The contribution of two kinds of statistical information. Frontiers in Psychology, 3, 590. https://doi.org/10.3389/fpsyg.2012.00590
Tsao, F. M. (2008). The effect of acoustical similarity on lexical-tone perception of one-year-old Mandarin-learning infants. Chinese Journal of Psychology, 50, 111–124.
Tsao, F. M. (2017). Perceptual improvement of lexical tones in infants: Effects of tone language experience. Frontiers in Psychology, 8.
Tsao, F. M., Liu, H. M., & Kuhl, P. K. (2006). Perception of native and non-native affricate-fricative contrasts: Cross-language tests on adults and infants. Journal of the Acoustical Society of America, 120(4), 2285–2294.
Werker, J. F., & Curtin, S. (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2), 197–234.
Werker, J. F., & Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7(1), 49–63.
Werker, J. F., Yeung, H. H., & Yoshida, K. A. (2012). How do infants become experts at native-speech perception? Current Directions in Psychological Science, 21(4), 221–226. https://doi.org/10.1177/0963721412449459
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input reorganize phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68(2), 123–139.
Yeung, H. H., Chen, L. M., & Werker, J. F. (2014). Referential labeling can facilitate phonetic learning in infancy. Child Development, 85(3), 1036–1049.

Chapter 11

Early Word Recognition and Word Learning in Mandarin Learning Children

Leher Singh

Abstract For the most part, research on early lexical processes has concentrated on Indo-European and Romance languages. As a result, past research has largely focused on the sources of phonological variation relevant to these language families (i.e., vowels and consonants) in the developing lexicon. Mandarin uses not only consonants and vowels but also lexical tones to define words. A primary focus of this chapter is on how Mandarin tones influence and constrain three fundamental lexical processes in Mandarin learners: word segmentation, word recognition, and word learning. In addition, studies with both bilingual and monolingual learners of Mandarin are reviewed. To summarize, research investigating early lexical processes in Mandarin reveals specific differences between the developmental course of tone acquisition and the course of acquisition charted for vowels and consonants. This invites an expansion of formal models and theoretical accounts of early lexical development to accommodate the influence of suprasegmental phonology on the developing lexicon.

11.1 Introduction

A fundamental objective of child language research is to describe universal pathways to native-language proficiency and, further to this objective, to develop models of language development that characterize language acquisition in all children. However, in large part, research (and consequently, theory) has been disproportionately guided by evidence drawn from the English monolingual child. Yet most language learners do not acquire English natively; more children acquire Mandarin Chinese as a native language than any other language (Dryer & Haspelmath, 2013). An empirical skew toward languages like English can limit the generalizability of theories of early language development to the larger populations who speak other languages. While not unique to the study of language development, non-probability and/or convenience sampling is not uncommon in psychological research. However, this practice does raise questions about whether developmental change applies universally or only to specific groups of learners.

As early as infancy, children acquiring a tone system demonstrate distinct developmental trajectories in tone discrimination that are not easily accounted for by theories developed for infants learning non-tone systems (e.g., perceptual narrowing). Research findings on infant tone sensitivity in perceptual discrimination tasks are covered in Chap. 10 of this volume. In this chapter, I focus on three core developments in lexical processing in learners of Mandarin Chinese: word segmentation, novel word learning, and familiar word recognition, and discuss how research findings from each area contribute to an evolving narrative on child language acquisition. Where available, comparisons of monolingual and bilingual learners of Mandarin Chinese are interwoven into each section.

L. Singh (B)
National University of Singapore, Singapore, Singapore
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_11

11.2 Word Segmentation in Mandarin Chinese Learners

A seminal study by Jusczyk and Aslin (1995) revealed that infants between 7 and 8 months of age can track familiar words in fluent speech (for a review of this literature, see Bergmann & Cristia, 2016). Although infant word segmentation engages cognitive universals, such as phonological short-term memory (Minagawa, Hakuno, Kobayashi, Naoi, & Kojima, 2017), the ability to track repetitions of words is specific to the infants' native language (Newman, Tsay, & Jusczyk, 2003; Polka & Sundara, 2012) and foreshadows later vocabulary growth in the toddler years (Newman, Ratner, Jusczyk, Jusczyk, & Dow, 2006; Singh, Reznick, & Liang, 2012). However, infants are constrained in early word segmentation in ways that bear on our understanding of the salience of tone and pitch: when words change in vocal emotion, talker gender, or vocal pitch, infants at 7–8 months incorrectly interpret these changes as signifying different words (Houston & Jusczyk, 2000; Singh, Morgan, & White, 2004; Singh, White, & Morgan, 2008). This suggests that infants over-represent and assign relevance to surface variation in speech, such as pitch movements, which limits their ability to equate repetitions of the same word. This raises questions about tone languages, which vary pitch both lexically and non-lexically. Singh and Foong (2012) investigated the word segmentation abilities of bilingual infants acquiring both Mandarin and English when tested in each of their languages, with specific attention to the influence of pitch variation on word recognition. Infants were tested at 7.5, 9, and 11 months of age. All infants were familiarized with individual monosyllabic words and then tested on their ability to recognize these familiarized words in Mandarin Chinese and in English.
Importantly, in each session, one familiarized word was matched in vocal pitch between the familiarization and test phases of the experiment, and one familiarized word changed in pitch from the familiarization phase to the test phase. Pitch variation was lexical (i.e., a tone shift) in Mandarin and non-lexical (i.e., a pitch transposition) in English. Results revealed an ability across all age groups to consistently recognize pitch-matched (English) and tone-matched (Mandarin) words. However, age-related differences emerged for words that changed in pitch (English) or in tone (Mandarin). At 7.5 months, when tested on recognition of a pitch-mismatched word in English, infants did not recognize the word, replicating previous findings with English monolingual infants (Singh et al., 2008). At this age, infants also did not recognize Mandarin words that changed in tone. At 9 months, infants' interpretations of pitch and tone reversed: when tested in English, infants recognized pitch-mismatched words; however, they also incorrectly accepted tone-mismatched words as instances of familiarized words. At 11 months, infants recognized pitch-mismatched words when tested in English, but not tonal mismatches in Mandarin, demonstrating a language-selective interpretation of pitch variation. An important question is whether the developmental trajectory observed in this study is specific to bilingual infants. Prior research suggests both similarities (Polka & Sundara, 2003; Singh, 2017) and differences (Byers-Heinlein & Werker, 2013; Polka, Orena, Sundara, & Worrall, 2017) in early word knowledge between monolingual and bilingual infants. It is possible that Mandarin monolingual infants would demonstrate a different trajectory with respect to tone interpretation, a possibility that awaits investigation. The following section moves from word segmentation to novel word learning in Mandarin Chinese.
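The key contrast in this segmentation work, between a non-lexical pitch transposition (which preserves tone identity) and a lexical tone shift (which does not), can be sketched with a toy model. This is illustrative only; the F0 values and helper names are hypothetical, not the authors' stimuli:

```python
# Illustrative toy model (hypothetical values): a pitch transposition shifts
# a whole F0 contour up or down but preserves its shape, whereas a tone
# shift changes the shape, and with it, lexical identity in Mandarin.

def transpose(contour, semitones):
    """Shift an entire F0 contour by a number of semitones (shape preserved)."""
    factor = 2 ** (semitones / 12)
    return [round(f * factor, 1) for f in contour]

def shape(contour):
    """Direction of F0 movement between successive points (+1 up, -1 down, 0 flat)."""
    return [(b > a) - (b < a) for a, b in zip(contour, contour[1:])]

rising = [200, 230, 260]           # a Tone-2-like (rising) contour, in Hz
transposed = transpose(rising, 4)  # same word spoken in a higher voice
falling = [260, 230, 200]          # a Tone-4-like (falling) contour: a different word

assert shape(rising) == shape(transposed)  # transposition keeps tone identity
assert shape(rising) != shape(falling)     # a tone shift changes it
```

The sketch makes concrete why the same physical dimension (pitch) must be treated as irrelevant surface variation in English but as lexically contrastive in Mandarin.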

11.3 Novel Word Learning in Mandarin Chinese

The majority of studies on novel word learning have focused on the acquisition of words in English and other European languages. These studies have yielded important discoveries about infants' sensitivity to vowel and consonant variation, demonstrating that infants as young as 14 months are sensitive to vowel changes in newly learned words (e.g., Mani & Plunkett, 2008, but see Curtin, Fennell, & Escudero, 2009) as well as to consonant changes (e.g., Ballem & Plunkett, 2005). By comparison, investigations of word learning in Chinese are scarce. One question of particular relevance to Mandarin populations is how infants represent not just vowels and consonants but also the lexical tones native to Mandarin Chinese.

The study of Mandarin lexical tones contributes in specific ways to our understanding of the architecture of the developing lexicon. In particular, studies with learners of English, French, and Italian have suggested that the lexicon of infants and children does not afford equal priority to different units of phonology: infants appear to place greater weight on consonants than on vowels as determinants of lexical identity, leading to a hypothesized consonant bias in word learning (e.g., Havy & Nazzi, 2009; Nazzi & Bertoncini, 2009; Nazzi, Floccia, Moquet, & Butler, 2009; but see also Floccia, Nazzi, Delle Luche, Poltrock, & Goslin, 2014; Højen & Nazzi, 2016). Such biases have been observed in adults within these language communities (e.g., Havy, Serres, & Nazzi, 2014; Lee, Rayner, & Pollatsek, 2002) and have been attributed to (i) intrinsic properties of consonants versus vowels (Cutler & Mehler, 1993; Floccia et al., 2014), (ii) differences in how infants perceive consonants and vowels in the initial state (Bonatti, Pena, Nespor, & Mehler, 2005), and (iii) the role of language experience in guiding infants' abstractions about the roles of consonants versus vowels in word learning (e.g., Højen & Nazzi, 2016; Keidel, Jenison, Kluender, & Seidenberg, 2007). The first two accounts predict similarity across populations in how infants weigh consonants and vowels in lexical processes, while the last invokes language-specific experience as a driver of phonological bias. An additional consideration is that, unlike in more commonly studied languages, the coexistence of segmental (vowel/consonant) and suprasegmental (tone) contrasts in Mandarin Chinese may change the mutual dependencies between units of phonology for Mandarin learners. Supporting evidence comes from a corpus analysis by Tong, Francis, and Gandour (2008) demonstrating that the information value, at the lexical level, contributed by vowels exceeds that contributed by consonants and by tones in Mandarin Chinese. It is therefore possible that a consonant bias is not a universal feature of development. Investigations into the acquisition of Mandarin Chinese thus provide a valuable lens through which to examine universal versus language-dependent constraints on language acquisition.

Studies investigating sensitivity to Mandarin phonological contrasts during word learning have employed looking-time measures to determine the fidelity with which infants bind Mandarin tones to newly learned words (Graf Estes & Hay, 2015; Hay, Graf Estes, Wang, & Saffran, 2015; Ma, Zhou, Singh, & Gao, 2017; Singh, Tam, Chan, & Golinkoff, 2014; Singh, Poh, & Fu, 2016; Singh & Quam, 2016). These studies have involved training infants and children on new labels for novel objects. Crucially, the labels introduced are tone-bearing syllables.
During a test phase, children's memory for the trained words is assessed in two types of test trials: one in which the label matches the trained word both segmentally and tonally, and another in which the label matches segmentally but differs in tone.

Investigating whether infants are sensitive to tone when learning new words, Singh et al. (2014) taught infants novel words in Mandarin Chinese via a preferential looking paradigm. After a training phase, infants were tested on their recognition of those words when correctly produced as well as when mispronounced in one of two ways: words were mispronounced through either a vowel substitution (a two-feature height and backness change) or a tone substitution (Mandarin Tones 2 and 4). Infants were tested at 18 and 24 months and were either bilingual learners of Mandarin and English, bilingual learners of English and another non-tone language, or monolingual learners of English. Results demonstrated that all groups were sensitive to Mandarin lexical tones at 18 months, treating tone substitutions as mispronunciations of newly learned words. This is surprising in view of the fact that tones distinguished word meanings only for the Mandarin-English bilingual learners. All groups were also sensitive to the vowel changes, which distinguished words in each of the participants' languages. Tone sensitivity and vowel sensitivity were comparable in magnitude for each group. However, at 24 months, only Mandarin-learning infants remained sensitive to both tone changes and vowel changes. In contrast, 24-month-old non-tone-language learners (English monolinguals and English/non-tone-language bilinguals) were sensitive only to vowel changes and not to tone changes, demonstrating a language-dependent sensitivity to lexical tones.

Asking a similar question via a different paradigm, Hay et al. (2015) investigated monolingual English-learning infants' sensitivity to lexical tones when learning novel words. Using a habituation-based paradigm, Hay et al. tested 14-, 17-, and 19-month-old infants on sensitivity to the same tone mispronunciations used by Singh et al. (2014), Mandarin Tones 2 and 4. Results revealed that while 14-month-old infants were sensitive to lexical tone changes, 17- and 19-month-old infants were not. Comparing tone sensitivity in non-tone-language learners, Hay et al. (2015) reported that infants learning English disregarded tone changes as determinants of meaning by 17 months, whereas Singh et al. (2014) reported that non-tone-language learners continued to bind tone to newly learned words at 18 months. These findings can perhaps be reconciled by the fact that the preferential looking paradigm employed by Singh et al. (2014) provides comparatively rich referential support relative to the habituation-based approach to novel word learning used by Hay et al. (2015). Nevertheless, both studies point to an early de facto sensitivity to Mandarin lexical tones whether or not infants are learning Mandarin, followed later by a selective attenuation in sensitivity to lexical tones in infants who are not learning a tone language. Graf Estes and Hay (2015) also investigated tone sensitivity in bilingual infants learning two non-tone languages and found that these bilingual infants integrated tone into newly learned words for longer than monolingual infants did. It should be noted that both Hay et al. (2015) and Singh et al. (2014) employed the same tone contrast, Tones 2 versus 4.
This is potentially methodologically significant, as these tones correspond to rising and falling pitch contours, respectively. In addition to serving as contrastive tones in Mandarin, rising and falling pitch contours are pragmatically contrastive in many languages, drawing an important prosodic distinction between questions and statements (Bolinger, 1978). It is therefore possible that sensitivity to these particular contrasts is influenced by their pragmatic significance. Subsequent research suggests that rising/falling pitch contours may be interpreted differently from other pitch contrasts without clear pragmatic significance (e.g., high-rising contrasts) (Burnham, Singh, Mattock, Woo, & Kalashnikova, 2018). It is possible that other tone pairs would not elicit the same sensitivity and that tone sensitivity develops asynchronously for different tone pairs, a pattern mirrored in production (Wong, Schwartz, & Jenkins, 2005) and in infant tone discrimination (Shi, Gao, Achim, & Li, 2017; Tsao, 2017).

In a study designed specifically to investigate the effects of different types of tone and intonational contrasts on tone sensitivity in novel word learning, Burnham et al. (2018) tested 18-month-old infants on their sensitivity to two types of Mandarin contrasts (Tones 1 vs. 2 and Tones 2 vs. 4) and to two closely corresponding Thai tone contrasts (high-rising and rising–falling). Monolingual Mandarin-learning infants were tested, as were Mandarin–English bilingual learners. An additional group of monolingual English learners was tested on their sensitivity to Thai or Mandarin contrasts as well as to English intonational contrasts (statement vs. question and statement vs. order). All infants were tested via the switch paradigm. Results revealed that monolingual English learners were sensitive neither to Thai or Mandarin tone contrasts nor to intonational contrasts as indicative of word meaning. In contrast, Mandarin monolingual infants were sensitive to Mandarin tones, but only to the Tone 1–Tone 2 contrast and not to the Tone 2–Tone 4 contrast. Recall that Tones 2 and 4 represent question and statement contrasts in Mandarin as well as lexical tone contrasts, which may account for why these tones were not associated with word meanings (Yuan, 2004). Furthermore, Mandarin monolingual infants were not sensitive to either Thai contrast, suggesting phonological precision in Mandarin learners' tone representations. Bilingual Mandarin–English infants demonstrated an interesting and distinctive pattern of results: like monolingual Mandarin infants, they were sensitive only to Mandarin Tones 1 versus 2 and not to Mandarin Tones 2 versus 4. However, unlike monolingual Mandarin infants, they were also sensitive to the Thai contrast that corresponds closely to Mandarin Tones 1 and 2, although not to the Thai correspondents of Tones 2 and 4. This suggests that bilingual learners of English and Mandarin maintain greater flexibility in their tone representations, accepting non-native analogues of native tones as lexically relevant distinctions. This finding converges with prior investigations of phonetic sensitivities in bilingual infants suggesting greater flexibility in bilingual learners' phonetic category boundaries for segments (Ferjan Ramirez, Ramirez, Clarke, Taulu, & Kuhl, 2016; Petitto et al., 2012; Singh, 2018). The Burnham et al. (2018) findings suggest that bilinguals may also maintain greater flexibility in their sensitivity to suprasegmental sources of lexical contrast (i.e., tones), and that tone interpretation in a novel word-learning paradigm may be constrained by tone–intonation relationships at 18 months.
Although further experimentation is needed to confirm this possibility, it may be that lexical tone contrasts that overlap with non-lexical intonation contrasts (e.g., questions/statements) are more challenging for infants to negotiate. Tone-intonation relations are one of several factors that could constrain the acquisition of lexical tones. Another factor that may determine infants’ sensitivity to lexical tones is perceptual salience. In a study designed to investigate whether infants’ sensitivity to tone in a word learning paradigm is dependent on tone salience, Singh et al. (2016) compared 12–13-month-old Mandarin monolingual and English–Mandarin bilingual infants on their sensitivity to a set of Mandarin tone contrasts. Infants were familiarized with words labeled by a syllable produced in Tone 3, a complex tone produced with a falling–rising contour. Infants were then were exposed to the word-object pair to which they were familiarized in Tone 3 as well to the familiarized object labeled by the same word in Tone 2 and to the familiarized object labeled by the same word in Tone 1. Tones 2 and 3 are reportedly the most confusable tone pair in the Mandarin tone inventory even for native learners (Shen & Lin, 1991). In contrast, Tones 1 and 3 are relatively easy to discriminate and have been shown to be the least confusable pair of Mandarin tones (Wang, Spence, Jongman, & Sereno, 1999). In an initial discrimination task, Singh et al. (2016) found that 12–13-month-old Mandarin monolingual infants could discriminate Tones 1 and 3 as well as Tones 2 and 3 in an auditory discrimination paradigm that did not require infants to map tones to meanings. However, when tones were associated with word meanings, it

11 Early Word Recognition and Word Learning in Mandarin …

205

was only at 18 months that Mandarin monolingual infants demonstrated sensitivity to a change from Tone 3 to Tone 2 and from Tone 3 to Tone 1. In contrast, English–Mandarin bilingual infants demonstrated a 6-month lead in tone sensitivity. Bilingual infants were tested on their sensitivity to tones in a word learning task both when introduced to a word embedded in a set of English carrier phrases and when introduced to the same word embedded in Mandarin carrier phrases. Bilingual infants were sensitive to lexical tones—both salient (Tone 3 vs. Tone 1) and subtle (Tone 3 vs. Tone 2)—when learning a new word in a Mandarin context. However, when learning a word in an English context, they were not sensitive to either tone contrast. Bilingual infants therefore demonstrated a precocious and language-sensitive interpretation of tones as a source of lexical contrast relative to their monolingual peers. Moreover, this precocity applied to salient tone contrasts (i.e., Tones 1 and 3) as well as to comparatively subtle tone contrasts (i.e., Tones 2 and 3). Nevertheless, by 18 months of age, Mandarin learning infants—monolingual and bilingual—appear to bind both salient and subtle tone contrasts to newly learned words. The program of studies described above focuses on novel word learning between 12 and 18 months, when infants exhibit the beginnings of a productive vocabulary. However, children continue to add to their vocabularies at an aggressive rate in the following months, demonstrating a rapid rise in vocabulary size after 18 months (Mayor & Plunkett, 2011). One might predict greater tone sensitivity over this period on account of an expanded lexical inventory, given positive relationships between vocabulary size and mispronunciation effects for segmental substitutions (e.g., Law & Edwards, 2015). In an investigation of tone sensitivity and vowel sensitivity in Mandarin monolingual toddlers, Ma et al.
(2017) used a preferential looking paradigm to compare 2- and 3-year-old children’s responses to newly learned words as well as to variants of those words (tone and vowel substitutions). Tone substitutions encompassed a shift between rising and falling tones (Tones 2 vs. 4). Vowel substitutions incorporated a three-feature change in backness, height, and roundedness. Results collapsed across age groups revealed that children were slower in general to orient toward target images when words were mispronounced via tone or vowel substitutions. Accuracy analyses, focusing on the proportionate amount of time spent fixating the target object when it was correctly pronounced and mispronounced, revealed more nuanced, age-dependent effects of mispronunciations. Specifically, 2-year-old children were sensitive to tone mispronunciations and vowel mispronunciations in equal measure, rejecting both types of mispronunciations as acceptable target labels. Upon testing older children, the authors discovered that 3-year-old children interpreted vowel substitutions as mispronunciations, as did 2-year-old children. However, contrary to expectations, 3-year-old children did not interpret tone shifts as mispronunciations, preferentially fixating the target object when it was labeled by a tonal alternation. In a second study, designed to investigate limits on Mandarin monolingual 3-year-old children’s apparent insensitivity to lexical tones, Ma et al. (2017) tested 3-year-old Mandarin monolingual children on their sensitivity to tones when additional cues were provided. Specifically, during familiarization, children were trained on tone minimal pairs with the expectation that this would draw their attention to tone

206

L. Singh

as a source of lexical contrast. Additionally, Ma et al. expanded the tone pairs in this experiment, incorporating distinct rising and falling tones (Tones 2 vs. 4) but also more confusable rising and dipping tones (Tones 2 vs. 3). When provided with supportive cues during familiarization, participants were able to map rising and falling tones (Tones 2 and 4) onto contrastive objects. However, they were not able to recognize words with which they were familiarized in the dipping tone (Tone 3). This pattern is mirrored in production, where Tone 3 is considerably more difficult to master and is only reliably produced after Tones 1, 2, and 4 (Wong et al., 2005). These findings add to a groundswell of evidence, discussed further in the next section, that tone sensitivity may decline with maturation. In the previous set of studies, tone sensitivity was investigated by familiarizing participants with word-object pairings in a highly structured fashion (i.e., infants simply viewed visual objects in conjunction with repeated presentation of an auditory label). The task demands of such paradigms deviate in potentially significant ways from learning words in social contexts, where word-meaning links are often inferred from interactions. Sensitivity in novel word learning in conversational contexts was investigated in bilingual English–Mandarin preschool children to determine whether bilingual children demonstrate a language-specific interpretation of pitch variation. When learning English and Mandarin simultaneously, learners must integrate tones selectively in Mandarin during novel word learning and disregard pitch as lexically relevant in English.
In a study designed to investigate whether preschool children were able to interpret tones in a language-selective manner, Singh and Quam (2016) tested 3- to 4- and 4- to 5-year-old Mandarin–English bilingual children on their sensitivity to tone shifts when learning words in English and in Mandarin conversational contexts. Words to be learned were manipulated such that within-word (phonotactic) cues were specific to the target language or common to both languages. Target words therefore either lent themselves to one language or the other or were ambiguous in terms of the language to which they belonged. Children were taught words in English and Mandarin in a conversational context and then tested on their recognition of the same words and of tone variants of taught words in each language. Results demonstrated that 3–4-year-old children recognized the words they were taught when the words matched in tone, both in English and in Mandarin. However, when the words did not match in pitch (English) or tone (Mandarin), children did not demonstrate recognition of these words. Moreover, this pattern of results was observed whether or not children received leading phonotactic cues to the target language. When 4–5-year-old children were presented with words with no leading phonotactic cues, they likewise rejected pitch variants as lexical equivalents in English as well as tone variants in Mandarin Chinese, similar to 3–4-year-old children. It was only when 4–5-year-old children were presented with leading phonotactic cues to the target language that they were able to demonstrate a language-dependent sensitivity to tones, integrating tone changes in Mandarin but disregarding the same changes in English. Although Mandarin learners bind tones to newly learned words as infants, some aspects of tone interpretation, such as learning words via conversation, take time to mature.


Studies investigating children’s sensitivity to lexical tones when learning new words reveal four important findings. First, infants demonstrate an early sensitivity to lexical tones whether they are learning a tone language or not (Hay et al., 2015; Singh et al., 2015). It is only at 17 months (in habituation-based tasks without referential support) and 24 months (in preferential looking tasks with referential support) that Mandarin learning infants demonstrate a language-specific sensitivity to tones that is not shared by their non-tone learning peers. Second, as with production, sensitivity to lexical tones varies depending on the tone used: different tone pairings elicit different sensitivities in novel word learning (Burnham et al., 2018; Ma et al., 2017; Singh et al., 2016). Third, bilingual tone processing may present language learning opportunities, such as precocious integration of tones relative to monolingual Mandarin learning infants (Singh et al., 2016). Fourth, some of the challenges of learning two languages in which the functions of pitch differ may be more evident in the preschool years. At this stage, a language-selective integration of tones may be challenging for children (Singh & Quam, 2016), although this is a tentative claim given that there is no monolingual backdrop against which to evaluate the bilingual data obtained by Singh and Quam (2016). In addition to mapping novel words to meaning, language learners must rapidly recognize words that they already know in sentential contexts. The ability to do so is a strong predictor of concurrent and later language abilities (Marchman & Fernald, 2008). The following section discusses children’s abilities to understand familiar words in Mandarin with specific attention to differences between monolingual and bilingual learners of Mandarin.

11.4 Familiar Word Recognition in Mandarin Chinese

In a first attempt to determine the accuracy with which Mandarin learners recognize known words, Singh, Goh, and Wewalaarachchi (2015) tested Mandarin learning toddlers and preschoolers on spoken word recognition via a preferential looking paradigm. Similar to prior instantiations of this paradigm (e.g., Mani & Plunkett, 2007, 2011; Swingley & Aslin, 2002; White & Morgan, 2008), participants were presented with pairs of visual objects. After participants had viewed the pair of objects for some time, one of the objects—presumed to be familiar to infants—was labeled in Mandarin. Proportionate fixation time to the target object versus the unlabeled object (distractor object) was tracked before and after hearing the label. A statistically significant increase in fixation to the target object upon hearing its label serves as evidence of word recognition. No significant increase in fixation to the target upon hearing its label serves as evidence of rejecting the label as a name for the target object. Occasionally, a third pattern of results surfaces: Participants preferentially fixate the distractor object upon hearing an auditory label. When distractor objects are unfamiliar to participants, a distractor preference is interpreted as evidence that the participant may have formed a new association between the auditory label and the distractor object. This pattern of results is often interpreted as more convincing
evidence that the mispronunciation has been definitively rejected and mapped to a different object. Participants were presented with several trials. On half of the trials, the target was correctly labeled, while on the other half, the target was labeled with a mispronunciation caused by a consonant, vowel, or tone substitution. Participants’ abilities to accurately recognize correctly produced words and to reject incorrect pronunciations were investigated. Participants were tested in two age groups: 3 years of age and 4.5 years of age. Analyses revealed that all participants recognized correctly produced Mandarin words, preferentially fixating visual targets upon hearing them accurately labeled. Participants did not preferentially fixate visual targets upon hearing their labels mispronounced. However, responses varied markedly across vowel, consonant, and tone substitutions within each age group. In the younger age group (3 years), children demonstrated similar responses to vowel and consonant substitutions: they did not preferentially fixate target or distractor. In contrast, when hearing tone mispronunciations, participants preferentially fixated the distractor object, suggesting that sensitivity to tones was comparatively high relative to vowels and consonants. However, at 4.5 years of age, children demonstrated very different results, expressing distractor preferences when hearing vowel and consonant substitutions. In contrast, participants demonstrated no preference for target or distractor objects when presented with tone mispronunciations. In combination, these findings suggest that responses to vowel and consonant variation were similar to one another at both ages and, furthermore, were dissociable from tone sensitivity at both ages.
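The fixation measures described above can be illustrated with a short sketch. This is a hedged, hypothetical example, not the authors’ analysis code: the function names, fixation durations, and the fixed decision margin are invented for illustration, and real analyses test significance across many trials and participants rather than applying a threshold to a single trial.

```python
# Toy illustration of the preferential looking measure: proportion of
# target looking (PTL) before vs. after the label, and the three response
# patterns described in the text. All names and values are hypothetical.

def proportion_target_looking(target_ms, distractor_ms):
    """Proportion of object-directed fixation time spent on the target."""
    total = target_ms + distractor_ms
    return target_ms / total if total else 0.0

def classify_response(ptl_post, ptl_baseline, margin=0.05):
    """Compare post-label PTL to the pre-label baseline.

    A reliable rise suggests word recognition; a reliable drop (distractor
    preference) suggests the label was robustly rejected and possibly mapped
    to the distractor. The fixed margin stands in for inferential statistics.
    """
    if ptl_post > ptl_baseline + margin:
        return "target preference (recognition)"
    if ptl_post < ptl_baseline - margin:
        return "distractor preference (robust rejection)"
    return "no preference"

# One hypothetical toddler: correct label vs. tone mispronunciation
baseline = proportion_target_looking(1200, 1200)  # 0.5 before labeling
correct = proportion_target_looking(1800, 600)    # 0.75 after a correct label
mispron = proportion_target_looking(700, 1700)    # ~0.29 after a tone substitution
print(classify_response(correct, baseline))  # target preference (recognition)
print(classify_response(mispron, baseline))  # distractor preference (robust rejection)
```

Under this toy scheme, the "no preference" outcome corresponds to the null pattern reported for 3-year-olds hearing vowel and consonant substitutions, while the distractor preference corresponds to the stronger rejection pattern.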
While vowel and consonant sensitivity appeared to increase in older versus younger children, as reflected by a movement from no target/distractor preference to a distractor preference when hearing a mispronounced label, tone sensitivity appeared to attenuate over the same period. Wewalaarachchi and Singh (submitted) have pursued this line of inquiry in older children, ranging from 5 to 6 years of age, demonstrating that tone sensitivity continues to decrease with age, such that 6-year-old children were found to treat tone substitutions equivalently to correct pronunciations. This parallels the age-related decline in tone sensitivity reported by Ma et al. (2017) in novel word learning. Such a decline has not been observed for vowels and consonants, either in the present study or in Ma et al. (2017), who manipulated vowels as well as tones. It should be noted that the study reported by Singh et al. (2015) incorporated three Mandarin tones: 1, 2, and 4. Tone 3 (falling–rising tone) was not incorporated, as it maintains a more variable form in Mandarin than Tones 1, 2, and 4 due to tone sandhi rules. Singh, Tan, and Wewalaarachchi (2017) investigated effects of salient tone mispronunciations (substitutions between Tones 1 and 4) as well as effects of subtle tone mispronunciations (substitutions between Tones 2 and 3) on word recognition in 3-year-old participants. Results revealed that children were highly sensitive to salient mispronunciations, as indicated by Singh et al. (2015). However, they were insensitive to subtle mispronunciations, responding to substitutions of Tones 2 and 3 as they did to correctly produced words. Similar difficulties with Tones 2 and 3 were reported in toddlers tested in familiar word recognition by Shi et al. (2017). This
finding informs conclusions drawn from previous studies with younger participants, suggesting that high tone sensitivity is not observed for the entire Mandarin tone inventory and that, in fact, toddlers can be largely insensitive to subtle tone changes such as Tone 2 to Tone 3. It should be noted that the participants tested in Singh et al. (2015) were bilingual learners of English and Mandarin. It remains unclear whether the observed age-based decline in tone sensitivity could be attributable to learning a non-tone language concurrently with a tone language. This question is informed by the results of a similar study by Ma et al. (2017), which revealed that Mandarin monolingual children were not sensitive to a range of tone substitutions at 3 years of age, responding to tone mispronunciations as if they were correct pronunciations. In contrast, the same children were sensitive to vowel substitutions. This finding suggests that sensitivity to tone variation may attenuate with age, whereas sensitivity to vowels and consonants may be more stable over development. In a systematic comparison of monolingual (Mandarin) and bilingual (English–Mandarin) learners, Wewalaarachchi, Wong, and Singh (2017) compared 2-year-old children on their sensitivity to tone, vowel, and consonant variation using a preferential looking paradigm. Wewalaarachchi et al. (2017) reported both similarities and differences between monolingual and bilingual learners. Both groups were similar in correctly accepting accurate labels as referring to familiar targets; however, monolingual infants demonstrated more rapid recognition of correctly produced words than bilingual infants. Both groups were also similar in rejecting consonant, vowel, and tone mispronunciations as incorrect labels.
However, the relative priority assigned to each type of mispronunciation in terms of processing efficiency (speed of recognition or mis-recognition of the target) differed by group: monolingual Mandarin learning infants demonstrated the least sensitivity to consonants, followed by vowels and tones. In contrast, bilingual infants demonstrated the least sensitivity to tones, followed by consonants and then by vowels. This pattern of results suggests that while both groups are similarly accurate in recognizing correct pronunciations and rejecting incorrect productions of familiar words, they differ in the efficiency with which they do so. Moreover, the relative processing constraints associated with vowel, consonant, and tone variation differed between monolingual and bilingual learners, with each group demonstrating a different ordering of the processing costs arising from variation in vowels, consonants, and tones. Thus far, studies on spoken word recognition have focused on words presented in citation form. However, in natural speech, words occur predominantly in the context of clauses, phrases, and sentences. The context within which words occur can alter their physical form, leaving it incumbent on listeners to recover the underlying phonological structure. Research with Mandarin-speaking children shows that tone sensitivity is quite heavily influenced by the word context within which tones occur (e.g., Wong & Strange, 2017). Instability in the form that words assume can pose a challenge to listeners. This challenge, often termed the ‘variability problem,’ has been reasonably well studied in learners of English and other Western non-tone languages (e.g., Houston & Jusczyk, 2000; Schmale, Cristia, Seidl, & Johnson, 2010; Singh, 2008; Skoruppa, Mani, & Peperkamp, 2013), although less so in learners of Mandarin
Chinese. Mandarin Chinese presents with some unique sources of variability, two of which will be discussed here: tone sandhi and tone-intonation relationships. First, Mandarin, like English, is associated with morphophonemic changes whereby words change their surface form in response to phonological context. A prime example of this in Mandarin Chinese is tone sandhi. According to tone sandhi rules, whole-tone substitutions can occur in a context-conditioned manner. According to the Tone 3 sandhi rule, when two Tone 3 syllables co-occur, the first syllable alternates to Tone 2, resulting in a Tone 2-Tone 3 disyllabic sequence in place of a Tone 3-Tone 3 disyllabic sequence. Learners therefore have to appreciate that the first syllable in such a sequence bears Tone 3 and has undergone a phonological alternation. In addition, learners have to distinguish the alternating (post-sandhi) form from a disyllable whose base form is a Tone 2-Tone 3 sequence (non-sandhi form). For example, the Tone 3 sandhi rule prescribes that the phrase /fən(214) tʂʰaŋ(214)/ (‘flour mill’) is obligatorily modified such that the first syllable alternates to [fən(35) tʂʰaŋ(214)], while preserving the original meaning of the word. However, given that /fən(35) tʂʰaŋ(214)/ means ‘graveyard’ (坟场), this tonal alternation creates a potential lexical ambiguity (Chen, 2000). Studies investigating children’s production of sandhi forms suggest that children do not systematically produce sandhi forms over the first 5 years of life (Chen, Wang, Shu, Wu, & Li, 2010; Wang, 2011). Children demonstrate evidence of reliably producing sandhi forms at 6 years of age (Wang, 2011). In a word recognition study, Wewalaarachchi and Singh (2016) investigated whether children demonstrate receptive knowledge of sandhi forms. In this study, 3- to 5-year-old Mandarin learning children were presented with familiar words in a paradigm similar to that of Singh et al. (2015), described earlier in this section.
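The Tone 3 sandhi rule just described can be sketched as a simple rewrite over tone sequences. This is an illustrative toy, not a full model of Mandarin sandhi: real sandhi application also depends on prosodic and syntactic structure, and the function name and list representation are invented for this example.

```python
# Toy sketch of the Tone 3 sandhi rule: a Tone 3 syllable immediately
# followed by another Tone 3 surfaces as Tone 2. Applied left to right
# within a word; longer sequences are handled naively here.

def apply_tone3_sandhi(tones):
    """Map underlying tone numbers to surface tones, e.g. [3, 3] -> [2, 3]."""
    surface = list(tones)
    for i in range(len(surface) - 1):
        if surface[i] == 3 and surface[i + 1] == 3:
            surface[i] = 2
    return surface

# 'flour mill' (underlying Tone 3-Tone 3) surfaces as Tone 2-Tone 3,
# homophonous on the surface with underlying Tone 2-Tone 3 'graveyard'
print(apply_tone3_sandhi([3, 3]))  # [2, 3]
print(apply_tone3_sandhi([2, 3]))  # [2, 3]  (non-sandhi base form, unchanged)
```

The two print lines make the learner's problem concrete: the post-sandhi form and the non-sandhi base form are identical on the surface, so the underlying tone must be recovered from context.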
Children were presented with 24 trials belonging to four possible trial types: correctly produced disyllables that were non-sandhi forms, all of which were Tone 2-Tone 1 disyllables; ‘garden-variety’ mispronounced forms that were not sandhi forms, all of which were Tone 2-Tone 1 sequences mispronounced as Tone 3-Tone 1 sequences; sandhi forms that had undergone the Tone 3 sandhi alternation (Tone 2-Tone 3 sequences); and pre-sandhi forms that had not undergone the prescribed alternation (i.e., Tone 3-Tone 3 sequences). Children were presented with familiar objects labeled in the four ways articulated above. As before, word recognition was measured via accuracy of fixation to visual targets. In addition, the time course of word recognition was charted. Results revealed that children reliably recognized correctly produced forms, preferentially fixating visual targets when they were correctly labeled. Children also reliably recognized post-sandhi forms that had undergone the correct alternation. Children did not fixate visual targets upon hearing ‘garden-variety’ (non-sandhi) mispronunciations. They also did not fixate visual targets upon hearing pre-sandhi forms. These findings suggest that although children’s productive mastery of sandhi forms may remain fragile through the preschool years, their comprehension processes reflect an ability to distinguish words based on sandhi alternations. The analyses above focused on accuracy of spoken word recognition, drawing from analyses of target preferences upon hearing sandhi and non-sandhi forms. However, a more detailed analysis of the time course of target selection revealed a comparatively nuanced picture. Charting the proportion of fixations to the target
from the distractor object over time after hearing auditory labels affords insight into temporal constraints on children’s lexical selections. These analyses revealed that children’s eye movements to the target were slightly weaker for post-sandhi forms versus correct pronunciations late in the processing window (1400–2400 ms after the onset of the target word), revealing processing costs linked to sandhi forms relative to non-sandhi forms. A similar analysis was performed for mispronunciations. It should be noted that upon hearing mispronounced forms, children should not fixate the target object; doing so reflects erroneous mappings between the auditory label and the visual target. Late in the processing window (1600–2200 ms after the onset of the target word), children demonstrated slightly reduced target fixations (i.e., fewer false alarms to the mispronunciations) when hearing generic mispronunciations relative to pre-sandhi forms. This suggests that generic mispronunciations were more robustly rejected as labels for the target word than pre-sandhi forms. These findings add to the accuracy analyses by suggesting that there may be temporal processing costs for sandhi forms relative to non-sandhi forms. Nevertheless, in the aggregate, children appear to demonstrate faithful comprehension of sandhi forms by 5 years of age, although it must be acknowledged that this does not demonstrate knowledge of sandhi rules. Another factor that could conceivably complicate word recognition in Mandarin is tone-intonation correspondence. Every language uses pitch variation toward a variety of non-lexical ends, such as communicating vocal affect (Lieberman, 1967), placing stress (Fernald & Mazzie, 1991), and distinguishing communicative intent, such as questions versus statements (van Heuven & Haan, 2002).
Mandarin Chinese is no exception: Communicative intent, such as question versus statement forms, is reliably distinguished by pitch variation (Ho, 1977; Yuan, 2004, 2006; Zeng, Martin, & Boulakia, 2004). It is therefore incumbent upon learners to control for intonational variation to arrive at lexical tones and vice versa. Studies with Mandarin speaking adults have demonstrated that adult judgments of communicative intent are indeed taxed by co-occurring tone cues (Yuan, 2004). Specifically, adults encountered particular difficulty identifying question forms when sentences contained rising tones (i.e., Tone 2), a finding interpreted as prioritization of lexical functions over intonational functions of pitch in language processing. In a similar investigation with children, Singh and Chee (2016) tested 3–4- and 4–5-year-old Mandarin learning children on their recognition of familiar words in a preferential looking paradigm similar to that used in Singh et al. (2015). In this study, children were presented with familiar words marked by rising tones, in rising intonation (i.e., question forms) as well as in falling intonation (i.e., statement forms). They were also presented with words marked by falling tones in rising intonation (question forms) as well as in falling intonation (statement forms). Acoustic profiles of each tone are described in Singh and Chee (2016). However, it should be noted that intonational variation did not result in speakers crossing a tone boundary; rather, intonational variation simply altered properties of the pitch contour of the target words in a more subtle fashion that did not cause adult listeners to misidentify the tone.
Results demonstrated that younger children, at 3–4 years of age, only recognized familiar words when pitch cues to tone and intonation converged. In other words, they only recognized Tone 2 (rising tone) words in question forms and Tone 4 (falling tone) words in statement forms. They did not recognize Tone 2 words in statement forms, nor did they recognize Tone 4 words in question forms. In contrast, by 4–5 years of age, children recognized familiar words in Tone 2 and Tone 4 in both rising and falling intonation, suggesting that by this point, their interpretation of lexical tones was not contingent upon convergent intonational cues. This study suggests that while tone interpretation on the part of Mandarin learning children is faithful and accurate in infancy (e.g., Singh et al., 2015, 2016), it appears to be more limited when intonational variation changes the realization of specific tones.

11.5 Models of Early Language Development: Where Does Tone Fit?

As it stands, prevailing models of early speech perception and language development, such as PRIMIR (Werker & Curtin, 2005) and PAM (Best, 1994), do not readily account for lexical tones. In some sense, these models may be challenged by some research findings on tone acquisition, such as reports of perceptual facilitation for tone in non-tone language learners or findings that toddlers integrate tones into newly learned words that they fail to discriminate several months earlier. There have been recent efforts to explore the extent to which current developmental models of speech perception account for tones (see Curtin & Werker, 2018; Reid et al., 2015), as well as adult models of speech perception [see T-TRACE by Tong, McBride, and Burnham (2014) or COHORT on lexical tones by Zhou and Marslen-Wilson (1994)]. However, these models await empirical evidence to fully determine whether they capture the processing of lexical tones as well as that of vowels and consonants. Models must also consider the role of tone in atypical populations, given that tone production and perception can be impacted by language disorders that affect prosodic sensitivity (see Chap. 13).

11.6 Conclusions

To summarize, infants and children appear to make gradual and incremental progress in their understanding and interpretation of Mandarin phonology. As early as 11 months, infants demonstrate a language-specific interpretation of Mandarin tones, even when learning a non-tone language concurrently. Just 1–2 months later, bilingual infants learning English and Mandarin correctly and selectively bind lexical tones to meaning when learning new words in Mandarin. Later, at 24 months of age, when
word learning is well underway, infants demonstrate a clear appreciation of the consequences of tone, vowel, and consonant mispronunciations when recognizing familiar words, whether they are learning Mandarin monolingually or in conjunction with a non-tone language such as English. The time course of spoken word recognition differs in subtle ways between monolingual and bilingual learners, but both groups demonstrate a robust sensitivity to mispronounced forms. At 3–5 years of age, children demonstrate an awareness of context-driven changes, specifically of tone sandhi rules, evidenced by correct recognition of legally alternating forms and correct rejection of pre-sandhi forms as acceptable labels for known objects. Finally, between 4 and 5 years of age, children demonstrate a robust ability to recognize tone-bearing words corresponding to words they know, regardless of their intonational context. This brief chronicle suggests that tones appear early in children’s production, leading to the argument that they are the first phonological constituent to which infants are sensitive in perception (Yeung, Chen, & Werker, 2013) and production (Clumeck, 1980). However, the refinement and maturation of tone categories, at least in Mandarin, appear to take several additional years. Somewhat paradoxically, studies of tone interpretation reveal a decline in sensitivity to lexical tones over time (Ma et al., 2017; Singh et al., 2015; Wewalaarachchi & Singh, submitted). This pattern of results has not been observed with vowels and consonants. The scope and longevity of this decline in sensitivity, as well as the possible means by which older children compensate for reduced tone sensitivity to arrive at correct semantic interpretations, remain to be determined. In addition to charting the development of Mandarin in early lexical processing, an additional goal of this review was to provide a comparison of monolingual and bilingual learners of tone languages.
Although systematic comparisons are quite rare, bilingual learners of Mandarin appear to develop knowledge of Mandarin phonology at a similar pace to their monolingual peers, with some evidence of bilingual facilitation in tone interpretation (Singh et al., 2016). There is also evidence of bilingual processing costs in familiar word recognition (Wewalaarachchi et al., 2017) in the form of reduced processing efficiency, consistent with a larger body of studies with bilingual learners demonstrating reduced efficiency in lexical access (Gollan, Fennema-Notestine, Montoya, & Jernigan, 2007; Kaushanskaya & Marian, 2007; Roberts, Garcia, Desrochers, & Hernandez, 2002). In large part, however, bilingual and monolingual learners of Mandarin Chinese appear to demonstrate comparable abilities in building and accessing a Mandarin lexicon. To conclude, there are distinctive elements of Mandarin that warrant systematic investigation of Mandarin acquisition as a complement to the vast body of research conducted on the acquisition of Indo-European languages such as English, French, and Spanish. The presence of a tone system and different links between vowels/consonants and the lexicon (see Wiener & Turnbull, 2016) serve as distinguishing properties of Mandarin compared with English that may lead us to hypothesize a course of language acquisition distinct from that charted for English. Empirical research conducted in each of these areas suggests that language acquisition may operate under different constraints for Mandarin as compared to English. Continued efforts to understand language-specific pathways to proficiency are integral to the
development and refinement of models and theories of early language development. Such models and theories often promise to describe universals in development and do not limit their putative scope to the specific language communities from which they draw participants. Further research aimed at expanding the evidence base on the early acquisition of Mandarin and of other language families beyond Romance and Germanic could potentially expand existing and future models of early language acquisition in significant ways.

References

Ballem, K., & Plunkett, K. (2005). Phonological specificity in children at 1;2. Journal of Child Language, 32(1), 159–173. https://doi.org/10.1017/S0305000904006567
Bergmann, C., & Cristia, A. (2016). Development of infants' segmentation of words from native speech: A meta-analytic approach. Developmental Science, 19(6), 901–917. https://doi.org/10.1111/desc.12341
Best, C. T. (1994). The emergence of native-language phonological influences in infants: A perceptual assimilation model. In H. C. Nusbaum (Ed.), The development of speech perception: The transition from speech sounds to spoken words. Cambridge, MA: MIT Press.
Bolinger, D. L. (1978). Intonation across languages. In J. Greenberg (Ed.), Universals of human language (pp. 471–524). Stanford: Stanford University Press.
Bonatti, L. L., Peña, M., Nespor, M., & Mehler, J. (2005). Linguistic constraints on statistical computations: The role of consonants and vowels in continuous speech processing. Psychological Science, 16(6), 451–459. https://doi.org/10.1111/j.0956-7976.2005.01556.x
Burnham, D., Singh, L., Mattock, K., Woo, P. J., & Kalashnikova, M. (2018). Constraints on tone sensitivity in novel word learning by monolingual and bilingual infants: Tone properties are more influential than tone familiarity. Frontiers in Psychology, 8, 2190. https://doi.org/10.3389/fpsyg.2017.02190
Byers-Heinlein, K., & Werker, J. F. (2013). Lexicon structure and the disambiguation of novel words: Evidence from bilingual infants. Cognition, 128(3), 407–416.
Chen, M. Y. (2000). Tone sandhi: Patterns across Chinese dialects. Cambridge: Cambridge University Press.
Chen, C., Wang, M., Shu, H., Wu, H., & Li, C. C. (2010). Development of tone sensitivity in young Chinese children. In Speech Prosody 2010-Fifth International Conference.
Clumeck, H. (1980). The acquisition of tone. Child Phonology, 1, 257–275.
Curtin, S., Fennell, C., & Escudero, P. (2009). Weighting of vowel cues explains patterns of word–object associative learning. Developmental Science, 12(5), 725–731. https://doi.org/10.1111/j.1467-7687.2009.00814.x
Curtin, S., & Werker, J. F. (2018). PRIMIR on tone. Frontiers in Psychology. https://doi.org/10.3389/fpsyg.2018.01007
Cutler, A., & Mehler, J. (1993). The periodicity bias. Journal of Phonetics, 21, 103–108. Retrieved from https://hdl.handle.net/2066/15603
Dryer, M., & Haspelmath, M. (2013). The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology.
Fernald, A., & Mazzie, C. (1991). Prosody and focus in speech to infants and adults. Developmental Psychology, 27(2), 209–221. https://doi.org/10.1037/0012-1649.27.2.209
Floccia, C., Nazzi, T., Delle Luche, C., Poltrock, S., & Goslin, J. (2014). English-learning one- to two-year-olds do not show a consonant bias in word learning. Journal of Child Language, 41(5), 1085. https://doi.org/10.1017/S0305000913000287

11 Early Word Recognition and Word Learning in Mandarin …


Gollan, T. H., Fennema-Notestine, C., Montoya, R. I., & Jernigan, T. L. (2007). The bilingual effect on Boston Naming Test performance. Journal of the International Neuropsychological Society, 13(2), 197–208. https://doi.org/10.1017/S1355617707070038
Graf Estes, K., & Hay, J. F. (2015). Flexibility in bilingual infants' word learning. Child Development, 86(5), 1371–1385. https://doi.org/10.1111/cdev.12392
Havy, M., & Nazzi, T. (2009). Better processing of consonantal over vocalic information in word learning at 16 months of age. Infancy, 14(4), 439–456. https://doi.org/10.1080/15250000902996532
Havy, M., Serres, J., & Nazzi, T. (2014). A consonant/vowel asymmetry in word-form processing: Evidence in childhood and in adulthood. Language and Speech, 57(2), 254–281. https://doi.org/10.1177/0023830913507693
Hay, J. F., Graf Estes, K., Wang, T., & Saffran, J. R. (2015). From flexibility to constraint: The contrastive use of lexical tone in early word learning. Child Development, 86(1), 10–22. https://doi.org/10.1111/cdev.12269
Ho, A. T. (1977). Intonation variation in a Mandarin sentence for three expressions: Interrogative, exclamatory and declarative. Phonetica, 34(6), 446–457. https://doi.org/10.1159/000259916
Højen, A., & Nazzi, T. (2016). Vowel bias in Danish word-learning: Processing biases are language-specific. Developmental Science, 19(1), 41–49. https://doi.org/10.1111/desc.12286
Houston, D. M., & Jusczyk, P. W. (2000). The role of talker-specific information in word segmentation by infants. Journal of Experimental Psychology: Human Perception and Performance, 26(5), 1570–1582. https://doi.org/10.1037/0096-1523.26.5.1570
Jusczyk, P. W., & Aslin, R. N. (1995). Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1), 1–23. https://doi.org/10.1006/cogp.1995.1010
Kaushanskaya, M., & Marian, V. (2007). Bilingual language processing and interference in bilinguals: Evidence from eye tracking and picture naming. Language Learning, 57(1), 119–163. https://doi.org/10.1111/j.1467-9922.2007.00401.x
Keidel, J. L., Jenison, R. L., Kluender, K. R., & Seidenberg, M. S. (2007). Does grammar constrain statistical learning? Commentary on Bonatti, Peña, Nespor, and Mehler (2005). Psychological Science, 18(10), 922–923. https://doi.org/10.1111/j.1467-9280.2007.02001.x
Law, F., & Edwards, J. (2015). Effects of vocabulary size on online lexical processing by preschoolers. Language Learning and Development, 11(4), 331–355. https://doi.org/10.1080/15475441.2014.961066
Lee, H., Rayner, K., & Pollatsek, A. (2002). The processing of consonants and vowels in reading: Evidence from the fast priming paradigm. Psychonomic Bulletin & Review, 9(4), 766–772. https://doi.org/10.3758/BF03196333
Lieberman, P. (1967). Intonation, perception, and language. Cambridge: M.I.T. Press.
Ma, W., Zhou, P., Singh, L., & Gao, L. (2017). Spoken word recognition in young tone language learners: Age-dependent effects of segmental and suprasegmental variation. Cognition, 159, 139–155. https://doi.org/10.1016/j.cognition.2016.11.011
Mani, N., & Plunkett, K. (2007). Phonological specificity of vowels and consonants in early lexical representations. Journal of Memory and Language, 57(2), 252–272. https://doi.org/10.1016/j.jml.2007.03.005
Mani, N., & Plunkett, K. (2008). Fourteen-month-olds pay attention to vowels in novel words. Developmental Science, 11(1), 53–59. https://doi.org/10.1111/j.1467-7687.2007.00645.x
Mani, N., & Plunkett, K. (2011). Does size matter? Subsegmental cues to vowel mispronunciation detection. Journal of Child Language, 38(3), 606–627. https://doi.org/10.1017/S0305000910000243
Marchman, V. A., & Fernald, A. (2008). Speed of word recognition and vocabulary knowledge in infancy predict cognitive and language outcomes in later childhood. Developmental Science, 11(3), F9–F16. https://doi.org/10.1111/j.1467-7687.2008.00671.x
Mayor, J., & Plunkett, K. (2011). A statistical estimate of infant and toddler vocabulary size from CDI analysis. Developmental Science, 14(4), 769–785. https://doi.org/10.1111/j.1467-7687.2010.01024.x


Minagawa, Y., Hakuno, Y., Kobayashi, A., Naoi, N., & Kojima, S. (2017). Infant word segmentation recruits the cerebral network of phonological short-term memory. Brain and Language, 170, 39–49. https://doi.org/10.1016/j.bandl.2017.03.005
Nazzi, T., & Bertoncini, J. (2009). Phonetic specificity in early lexical acquisition: New evidence from consonants in coda positions. Language and Speech, 52(4), 463–480. https://doi.org/10.1177/0023830909336584
Nazzi, T., Floccia, C., Moquet, B., & Butler, J. (2009). Bias for consonantal information over vocalic information in 30-month-olds: Cross-linguistic evidence from French and English. Journal of Experimental Child Psychology, 102(4), 522–537. https://doi.org/10.1016/j.jecp.2008.05.003
Newman, R., Ratner, N. B., Jusczyk, A. M., Jusczyk, P. W., & Dow, K. A. (2006). Infants' early ability to segment the conversational speech signal predicts later language development: A retrospective analysis. Developmental Psychology, 42(4), 643–655. https://doi.org/10.1037/0012-1649.42.4.643
Newman, R., Tsay, J., & Jusczyk, P. (2003). The development of speech segmentation abilities. Retrieved from https://hollich.psych.purdue.edu/Jusczyk/pdf/Develop.pdf
Petitto, L. A., Berens, M. S., Kovelman, I., Dubins, M. H., Jasinska, K., & Shalinsky, M. (2012). The "perceptual wedge hypothesis" as the basis for bilingual babies' phonetic processing advantage: New insights from fNIRS brain imaging. Brain and Language, 121(2), 130–143. https://doi.org/10.1016/j.bandl.2011.05.003
Polka, L., Orena, A. J., Sundara, M., & Worrall, J. (2017). Segmenting words from fluent speech during infancy—Challenges and opportunities in a bilingual context. Developmental Science, 20(1), e12419. https://doi.org/10.1111/desc.12419
Polka, L., & Sundara, M. (2003). Word segmentation in monolingual and bilingual infant learners of English and French. Retrieved from https://www.internationalphoneticassociation.org/icphsproceedings/ICPhS2003/papers/p15_1021.pdf
Polka, L., & Sundara, M. (2012). Word segmentation in monolingual infants acquiring Canadian English and Canadian French: Native language, cross-dialect, and cross-language comparisons. Infancy, 17(2), 198–232. https://doi.org/10.1111/j.1532-7078.2011.00075.x
Reid, A., Burnham, D., Kasisopa, B., Reilly, R., Attina, V., Rattanasone, N. X., & Best, C. T. (2015). Perceptual assimilation of lexical tone: The roles of language experience and visual information. Attention, Perception and Psychophysics, 77(2), 571–591. https://doi.org/10.3758/s13414-014-0791-3
Roberts, P. M., Garcia, L. J., Desrochers, A., & Hernandez, D. (2002). English performance of proficient bilingual adults on the Boston Naming Test. Aphasiology, 16(4–6), 635–645. https://doi.org/10.1080/02687030244000220
Schmale, R., Cristià, A., Seidl, A., & Johnson, E. K. (2010). Developmental changes in infants' ability to cope with dialect variation in word recognition: Infants' word recognition across dialects. Infancy, 15(6), 650–662. https://doi.org/10.1111/j.1532-7078.2010.00032.x
Shen, X. S., & Lin, M. (1991). A perceptual study of Mandarin tones 2 and 3. Language and Speech, 34(2), 145–156. https://doi.org/10.1177/002383099103400202
Shi, R., Gao, J., Achim, A., & Li, A. (2017). Perception and representation of lexical tones in native Mandarin-learning infants and toddlers. Frontiers in Psychology, 8, 1117. https://doi.org/10.3389/fpsyg.2017.01117
Singh, L. (2008). Influences of high and low variability on infant word recognition. Cognition, 106(2), 833–870. https://doi.org/10.1016/j.cognition.2007.05.002
Singh, L. (2017). He said, she said: Effects of bilingualism on cross-talker word recognition in infancy. Journal of Child Language, 1–13. https://doi.org/10.1017/S0305000917000186
Singh, L. (2018). Bilingual infants demonstrate advantages in learning words in a third language. Child Development, 89(4), e397–e413. https://doi.org/10.1111/cdev.12852
Singh, L., & Chee, M. (2016). Rise and fall: Effects of tone and intonation on spoken word recognition in early childhood. Journal of Phonetics, 55, 109–118. https://doi.org/10.1016/j.wocn.2015.12.005


Singh, L., & Foong, J. (2012). Influences of lexical tone and pitch on word recognition in bilingual infants. Cognition, 124(2), 128–142. https://doi.org/10.1016/j.cognition.2012.05.008
Singh, L., Goh, H. H., & Wewalaarachchi, T. D. (2015). Spoken word recognition in early childhood: Comparative effects of vowel, consonant and lexical tone variation. Cognition, 142, 1–11. https://doi.org/10.1016/j.cognition.2015.05.010
Singh, L., Tam, J. H., Chan, C., & Golinkoff, R. M. (2014). Influences of vowel and tone variation on emergent word knowledge: A cross-linguistic investigation. Developmental Science, 17(1), 94–109. https://doi.org/10.1111/desc.12097
Singh, L., Morgan, J. L., & White, K. S. (2004). Preference and processing: The role of speech affect in early spoken word recognition. Journal of Memory and Language, 51(2), 173–189. https://doi.org/10.1016/j.jml.2004.04.004
Singh, L., Poh, F., & Fu, C. (2016). Limits on monolingualism? A comparison of monolingual and bilingual infants' abilities to integrate lexical tone in novel word learning. Frontiers in Psychology, 7. https://doi.org/10.3389/fpsyg.2016.00667
Singh, L., & Quam, C. (2016). Can bilingual children turn one language off? Evidence from perceptual switching. Journal of Experimental Child Psychology, 147, 111–125. https://doi.org/10.1016/j.jecp.2016.03.006
Singh, L., Reznick, J. S., & Liang, X. H. (2012). Infant word segmentation and childhood vocabulary development: A longitudinal analysis. Developmental Science, 15(4), 482–495. https://doi.org/10.1111/j.1467-7687.2012.01141.x
Singh, L., Tan, A., & Wewalaarachchi, T. D. (2017). Lexical tone variation and spoken word recognition in preschool children: Effects of perceptual salience. Journal of Child Language, 44(4), 924–942. https://doi.org/10.1017/S0305000916000325
Singh, L., White, K. S., & Morgan, J. L. (2008). Building a word-form lexicon in the face of variable input: Influences of pitch and amplitude on early spoken word recognition. Language Learning and Development, 4(2), 157–178. https://doi.org/10.1080/15475440801922131
Skoruppa, K., Mani, N., & Peperkamp, S. (2013). Toddlers' processing of phonological alternations: Early compensation for assimilation in English and French. Child Development, 84(1), 313–330. https://doi.org/10.1111/j.1467-8624.2012.01845.x
Swingley, D., & Aslin, R. N. (2002). Lexical neighborhoods and the word-form representations of 14-month-olds. Psychological Science, 13(5), 480–484. https://doi.org/10.1111/1467-9280.00485
Tong, Y., Francis, A., & Gandour, J. (2008). Processing dependencies between segmental and suprasegmental features in Mandarin Chinese. Language and Cognitive Processes, 23(5), 689–708. https://doi.org/10.1080/01690960701728261
Tong, X., McBride, C., & Burnham, D. (2014). Cues for lexical tone perception in children: Acoustic correlates and phonetic context effects. Journal of Speech, Language, and Hearing Research, 57(5), 1589–1605. https://doi.org/10.1044/2014_JSLHR-S-13-0145
Tsao, F. M. (2017). Perceptual improvement of lexical tones in infants: Effects of tone language experience. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.00558
Van Heuven, V. J., & Haan, J. (2002). Temporal distribution of interrogativity markers in Dutch: A perceptual study (pp. 61–86). Berlin, New York: Mouton de Gruyter. https://doi.org/10.1515/9783110197105.1.61
Wang, C. Y. (2011). Children's acquisition of Tone 3 sandhi in Mandarin.
Wang, Y., Spence, M. M., Jongman, A., & Sereno, J. A. (1999). Training American listeners to perceive Mandarin tones. The Journal of the Acoustical Society of America, 106(6), 3649–3658. https://doi.org/10.1121/1.428217
Wong, P., & Strange, W. (2017). Phonetic complexity affects children's Mandarin tone production accuracy in disyllabic words: A perceptual study. PLoS ONE, 12(8), e0182337.
Werker, J. F., & Curtin, S. (2005). PRIMIR: A developmental framework of infant speech processing. Language Learning and Development, 1(2), 197–234. https://doi.org/10.1207/s15473341lld0102_4


Wewalaarachchi, T. D., & Singh, L. (submitted). Relative sensitivity to vowel, consonant and tone variation at 6 years of age.
Wewalaarachchi, T. D., & Singh, L. (2016). Effects of suprasegmental phonological alternations on early word recognition: Evidence from tone sandhi. Frontiers in Psychology, 7, 627. https://doi.org/10.3389/fpsyg.2016.00627
Wewalaarachchi, T. D., Wong, L. H., & Singh, L. (2017). Vowels, consonants, and lexical tones: Sensitivity to phonological variation in monolingual Mandarin and bilingual English-Mandarin toddlers. Journal of Experimental Child Psychology, 159, 16. https://doi.org/10.1016/j.jecp.2017.01.009
White, K. S., & Morgan, J. L. (2008). Sub-segmental detail in early lexical representations. Journal of Memory and Language, 59(1), 114–132. https://doi.org/10.1016/j.jml.2008.03.001
Wiener, S., & Turnbull, R. (2016). Constraints of tones, vowels and consonants on lexical selection in Mandarin Chinese. Language and Speech, 59(1), 59–82. https://doi.org/10.1177/0023830915578000
Wong, P., Schwartz, R., & Jenkins, J. (2005). Perception and production of lexical tones by 3-year-old, Mandarin-speaking children. Journal of Speech, Language, and Hearing Research, 48, 1065–1079. https://doi.org/10.1044/1092-4388(2005/074)
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68(2), 123–139. https://doi.org/10.1016/j.jml.2012.09.004
Yuan, J. (2004). Perception of Mandarin intonation. Paper presented at the 2004 International Symposium on Chinese Spoken Language Processing (pp. 45–48). IEEE. https://doi.org/10.1109/CHINSL.2004.1409582
Yuan, J. (2006). Mechanisms of question intonation in Mandarin (pp. 19–30). Berlin, Heidelberg: Springer. https://doi.org/10.1007/11939993_7
Zeng, X. L., Martin, P., & Boulakia, G. (2004). Tones and intonation in declarative and interrogative sentences in Mandarin. In International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages.
Zhou, X., & Marslen-Wilson, W. (1994). Words, morphemes and syllables in the Chinese mental lexicon. Language and Cognitive Processes, 9(3), 393–422. https://doi.org/10.1080/01690969408402125

Chapter 12

Speech Development in Mandarin-Speaking Children Gang Peng and Fei Chen

Abstract The unique ability to communicate through speech clearly distinguishes human beings from all other animals. Children start to produce their first words by the age of one; by four, most children have developed the ability to use their native language; by six or seven, they become veteran users of their native language. Many studies have focused on the developmental trajectories of one or two types of phonetic units of speech (e.g., consonants, vowels, or tones), but a more comprehensive picture of speech development is still lacking. Questions deserving further investigation include: What are the order and rate of acquisition of various phonetic units? What is the possible driving force underlying the developmental order? How can we ascertain when children have obtained the same speech competence as adults? In this chapter, we will first review the literature on the development of speech perception and production in Mandarin-speaking children. Then, we will discuss relevant issues, and suggest possible solutions to the unresolved questions.

12.1 Introduction

Children's language acquisition provides an opportunity for us to observe a language in its nascent state and to trace it through its many subsequent changes. In learning to communicate, children need to gain knowledge of the phonological forms of their mother tongue, and gradually acquire the perceptual discrimination and articulatory gestures required to perceive and produce these sounds in an adult-like manner. For more than 100 years (dating from the Sterns' diaries describing language use in infancy, 1907), there have been thousands of descriptive and experimental studies of children's speech development across different language backgrounds. Based on these findings, we can chart the normative pathways of speech development—the

G. Peng (B) · F. Chen
Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China
e-mail: [email protected]
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_12



age of acquisition of speech sounds (phonological acquisition)—and subsequently discuss fundamental theories of spoken language development. To date, research in the field of child language development has focused primarily on children's acquisition of Romance and Germanic languages. Not surprisingly, English has received the most attention. Some norms of the phonological acquisition of English-speaking children, including the developmental stages of phonemes and error patterns, have been studied extensively and are well established (e.g., Dodd et al., 2003; Prather et al., 1975; Stoel-Gammon & Dunn, 1985). To compare developmental patterns among children acquiring different languages, there has been a remarkable upsurge of interest in cross-linguistic studies of language acquisition (Slobin, 1985). Many studies have examined developmental universals and language-specific patterns in the development of children from various language backgrounds by investigating the order and rate of acquisition of phonemes. Among these languages, the peculiarities of Mandarin Chinese offer an excellent, perhaps unique, opportunity for the evaluation and expansion of theories of language and language acquisition. Nevertheless, the field of Mandarin acquisition (i.e., studies of how native Mandarin-speaking children acquire their native language) remains relatively underexplored. Furthermore, among existing research on the acquisition of Mandarin morphology, syntax, and the writing system, the acquisition of Mandarin phonology is perhaps the least explored (Hua, 2002). Modern Mandarin is a tonal language with a relatively simple syllable structure (see Fig. 12.1). Each syllable must be attached to one of the four lexical tones, which carry different lexical meanings (Wang, 1973). The expression of tones is superimposed on other phonetic units of speech, such as vowels and consonants.
Tones are typically instantiated on the vowels, but their realization interacts with surrounding consonants (Hombert, Ohala, & Ewan, 1979). The lexical tones and nuclear vowels are compulsory elements of Mandarin syllables, whereas onset and ending consonants are optional (Wang, 1973). It is expected that the developmental patterns of Mandarin-speaking children’s phonological acquisition reflect both universal tendencies and language-specific constraints.

Fig. 12.1 Diagram of the modern Mandarin syllable (elements in parentheses are optional) (adapted from Chen, Peng, Yan, & Wang, 2017)
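The syllable template just described (a lexical tone and a nuclear vowel are compulsory; the onset and ending consonants are optional) can be expressed as a small well-formedness check. The sketch below is illustrative only: the inventories are abbreviated placeholders rather than a complete Mandarin phonology, and the function name is invented for this example.

```python
# Illustrative sketch of the Mandarin syllable template in Fig. 12.1:
# tone and nuclear vowel are compulsory; onset and ending are optional.
# The inventories below are abbreviated placeholders, not a full phonology.

ONSETS = {"p", "t", "k", "m", "n", "s", "x"}   # small subset of the 21 initials
NUCLEI = {"a", "o", "e", "i", "u", "y"}         # nuclear vowels
ENDINGS = {"n", "ng", "i", "u"}                 # nasal codas or vowel offglides
TONES = {1, 2, 3, 4}                            # the four lexical tones

def is_well_formed(tone, nucleus, onset=None, ending=None):
    """A syllable needs a tone and a nucleus; onset/ending may be absent."""
    return (tone in TONES and nucleus in NUCLEI
            and (onset is None or onset in ONSETS)
            and (ending is None or ending in ENDINGS))

print(is_well_formed(1, "a", onset="m", ending="n"))  # full CVC syllable -> True
print(is_well_formed(4, "a"))                         # bare tonal vowel  -> True
print(is_well_formed(2, "a", onset="q"))              # unknown onset     -> False
```

The check mirrors the asymmetry the text describes: dropping the onset or ending still yields a valid syllable, while dropping the tone or the nucleus does not.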


Chao (1951) presented the earliest description of the phonological acquisition of Mandarin-speaking children. This study provided an analysis of the consonant, vowel, and tone repertoires of the phonological system of a twenty-eight-month-old girl who was acquiring Mandarin as her first language. Since then, many research papers and books have focused on the acquisition of speech sounds in Mandarin. The available information, though limited, suggests that Mandarin-speaking children's phonological development is influenced by the characteristics of Mandarin speech. This chapter will first summarize previous findings on the phonological acquisition of Mandarin, including the developmental trajectories of individual phonetic units of speech (e.g., consonants, vowels, and tones), before discussing theories of phonological acquisition that can account for the detailed evidence from Mandarin studies. Following this, we will discuss how to determine when children obtain the same speech competence as adults during the process of phonological maturation. In the final section, we will focus on unresolved issues and propose future research directions that could potentially inform and advance the field of Mandarin speech acquisition.

12.2 Developmental Trajectories of Phonetic Units of Speech

Previous behavioral studies, both cross-sectional and longitudinal, provide potential 'normative data' on phonological acquisition by Mandarin-speaking children, which can be used for cross-linguistic comparison and for the assessment of phonological disorders in Mandarin-speaking children (Dodd et al., 2003). However, significant discrepancies have been reported in the order and age of acquisition of Mandarin phonetic units. Conflicting results regarding the sequence and timeline by which children master speech sounds can be attributed to many factors, including but not limited to the criteria used for performance evaluation, the collection approach (cross-sectional or longitudinal research), the speech mode (spontaneous production or imitation), sample size, the age range of subjects, the living area of subjects, and the number of transcribers. Most concerns relate to methodological issues, particularly the criteria used. Importantly, the acquisition of speech sounds occurs gradually and progressively; it is not all or nothing (Olmsted, 1971). Since the acquisition of phonological production is a developmental continuum, ranging from an initial stage of being able to articulate a sound of a certain phoneme properly to a final stage of being able to articulate that phoneme systematically and accurately, the distinction between 'phonological emergence' and 'phonological stabilization' is crucial (Hua, 2002). According to Hua (2002), 'phonological emergence' determines when a child is able to articulate a certain phoneme: a phoneme is considered to have emerged when children of a certain age first produce a sound of it correctly at least once. 'Phonological stabilization' refers to when a child articulates a phoneme with a certain consistency. Since there is a certain amount of inconsistency in children's production, a criterion is needed

222

G. Peng and F. Chen

to determine the age of phonological stabilization. A phoneme is considered stable when the child produces it correctly on at least two out of three attempts. When 90% of the children in an age group achieve an accuracy rating of at least 66.7% (i.e., 2/3) for a phoneme, it is considered to be stabilized in that age group.
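These two thresholds can be made concrete with a small computation. The sketch below is illustrative only (the data and function names are invented, not taken from Hua, 2002): a phoneme counts as stable for one child at two out of three correct attempts, and as stabilized for an age group when at least 90% of its children reach that bar.

```python
# Illustrative sketch of the stabilization criterion described in the text.
# Data shapes and function names are hypothetical, not from Hua (2002).

def child_is_stable(correct, attempts):
    """A phoneme is stable for a child at >= 2/3 correct attempts."""
    return attempts > 0 and correct / attempts >= 2 / 3

def group_stabilized(children):
    """Stabilized for an age group when >= 90% of children are stable.

    `children` is a list of (correct, attempts) pairs, one per child.
    """
    stable = sum(child_is_stable(c, a) for c, a in children)
    return stable / len(children) >= 0.9

# Ten hypothetical children's (correct, attempts) scores for one phoneme:
group = [(3, 3), (2, 3), (2, 3), (3, 3), (2, 3),
         (2, 3), (3, 3), (2, 3), (2, 3), (1, 3)]
print(group_stabilized(group))  # 9 of 10 children reach 2/3 -> True
```

Note that both cut-offs are inclusive, matching the "at least two out of three" and "90% of the children" wording above.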

12.2.1 The Acquisition of Mandarin Consonants There are 22 consonants in Mandarin, of which 21 serve as syllable-initial consonants, and two as syllable-final consonants [ŋ, −n]. The consonant [n] can appear in both the syllable-initial and syllable-final positions, whereas the consonant [ŋ] occurs only in the coda position. Aspiration, a distinctive feature in Mandarin, is used to differentiate six minimal pairs: [p/ph , t/th , k/kh , ts/tsh , tʂ/tʂh , tɕ/tɕh ]. The places and manners of articulation of Mandarin consonants are listed in Table 12.1. Similar to English, consonant acquisition in Mandarin-speaking children has received more attention than other types of phonetic unit. Table 12.2 shows 13 wellcited studies on Mandarin consonant acquisition. Not surprisingly, the differences among them, especially in the criteria used, resulted in differences in the age of acquisition of Mandarin consonants. Specific findings of each study are summarized as follows. Chao (1951) provided a detailed description of the consonant acquisition of a girl who was acquiring Mandarin as her first language in the USA. At the age of 2;4, the child’s consonant inventory consisted of 11 phonemes, including three pairs of unaspirated and aspirated voiceless plosives [p, ph , t, th , k, kh ], two nasals [m, n], and three fricatives [f, s, x]. Given that it was based on one child only, few generalizations can be made from Chao’s study. Though Jeng (1979) used the terms ‘emergence’ and ‘stabilization,’ he did not define specific criteria. This study is an attempt to test the applicability of Jakobson’s Table 12.1 Consonants in Mandarin Manners places

Stop

Nasal

Unasp.

Asp.

Labial

p

ph

m

t

th

n

Vls

Labio-dental Alveolar

Affricate Vd

Lateral

Unasp.

Asp.

ts

tsh



tʂh



tɕh

f s ʂ

Retroflex

ɕ

Alveolo-palatal Velar

Fricative

k

kh

ŋ

ʐ

x

Note Unasp. = Unaspirated, Asp. = Aspirated, Vls = Voiceless, Vd = Voiced

l

Not clear

Consonant emergence Longitudinal research

Not clear

Consonant stabilization

Consonant emergence Cross-sectional research and stabilization (66.7% criteria)

Consonant emergence Longitudinal research and stabilization (66.7% criteria)

Consonant emergence Longitudinal research

Consonant emergence Longitudinal research and stabilization

Not clear

Jeng (1979)

Wu and Xu (1979)

Hsu (1987)

Shiu (1990)

Hua and Dodd (2000)

Hua (2002): Chap. 4

Cao (2003)

Si (2006)

Liu (2007)

Longitudinal research

Longitudinal research

Cross-sectional research

Longitudinal research

Consonant emergence Description at a certain age

Chao (1951)

Collection approach

Criteria used

Studies

Spontaneous production

Spontaneous production

Spontaneous production

Spontaneous production and imitation

Spontaneous production and imitation

Spontaneous production

Spontaneous production and imitation

Spontaneous production

Spontaneous production

Spontaneous production

Speech mode

10

1

1

4

129

2

28

5

2

1

Sample size

Table 12.2 An overview of studies on Mandarin consonant acquisition in chronological order

(1;6 ~ 1;11)–(1;10 ~ 2;3)

2;0–5;0

Birth–1;9

1;1.15–2;0.15; 1;0.0–2;0.15; 0;10.15–2;0.15; 1;2.0–1;8.0

1;6–4;6

0;9–3;0 and 0;7–2;4

1;0–6;0

Birth–3;0

0;2–1;8 and 1;3–2;7

2;4

Age range

Shanghai

Beijing

(continued)

Shandong province

Beijing

Beijing

Taiwan

Taiwan milieu

Not clear

Taiwan

USA

Living area

12 Speech Development in Mandarin-Speaking Children 223

Consonant emergence Cross-sectional research

Consonant emergence Longitudinal research and stabilization (90% criteria)

Chen and Kent (2010)

Zhong (2013)

Cross-sectional research

Not clear

Xie (2009)

Collection approach

Criteria used

Studies

Table 12.2 (continued)

Spontaneous production and imitation

Spontaneous production

Spontaneous production and imitation

Speech mode

1

24

149

Sample size

3;4–3;10

0;7–1;0 and 1;1–1;6

2;4–6;0

Age range

Tianjin

Taiwan

Shijiazhuang (Hebei Province)

Living area

224 G. Peng and F. Chen

12 Speech Development in Mandarin-Speaking Children Table 12.3 Age of emergence and stabilization of syllable-initial consonants as found by Hua and Dodd (2000)

Age

225

Consonant emergence th ,

Consonant stabilization ɕ

1;6–2;0

t,

2;1–2;6

f, s, tʂ

2;7–3;0

p, l

3;1–3;6

ph , kh ,

3;7–4;0

ʂ

ph

4;1–4;6

ts, tsh , ʐ

l, s, ʐ, tɕ, tɕh

> 4;6

k, m, n, x, tɕ,

tɕh ,

t, m n p, th , f, x, ɕ

tʂh

k, kh

ʂ, tʂ, tʂh , ts, tsh

laws of irreversible solidarity¹ in the acquisition of Mandarin phonology. Two boys, observed at ages 0;2–1;8 and 1;3–2;7 while acquiring Mandarin in Taiwan, provided the speech data. Jeng found that the consonants acquired earliest were [p, t, k, ts], followed by nasals, aspirated stops, fricatives (except [f]), the approximant [ʐ], and finally [f]. Wu and Xu (1979) longitudinally analyzed the speech sounds of five children over three years from birth. The results indicated that the first two consonants to emerge, before the age of three months, were the glottal [h] and the nasal [m]; the stops [p], [t], [k], the nasal [n], and the fricative [f] emerged after three months. Hsu (1987) conducted a small-scale cross-sectional study examining the phonological development of 28 children aged 1;0–6;0 who were acquiring Mandarin Chinese in Taiwan; two-thirds of the parents spoke Taiwanese as their first language, and the others spoke various Chinese dialects. Among the consonants acquired before 20 months were [p, m, t, k], while the children aged 4;4–6;0 still made errors with the consonants [f, n, tʂ, tʂʰ, ʂ, ʐ, ts, tsʰ]. Shiu (1990) analyzed the phonological development of a boy aged 1;0–3;0 and a girl aged 0;7–2;4, using phonetic accuracy and consistency in the children's realizations as explicit criteria for acquisition; the early acquisition of [p] and [m] was found to be followed by the establishment of the labial-dental contrasts between [p] and [t], and between [m] and [n].

Table 12.3 summarizes the ages of emergence and stabilization of Mandarin syllable-initial consonants found in Hua and Dodd's cross-sectional study (Hua & Dodd, 2000). By 4;6, 90% of the children were able to articulate all 21 syllable-initial consonants. Among the first sounds produced by 90% of the children were the nasals, the alveolar stops, the alveolo-palatal fricative and affricates, and the velar stops and fricative. The velar fricative [x] and the three alveolo-palatals ([tɕ, tɕʰ, ɕ]) emerged very early in the children's speech, while the two alveolar affricates ([ts, tsʰ]) and the retroflex [ʐ] appeared last.

¹ The laws of irreversible solidarity were systematically proposed by Jakobson (1941/1968). The theory connects the order of children's phonological acquisition to the distribution of phonological features among the world's languages; see Sect. 12.3 of this chapter for details.

226

G. Peng and F. Chen

The stabilization of syllable-initial consonants in Mandarin phonology can be classified into three groups: (1) phonemes which stabilized as soon as the children were able to articulate them (e.g., [t, m, p]); (2) phonemes which took a relatively short period to stabilize after the children were able to articulate them (e.g., [n, f, x]); and (3) phonemes which took a long time to stabilize after the children were able to articulate them (e.g., [tɕ, tɕʰ, s]).

Hua (2002) longitudinally examined the patterns of acquisition of Mandarin consonants among four subjects, with a particular focus on those consonants acquired before the age of two. By the end of the data collection period (24 months old), the syllable-initial consonants [p, t, m] and the syllable-final consonants [-n, -ŋ] had stabilized in the speech of all the children. There were variations in the emergence of sounds: while two subjects produced all the Mandarin consonants, one never used the sounds [pʰ, kʰ, ʂ, tɕʰ, tʂʰ, tsʰ], and another never produced the sounds [pʰ, tʰ, kʰ, ʐ, tɕʰ, tʂʰ, tsʰ] by the end of his data collection (i.e., 1;8).

Cao (2003) described her own daughter's phonological development from birth to 21 months. She found that the development of Mandarin consonants was closely related to the place and manner of articulation. Development proceeded from front and back consonants toward middle ones: the front consonants [m, p] and the back consonant [ŋ] emerged earliest, while [ʐ] and [l] were the last to appear. The manner of articulation also influenced the emergence of consonants: nasals appeared first, then stops, fricatives, affricates, and lastly laterals. Voiceless consonants were initially realized as the corresponding voiced ones. Unaspirated consonants appeared earlier than aspirated ones, and the aspirated consonants developed from weaker to stronger aspiration.
Si (2006) made a detailed observation of her daughter's phonological acquisition from two to five years of age. The ages of emergence and stabilization of all Mandarin consonants found by Si are shown in Table 12.4. Unaspirated sounds were acquired earlier than the corresponding aspirated ones. Her daughter acquired all the stops by three years of age, the alveolo-palatals by 3;6, and the nasals and lateral by four years old. Among the six fricatives, [ɕ] was the first to be acquired, earlier than [x], both of which were acquired before 3;6. Among the six affricates, the first acquired sound was the unaspirated alveolo-palatal affricate [tɕ], which emerged almost at the same time as the velar stops [k, kʰ]. The labial [m] was the earliest of the three nasals to be acquired, at approximately age two.

Table 12.4 Age of emergence and stabilization of Mandarin consonants found by Si (2006)

Age   Consonant emergence               Consonant stabilization
2;0   p, m, t, tʰ, n, ŋ, f, tɕ, ɕ, l    m, t
2;6   x, s, tɕʰ, -n, pʰ, k, kʰ, ts      p, pʰ, tʰ, ɕ
3;0   tsʰ, tʂ, tʂʰ, ʂ, ʐ                tɕ, k, kʰ
3;6                                     tɕʰ, f, x
4;0                                     n, ŋ
4;6                                     l, -n
5;0                                     s, ʂ, ts, tsʰ, tʂ, tʂʰ, ʐ

12 Speech Development in Mandarin-Speaking Children

227

In Liu (2007), the front consonants were acquired relatively earlier than the back ones, and unaspirated consonants were acquired earlier than aspirated ones. Stops and affricates were acquired earlier than fricatives, and nasals earlier than laterals. Notably, the nasal coda [-n] was acquired later than the initial [n].

Xie (2009) investigated the acquisition order of syllable-initial consonants in Mandarin, based on a cross-sectional study of 149 children. Front stops were found to be acquired earlier than back ones, while there was no such regularity for fricatives and affricates. Unaspirated segments were acquired before their aspirated counterparts, and the feature of aspiration was acquired between 2;6 and 3;0. Stops were generally acquired earlier than fricatives, and most fricatives were acquired earlier than affricates, with the exception of [s]. Among the fricatives, [x] and [ɕ] were acquired earliest, followed by [f]. Regarding the place of articulation, the alveolo-palatal fricative and affricates were acquired earlier than their retroflex counterparts, and the retroflex fricative and affricates were acquired earlier than their alveolar counterparts.

In Chen and Kent (2010), the early development of consonantal production in infants learning Mandarin was studied during the transition from babbling (0;7–1;0) to producing first words (1;1–1;6). Consonantal development showed two universal patterns: labials and alveolars (including alveolars, retroflexes, and alveolo-palatals) occurred more frequently than velars, and nasals developed earlier than fricatives, affricates, and liquids. They also found two language-specific patterns in Mandarin: alveolars were more prominent than labials, and affricates developed early.
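Because Si's table pairs an emergence age and a stabilization age for each consonant, it can be encoded directly to quantify how many months stabilization lagged behind emergence. The dictionary encoding and helper function below are my own illustration, not part of Si (2006):

```python
# Ages (years, months) of emergence and stabilization from Si (2006), Table 12.4.
EMERGENCE = {
    "p": (2, 0), "m": (2, 0), "t": (2, 0), "tʰ": (2, 0), "n": (2, 0), "ŋ": (2, 0),
    "f": (2, 0), "tɕ": (2, 0), "ɕ": (2, 0), "l": (2, 0),
    "x": (2, 6), "s": (2, 6), "tɕʰ": (2, 6), "-n": (2, 6), "pʰ": (2, 6),
    "k": (2, 6), "kʰ": (2, 6), "ts": (2, 6),
    "tsʰ": (3, 0), "tʂ": (3, 0), "tʂʰ": (3, 0), "ʂ": (3, 0), "ʐ": (3, 0),
}
STABILIZATION = {
    "m": (2, 0), "t": (2, 0),
    "p": (2, 6), "pʰ": (2, 6), "tʰ": (2, 6), "ɕ": (2, 6),
    "tɕ": (3, 0), "k": (3, 0), "kʰ": (3, 0),
    "tɕʰ": (3, 6), "f": (3, 6), "x": (3, 6),
    "n": (4, 0), "ŋ": (4, 0),
    "l": (4, 6), "-n": (4, 6),
    "s": (5, 0), "ʂ": (5, 0), "ts": (5, 0), "tsʰ": (5, 0),
    "tʂ": (5, 0), "tʂʰ": (5, 0), "ʐ": (5, 0),
}

def months(age):
    """Convert a (years, months) age such as 4;6 into total months."""
    years, mos = age
    return 12 * years + mos

# Emergence-to-stabilization lag, in months, for each consonant.
lag_months = {c: months(STABILIZATION[c]) - months(EMERGENCE[c]) for c in STABILIZATION}

for consonant, lag in sorted(lag_months.items(), key=lambda kv: -kv[1])[:3]:
    print(consonant, lag)
```

On this encoding, [m] and [t] show zero lag, while [l], [ts], and [s] take 30 months from emergence to stabilization.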
Zhong (2013) investigated the syllable-initial consonant acquisition of Mandarin-speaking children aged between 3;4 and 3;10. During this period, stops were acquired in the order [p] > [pʰ] > [t] > [k] > [tʰ] > [kʰ] ('>' means 'is acquired before'); fricatives in the order [ɕ] > [f] > [x] > [s] > [ʂ]; affricates in the order [tɕ] > [tɕʰ] > [ts] = [tʂ] > [tsʰ] = [tʂʰ]; and the Mandarin sonorants in the order [m] > [l] > [n] > [ʐ].

In conclusion, a comparison of these studies reveals the following significant discrepancies. First, they differ greatly in the criteria for defining when a sound counts as acquired ('emerged' or 'stabilized'), which led to differences in the identified ages of acquisition; some studies did not even offer an explicit criterion (e.g., Jeng, 1979; Hsu, 1987; Liu, 2007; Xie, 2009). As one would expect, phonological emergence was much earlier than phonological stabilization (see the direct comparisons within the same studies in Tables 12.3 and 12.4). Second, larger samples were typically recruited in the cross-sectional studies than in the longitudinal ones, and consonants often emerged earlier in the longitudinal studies than in the cross-sectional ones. For example, all Mandarin consonants emerged before three years in the longitudinal study by Si (2006), but emergence extended to 4;6 in the cross-sectional study by Hua and Dodd (2000). Dodd (1995) likewise reported that the English phoneme repertoires found in longitudinal studies contained more phonemes than those found in cross-sectional studies. A plausible explanation for this difference may be that different types of speech sample
were collected. Although both types of study attempted to collect spontaneous speech samples, the children in the longitudinal studies could partially control topics and content and produce familiar words during daily communication with caregivers, and were presumably less stressed and nervous than in the picture-naming task often adopted in cross-sectional studies. Since some longitudinal studies involved parents (often also the authors) transcribing their own children's speech (e.g., Cao, 2003; Si, 2006; Zhong, 2013), another possible reason for earlier emergence in the longitudinal studies is that familiar listeners gave credit for intended phonemes even when those phonemes were imperfectly realized. Third, in the studies which adopted the picture-naming task, some of the younger children failed to produce the target word spontaneously and were then asked to imitate the examiner (e.g., Hsu, 1987; Hua & Dodd, 2000; Xie, 2009; Zhong, 2013). Since previous studies have quantified the gain in speech intelligibility provided by 'lip reading' over the acoustic signal alone (e.g., Benoît & Goff, 1998), children in these studies might have learned the target productions from the examiner at testing time, so the results might not reflect their actual sound production ability. Fourth, although the children in all these studies were acquiring Mandarin as their first language, their language environments were not the same, as summarized in the last column of Table 12.2. For example, the children in Chao's (1951) and Clumeck's (1980) studies were acquiring Mandarin in America, whereas those in Jeng's (1979), Hsu's (1987), and Shiu's (1990) studies were in Taiwan and influenced by Taiwanese (or Hokkien, of the Southern Min dialect group of Chinese) on a daily basis. It is not clear how much influence a second language environment (another language or another Chinese dialect) may exert on the age and order of acquisition of Mandarin phonemes.
Finally, individual variation in phonological development may be one cause of the disagreements among the above studies, as many longitudinal findings were obtained from a single case study (e.g., Chao, 1951; Cao, 2003; Si, 2006; Zhong, 2013).

Despite the differences in analyses and findings mentioned above, these studies have reached a great deal of consensus on the order and rate of acquisition of some Mandarin consonants. First, consonants with different manners were acquired in a specific order, with stops and nasals appearing earlier than fricatives and affricates, which in turn appeared earlier than laterals. The voiceless consonants were consistently acquired before the corresponding voiced ones in young children, and, similarly, unaspirated consonants were in most cases acquired earlier than the corresponding aspirated ones. Second, among the six Mandarin fricatives, [ɕ] and [x] were always the first to be acquired, while [s] was often acquired later. The earliest acquired affricates were always the alveolo-palatal ones [tɕ] and [tɕʰ], while the other four affricates were often the last to be acquired among all 22 Mandarin consonants. Third, the labial consonants [m] and [p] and the alveolar [t] were usually among the earliest Mandarin consonants acquired, whereas the lateral [l] and the four retroflex sounds [ʂ, ʐ, tʂ, tʂʰ] were always among the last.


12.2.2 The Acquisition of Mandarin Vowels

Although analyses of the Mandarin vowel system remain somewhat unsettled on the number of surface vowels and the distribution of allophones in the phonemic system, most studies propose 12 or 13 surface vowels and four to six vowel phonemes (Cheng, 1973; Lin, 1989). Table 12.5 lists all Mandarin monophthongs, diphthongs, and triphthongs. Among the Mandarin linguofacial monophthongs, [E] and [ə] both occur in very restricted contexts: [E], as a monophthong, is used only in conversational particles expressing a speaker's emotions, such as surprise and agreement, while [ə], as a monophthong, occurs only in weakly stressed syllables. The vowel [ɚ] is a retroflexed central vowel which occurs either in isolation or as retroflexion of another final, and it therefore combines with onset consonants in a very restricted way. The apical vowel [ɿ] occurs only after [ts], [tsʰ], and [s], while the apical vowel [ʅ] occurs only after the initials [tʂ], [tʂʰ], [ʂ], and [ʐ]. There are nine diphthongs and four triphthongs in total.

Table 12.5 Vowels in Mandarin

Monophthongs (linguofacial vowels): high [i], [y] (front), [u] (back); high-mid [ə] (central), [ɤ] (back); low-mid [E] (front), [o] (back); low [A] (central)
Monophthongs (apical vowels): [ɿ], [ʅ]
Retroflex vowel: [ɚ]
Diphthongs: [iA], [uA], [uo], [iE], [yE], [ai], [ei], [Au], [ou]
Triphthongs: [uai], [uei], [iAu], [iou]

Table 12.6 lists 11 well-cited studies on Mandarin vowel acquisition, with details including the criteria used, the collection approach, the speech mode, the sample size, the age range, and the living area of the subjects. As with Mandarin consonant acquisition, differences in these areas, especially in the criteria used, lead to different ages of acquisition of Mandarin vowels being identified.

Table 12.6 An overview of studies on Mandarin vowel acquisition, in chronological order

Studies | Criteria used | Collection approach | Speech mode | Sample size | Age range | Living area
Chao (1951) | Description at a certain age | | Spontaneous production | 1 | 2;4 | USA
Wu and Xu (1979) | Vowel emergence | Longitudinal research | Spontaneous production | 5 | Birth–3;0 | Not clear
Jeng (1979) | Not clear | Longitudinal research | Spontaneous production | 2 | 0;2–1;8 and 1;3–2;7 | Taiwan
Hsu (1987) | Not clear | Cross-sectional research | Spontaneous production and imitation | 28 | 1;0–6;0 | Taiwan
Hua and Dodd (2000) | Vowel emergence and stabilization (66.7% criteria) | Cross-sectional research | Spontaneous production and imitation | 129 | 1;6–4;6 | Beijing
Hua (2002): Chap. 4 | Vowel emergence | Longitudinal research | Spontaneous production and imitation | 4 | 1;1.15–2;0.15; 1;0.0–2;0.15; 0;10.15–2;0.15; 1;2.0–1;8.0 | Beijing
Cao (2003) | Vowel emergence | Longitudinal research | Spontaneous production | 1 | Birth–1;9 | Shandong province
Si (2006) | Not clear | Longitudinal research | Spontaneous production | 1 | 2;0–5;0 | Beijing
Liu (2007) | Not clear | Longitudinal research | Spontaneous production | 10 | (1;6–1;11)–(1;10–2;3) | Shanghai
Shi and Wen (2007) | Vowel stabilization (66.7% criteria) | Cross-sectional research | Spontaneous production and imitation | 40 | 1;0–6;0 | Tianjin
Chen and Kent (2010) | Vowel emergence | Cross-sectional research | Spontaneous production | 24 | 0;7–1;0 and 1;1–1;6 | Taiwan

The specific findings of each study are summarized in the following. In Chao (1951), among the observed patterns of Mandarin vowel acquisition, diphthongs tended to be realized as monophthongs, and the sounds [i, u, y] in diphthongs and triphthongs went through stages of deletion and addition before stabilization. Wu and Xu (1979) found that the earliest occurring Mandarin vowels produced by infants were [a], [E], and [i], which belong to the category of unrounded front vowels. In Jeng (1979), the four vowels [A, Au, i, E] occurred earliest, while [u, y, o] appeared later.

Hsu (1987) found that the monophthongs [A] and [i] emerged first, around the age of 1;1. By the age of 1;6, all monophthongs had emerged, with the exception of [y], which presented difficulties even to children aged 6;0. The diphthongs emerged almost as early as the monophthongs, but only five diphthongs [ai, Au, iA, iε, uA] had stabilized by the age of 6;0. [uai] stabilized first, at the age of 1;9, followed by [iAu] between 2;7 and 3;0; [uei] and [iou] had not stabilized even between the ages of 5;1 and 6;0.

In Hua and Dodd (2000), vowels were found to emerge very early in development. The youngest group of children (1;6–2;0) was able to produce all the monophthongs.
Diphthongs were often reduced to monophthongs, and triphthongs were often reduced to diphthongs (in most cases) or sometimes to monophthongs.

In Hua (2002), among the monophthongs, the central low vowel [A] and the back high vowel [u] were the earliest to emerge in the four children studied, while the retroflex vowel [ɚ] and the back vowel [o] seemed to be the last monophthongs to emerge in the children's output. [ei] was the first diphthong to emerge for all children, and [yε] the last. [iou] was the first triphthong to emerge, while [uai] was the last, for three of the children.

In Cao (2003), the emergence order of the Mandarin monophthongs was [A] > [i] > [E] > [o] > [ɤ] > [ɚ] > [u] > [y]. The Mandarin finals with diphthongs and triphthongs emerged later than the monophthongs, and nasal finals were the last to appear, with the nasal coda emerging and developing gradually.

In Si (2006), the linguofacial vowels occurred earliest, followed by the retroflex vowel; the two apical vowels appeared later. Among the linguofacial vowels, the monophthongs were acquired at age two, one year earlier than the diphthongs and triphthongs. The retroflex vowel [ɚ] was acquired at age 3;6, while the apical vowels [ɿ] and [ʅ] were not fully acquired even at age five.

In Liu (2007), the acquisition order of the Mandarin monophthongs was as follows: linguofacial vowels > back apical vowel [ʅ] > retroflex vowel [ɚ] > front apical vowel [ɿ]. Diphthongs were acquired earlier than triphthongs.

Shi and Wen (2007) conducted a cross-sectional study of Mandarin vowels produced by 40 Mandarin-speaking children aged one to six. Their results indicated that the acquisition order (66.7% criteria) of the Mandarin monophthongs was [A] > [i] > [ɤ] > [u] > [ʅ] > [ɿ] > [y]. The development of Mandarin vowels could be divided into three stages: before age two, from two to three, and after three.
The development of the vowels [A], [i], and [ɤ] occurred before age two, and there was tremendous progress in [u], [ʅ], [ɿ], and [y] between two and three. After three, the development of all Mandarin vowels slowed down and became comparatively stable.

Chen and Kent (2010) recorded spontaneous vocalizations produced by 24 infants grouped by age: G1 (0;7–1;0) and G2 (1;1–1;6). Vowel development exhibited universal patterns, notably the predominance of low and mid vowels, e.g., [E] and [ə], over high vowels. Language-specific patterns were also found, such as the early appearance and acquisition of the low vowel [A]. Vowel production was similar in G1 and G2, and a continuum of developmental changes brought the infants' vocalizations closer to the adult model.

In conclusion, due to methodological differences among the various studies, the significant discrepancies seen in consonant acquisition also apply to the age and order of Mandarin vowel acquisition. Nevertheless, these studies have reached a great deal of consensus on Mandarin vowel acquisition. First, the Mandarin diphthongs and triphthongs were acquired later than the monophthongs, and diphthongs and triphthongs tended to be realized as monophthongs during the process of vowel acquisition. Second, among the monophthongs, the linguofacial vowels were acquired first (with the exception of [y]), while the retroflex vowel [ɚ] and the two apical vowels ([ɿ], [ʅ]) appeared later. Third, the unrounded Mandarin vowels were often acquired earlier
than the rounded vowels ([u], [y]). The low vowel [A] and high vowel [i] were always among the early acquired Mandarin vowels.

12.2.3 The Acquisition of Mandarin Tones

As mentioned, Mandarin is a tonal language that exploits variations in pitch at the syllable level to distinguish lexical meanings. The four lexical tones can be categorized phonologically into a high-level tone (Tone 1), a mid-rising tone (Tone 2), a low-falling-rising tone (Tone 3), and a high-falling tone (Tone 4). The major factors distinguishing the lexical tones are the height and direction of the fundamental frequency (F0) contour. The F0 contour of Tone 3 varies depending on context: it is typically a dipping (low-falling-rising) tone in isolation and a low falling tone in non-final position (Xu, 1997). Moreover, when Tone 3 is produced in continuous speech and followed by another Tone 3, it changes into Tone 2. There are some other tone sandhi rules in Mandarin, which are closely associated with the morphological structures of Chinese words, and sometimes with grammatical structures. In addition, weak stress, often referred to as the 'neutral tone' or weak syllable (see Norman, 1988), is one of the essential prosodic features of Mandarin. This chapter focuses only on the acquisition of the four Mandarin citation tones (Tones 1–4) at the monosyllabic level.

The findings regarding Mandarin tone acquisition are comparatively more consistent than those for the acquisition of Mandarin vowels and consonants. Studies on tone production suggest very early mastery of Mandarin tones and report that children produce tones correctly around age two, well before they have achieved mastery of consonants and vowels (e.g., Chao, 1951; Clumeck, 1980; Hua & Dodd, 2000; Hua, 2002; Li & Thompson, 1977; Si, 2006). Furthermore, Tone 1 and Tone 4 have been found to be mastered earlier than Tone 2 and Tone 3 (Clumeck, 1980; Hua, 2002; Li & Thompson, 1977), and most tone errors involve a lack of distinction between Tone 2 and Tone 3 (e.g., Clumeck, 1977; Li & Thompson, 1977).
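Since the paragraph above reduces tone identity to the height and direction of the F0 contour, that logic can be caricatured in a few lines of Python. The contours and the 5% thresholds below are illustrative inventions, not values from the cited studies:

```python
def classify_tone(f0):
    """Rough four-way tone guess from an F0 contour (Hz), using only overall
    shape and direction: level (T1), rising (T2), dipping (T3), falling (T4).
    The 5% thresholds are illustrative, not calibrated."""
    start, end = f0[0], f0[-1]
    lowest = min(f0)
    tolerance = 0.05 * start
    if max(f0) - lowest < tolerance:          # nearly flat contour
        return "Tone 1"
    if lowest < min(start, end) - tolerance:  # falls, then rises back up
        return "Tone 3"
    return "Tone 2" if end > start else "Tone 4"

print(classify_tone([220, 221, 220, 222, 221]))  # flat, high contour
print(classify_tone([180, 185, 195, 210, 230]))  # rising contour
print(classify_tone([200, 180, 165, 170, 190]))  # dipping contour
print(classify_tone([260, 240, 215, 190, 170]))  # falling contour
```

A real classifier would of course be trained on measured F0 trajectories rather than hand-set thresholds, but the sketch shows why Tone 2 and Tone 3 are the confusable pair: both end with a rise and differ mainly in whether a preceding dip is detected.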
Yeung, Chen, and Werker (2013) systematically explored developmental changes in tone perception in Mandarin-learning infants (see also Tsao & Liu, Chap. 10, this volume). Their results demonstrated that language experience can affect the perception of lexical tones from as early as four months: English-, Cantonese-, and Mandarin-exposed infants showed different discrimination abilities that accorded with the properties of their native languages at this stage. This study suggests that the formation of tone categories takes place earlier than that of vowels and consonants; what was previously regarded as a language-general stage of phonological development (from birth to six months) appears not to apply to infants whose mother tongue is a tonal language. Moreover, Tsao (2008) examined whether the acoustic similarity between lexical tones affects the discrimination performance of 10- to 12-month-old infants using the head-turn methodology, in which infants are taught to turn their heads to a sound or to a change in a sound sequence. The results showed that the discrimination accuracy between Tone 1 and
Tone 3, which is the most distinct contrast acoustically, was greater than that for other less distinct tonal contrasts (e.g., Tone 2 vs. Tone 3). In a picture-pointing task, three-year-old children showed greater accuracy in perceiving Tone 1, Tone 2, and Tone 4 (90%, 87%, and 89%, respectively) than Tone 3 (70%), which was most frequently misidentified as Tone 2 (Wong et al., 2005). In conclusion, it seems that Mandarin tones are fully acquired earlier than segmental elements (vowels and consonants). The production and perception of Mandarin Tone 1 and Tone 4 are acquired earlier than Tone 2. Tone 3 is always the last tonal category to be acquired in Mandarin. The patterns of tone acquisition identified here raise two specific questions: Why is tone acquisition completed earlier than that of segments? Why is one tone acquired earlier than another? We will discuss these questions in the next section along with related theories of phonological acquisition.

12.3 Theoretical Approaches to Mandarin Phonological Acquisition

The theory of child phonology began with Jakobson's (1941/1968) monograph Child Language, Aphasia, and Phonological Universals (translated from German in 1968), which is probably the best-known and most influential account of phonological development and is grounded in the framework of structural linguistics. The theory suggests that the early acquisition of a sound depends on its distribution across the world's languages.

The acquisition of Mandarin phonetic units partially supports Jakobson's theory. According to his 'laws of irreversible solidarity,' nasals should be acquired before orals, front consonants before back consonants, and stops before fricatives. Shiu (1990) found that the early acquisition of [p] and [m] was followed by the establishment of the labial-dental contrasts between [p] and [t], and between [m] and [n]. This finding agrees with Jakobson's postulation of the first and second consonant splits (i.e., the first contrast within the consonantal system is between nasal and oral, the second between labial and dental). However, according to Hua and Dodd (2000), front consonants ([f]) are acquired at about the same stage as back consonants ([x, ŋ]). Moreover, the alveolo-palatal consonants [tɕ, tɕʰ, ɕ] tend to be acquired earlier than their alveolar counterparts [ts, tsʰ, s], which have a more frontal articulation. These findings clearly contradict Jakobson's prediction of the earlier acquisition of front consonants.

Moreover, a sound or feature with a high distribution frequency across the world's languages should be acquired early (Jakobson, 1941/1968), and vice versa. The late acquisition of the Mandarin vowel [y], which is a rather rare sound across languages, strongly supports this notion.
Furthermore, high falling tones are more frequent across the world's languages than rising tones, and one study (Chen & Kent, 2009) indicates that in Mandarin-learning infants, falling contours (Tone 4) occur significantly more often than rising contours (Tone 2), similar to
the prosodic patterns found in English-learning infants during the first year of their lives. However, the three alveolo-palatals ([tɕ], [tɕʰ], [ɕ]), which are very rare among the world's major languages, emerge very early in Mandarin. These data do not support Jakobson's proposal that the frequency of a phoneme across the world's languages reflects its age of acquisition.

The notion of 'markedness' has also been used to interpret similarities and differences in the order of sound acquisition (Eckman, 1977). It has been hypothesized that the sounds appearing early in a child's phonological inventory are maximally unmarked, while those appearing late are marked. Unmarked features are assumed to be acquired first because they are considered more phonetically natural, whereas marked features (e.g., aspiration, roundedness, retroflexion, and a falling-rising pitch direction) are acquired later; accordingly, children tend to replace marked features with unmarked ones in the early stages.

Jakobson's 'laws of irreversible solidarity' and the theory of 'markedness' sought to explain children's acquisition of sounds in relation to the structures of the languages they are learning. In contrast, other researchers (Kent, 1992; Locke, 1983) emphasized the role of child-centered articulatory and perceptual constraints on children's acquisition of phonology. The phonemes acquired later in Mandarin include all the retroflex sounds, liquids, and rounded vowels. The late acquisition of these sounds, which are believed to be difficult to articulate and perceive (Locke, 1983), supports the hypothesis that biological constraints affect the order of phonological acquisition.
Furthermore, in terms of the acquisition order of Mandarin lexical tones, Wong (2012) indicated that the order of accuracy of Mandarin children's four tones (i.e., Tone 4, Tone 1, Tone 2, and Tone 3, from highest to lowest accuracy) follows the order of articulatory complexity, and suggested that tone acquisition is closely related to the maturation of speech motor control. In addition, the early acquisition of a particular feature such as affrication (e.g., [tɕ], [tɕʰ]) in Mandarin highlights the possible influence of the ambient language on acquisition.

In contrast to nativist theory (Chomsky, 1965), the 'environmentalist' approach, originating in Skinner's (1957) behaviorism, treats children's learning as a stimulus-response process. This approach succeeds in drawing attention to the role of the environment (e.g., language input) in acquisition. Chen and Kent (2010) used a text corpus of 1,177,984 Chinese characters (Cheng, 1982; Liu et al., 1975), including over 900,000 syllables with both consonant and vowel components. They found that one of the major characteristics of Mandarin is a slightly higher frequency of affricates (26.89%) than of fricatives (24.97%). Before affricates are completely acquired, they can be replaced either by stops (e.g., [ts] by [t]) or by other affricates (e.g., [tsʰ] and [tɕʰ] by [ts]), but they have never been found to be replaced by fricatives. Fricatives, however, are sometimes replaced by affricates; for example, the fricative [ɕ] was found to be substituted by the affricates [tɕ] and [tɕʰ] (Hua & Dodd, 2000; Hua, 2002). Moreover, the predominant production of [A] over other vowels in Mandarin-learning infants is closely related to the pattern in the surrounding child-directed speech. The occurrence of low-central vowels (i.e., [A])
was over 28% (the highest frequency among all the Mandarin vowels) in caregivers' child-directed speech (Chen & Kent, 2010).

Hua and Dodd (2000) systematically studied the production of Mandarin among one- to four-year-old children and reported that children acquired phonological elements in the following order: tones were acquired first, followed by vowels and syllable-final consonants, which were in turn followed by syllable-initial consonants. The phonological saliency hypothesis (Hua & Dodd, 2000) may account for this order of phonological production in Mandarin, with Mandarin tones being the most salient component: tone is compulsory for every syllable, switching lexical tones changes word meaning, and there are only four choices. Syllable-initial consonants have the lowest saliency: their presence is optional (not all syllables have syllable-initial consonants), and there is a range of 21 syllable-initial phonemes that can be used. Vowels are compulsory syllable components, but the relatively large number of options (including monophthongs, diphthongs, and triphthongs) lowers their saliency. In short, differences in the saliency of individual components in a given language may lead to variations in developmental patterns.

Singh and Fu (2016) proposed further explanations for why Mandarin tones are acquired earlier than other speech units. One possible explanation for the early emergence of tones pertains to their provenance: vocal pitch. Pitch plays a role in every language in the form of intonation, which is likewise marked primarily by pitch movements; for example, questions and exclamatory sentences are distinguished mainly by pitch in tonal and non-tonal languages alike. Similarly, the use of pitch to signal emotion is robust across languages (Lieberman, 1967). Moreover, the centrality of pitch in auditory processing is evidenced at various stages of development.
In the earliest phases of auditory perception, pitch, together with stress and rhythm, is preferentially available to infants before birth (Fifer & Moon, 1988). In conclusion, children’s phonological acquisition is a highly complicated process influenced by a variety of factors, including the role of different phonetic units within a given language and their relationship with other languages, the influence of surrounding speech environments, and the development of biological bases and cognitive ability in children. Consequently, no single theory is adequate to account for all the phenomena documented in studies of phonological acquisition; yet each can account for some aspects of the data (Stoel-Gammon & Sosa, 2007). The explicit dividing line between different theories is not always easy to determine because researchers borrow a feature of one theory and incorporate it into another, as the knowledge of phonological development evolves with time.
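The input-frequency reasoning of the environmentalist approach (e.g., Chen and Kent's affricate-versus-fricative counts) reduces to tallying manner classes over a phonemically transcribed corpus. A toy sketch follows; the ten-syllable 'corpus' is invented and merely stands in for their 1,177,984-character corpus:

```python
from collections import Counter

# Manner classes for some Mandarin syllable-initial consonants.
AFFRICATES = {"ts", "tsʰ", "tʂ", "tʂʰ", "tɕ", "tɕʰ"}
FRICATIVES = {"f", "s", "ʂ", "ɕ", "x"}

# Toy transcribed input: the syllable-initial consonant of each of ten syllables.
toy_corpus = ["tɕ", "x", "ts", "ʂ", "tɕʰ", "m", "s", "tʂ", "f", "tɕ"]

counts = Counter(
    "affricate" if c in AFFRICATES else "fricative" if c in FRICATIVES else "other"
    for c in toy_corpus
)
total = sum(counts.values())
for manner, n in counts.most_common():
    print(f"{manner}: {n}/{total} = {n / total:.0%}")
```

Scaled up to a real transcribed corpus, the same tally is how frequency figures like the 26.89% (affricates) versus 24.97% (fricatives) cited above are obtained.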

12.4 The Phonological Maturation of Mandarin Speech

The progressive development of phonological acquisition involves three stages: phonological emergence, phonological stabilization, and phonological maturation. Previous studies have mainly focused on the emergence and stabilization of different
phonetic units of speech. The age of emergence records the first time a child can articulate a phonetic unit, and the age of stabilization indicates when a child can produce a phoneme with a certain degree of phonological accuracy and consistency (i.e., 66.7% accuracy). Unsurprisingly, the age of stabilization is in most cases later than the age of emergence. The question that naturally arises is when children attain the same speech competence as adults, in other words, the timing of phonological maturation. On the one hand, perceptual maturation refers to children reaching a perceptual competence equal to that of adult listeners (e.g., Chen et al., 2017; Lee et al., 2012; Xi et al., 2009). On the other hand, production maturation can be evaluated either through acoustic measurement (e.g., Chen, 2007; Ma, Chen, Wu, & Zhang, 2018; Shi & Wen, 2007) or through native adults' perceptual judgement of children's speech output (e.g., Wong, 2012, 2013).

On the acquisition of Mandarin tones, early studies (e.g., Chao, 1951; Clumeck, 1980; Hua, 2002; Hua & Dodd, 2000; Li & Thompson, 1977) indicated that lexical tones were produced with considerable accuracy before the age of three. Judgments of tone errors in these studies were typically made by native adult observers in a categorical fashion (correct or incorrect). However, it is important to note that such early production of tones before three years of age does not mean that these children had adult-like tonal production ability. Innovations in speech analysis tools have afforded greater precision in the evaluation of early tone production. A series of studies (Wong, Schwartz, & Jenkins, 2005; Wong, 2012, 2013) reported that three- to five-year-old preschoolers have not yet fully mastered the production of Mandarin tones. Tone productions of adults (a control group) and children were collected in a picture-naming task and low-pass filtered to remove lexical information.
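The low-pass filtering step, which removes segmental (lexical) detail while keeping the low-frequency region that carries F0, can be sketched with a simple one-pole filter. This is a crude stand-in for whatever filter the cited studies actually used, and the 400 Hz cutoff and two-component test signal are illustrative:

```python
import numpy as np

def lowpass(signal, sr, cutoff):
    """One-pole IIR low-pass filter: passes energy below `cutoff` Hz and
    progressively attenuates energy above it."""
    dt = 1.0 / sr
    rc = 1.0 / (2.0 * np.pi * cutoff)
    alpha = dt / (rc + dt)
    out = np.empty_like(signal)
    acc = signal[0]
    for i, x in enumerate(signal):
        acc += alpha * (x - acc)  # smooth toward the incoming sample
        out[i] = acc
    return out

sr = 16_000
t = np.arange(0, 0.05, 1.0 / sr)
# Synthetic 'speech': a 200 Hz fundamental (the tone-bearing part) plus a
# 3 kHz component standing in for segmental detail.
x = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)
y = lowpass(x, sr, cutoff=400)

spec = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), 1.0 / sr)
print("energy at 200 Hz vs 3 kHz after filtering:",
      spec[freqs == 200][0], spec[freqs == 3000][0])
```

After filtering, the 200 Hz component dominates the spectrum while the 3 kHz component is strongly attenuated, which is why listeners can still judge tone, but not the word, in such stimuli.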
Native speakers categorized the target tones in the low-pass filtered productions, in which only tonal information was preserved. Children's tone accuracy was compared to that of the adults to determine the level of mastery and developmental change. None of the Mandarin tones produced by the three- to five-year-old children reached adult-like accuracy, suggesting a protracted course of development extending beyond age five. These findings stand in contrast to earlier studies that claimed very early acquisition (emergence or stabilization) of stable tone productions (Chao, 1951; Clumeck, 1980; Hua, 2002; Hua & Dodd, 2000; Li & Thompson, 1977). On Mandarin tone perception, although a study of three-year-old children suggested that they had already achieved relatively high perceptual accuracy on all four Mandarin tones (Wong et al., 2005), research on the developmental course of categorical perception (CP) of Mandarin tones is still limited. The study of CP is useful because it offers a much more refined method for tracing the stabilization and maturation of children's fine-grained perception of Mandarin tones beyond age three. A higher degree of CP likely indicates enhanced perceptual ability in young children. Using the classic CP paradigm, Chen et al. (2017) explored how CP of Mandarin tones emerges among 70 four- to seven-year-old children and 16 adults (a control group). Mandarin-speaking children exposed to a native tonal language could perceive Mandarin Tone 1 and Tone 2 categorically. The positions of the identification boundaries did not differ significantly between children and adults,

12 Speech Development in Mandarin-Speaking Children


Fig. 12.2 Box plots of boundary widths within each age group, adapted from Chen et al. (2017)

but the boundary widths between Tone 1 and Tone 2 did differ significantly, with much narrower boundary widths (i.e., sharper boundaries) in six-year-olds than in five-year-olds (see Fig. 12.2). Moreover, the ability to distinguish fine-grained tonal differences in between-category pairs improved gradually with age as perceptual experience accumulated. These findings contribute to the literature by uncovering the general developmental course of CP during the maturation of tone perception in young children. The indices of boundary width (i.e., transition slope) and between-category discrimination accuracy can thus be used quantitatively to explore the developmental trajectory of perceptual ability during the process of maturation. Previous studies have mainly analyzed Mandarin consonant and vowel acquisition using conventional phonetic methods. In most cases, only one native adult, usually the author, transcribed and then rated the speech naturally produced by children. The resulting ratings were often subjective and, to some extent, shaped by the transcriber's own perceptual categories. Advances in acoustic analysis can help us evaluate children's production and/or perception more objectively, and further track the stage of maturation. For example, Shi and Wen (2007) compared the formant patterns of different Mandarin vowels spoken by young children and adults. The vowel spaces (defined by F1 and F2 values) of adults offer a standard F1–F2 reference for tracking the development of vowel production. Voice onset time (VOT) has also been used to chart the development of production (Chen, 2007) and perception (Xi et al., 2009) of Mandarin aspirated vs. unaspirated stops.
Chen (2007) showed that children first go through a period (1;5–1;6) in which no VOT distinction is made between the stop consonants, pass to a stage (1;7–1;9) in which a systematic but not yet adult-like distinction is made, and finally reach a stage (1;10–2;11) in which the VOT values of stops come to resemble the adult model through systematic refinement.
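The categorical-perception indices described above (identification boundary position and boundary width, i.e., transition slope) are typically obtained by fitting a sigmoid to identification data. The sketch below fits a logistic function by a crude grid search; the nine-step continuum and the response proportions are invented for illustration, not data from Chen et al. (2017):

```python
import math

def logistic(x, x0, k):
    """Proportion of 'Tone 2' responses at continuum step x."""
    return 1.0 / (1.0 + math.exp(-k * (x - x0)))

steps = list(range(1, 10))                       # hypothetical 9-step continuum
p_tone2 = [0.02, 0.03, 0.08, 0.20, 0.55,
           0.85, 0.95, 0.97, 0.99]               # hypothetical responses

# Grid search for the boundary position x0 and slope k minimizing squared error.
best_err, x0_hat, k_hat = float("inf"), None, None
for x0 in (i / 20 for i in range(40, 141)):      # candidate x0 in [2.0, 7.0]
    for k in (j / 20 for j in range(4, 81)):     # candidate k in [0.2, 4.0]
        err = sum((logistic(s, x0, k) - p) ** 2 for s, p in zip(steps, p_tone2))
        if err < best_err:
            best_err, x0_hat, k_hat = err, x0, k

boundary_position = x0_hat                       # the 50% crossover step
# Boundary width as the distance between the 25% and 75% points; for a
# logistic this equals 2*ln(3)/k, so a steeper slope means a sharper boundary.
boundary_width = 2 * math.log(3) / k_hat
```

A narrower `boundary_width` corresponds to the sharper Tone 1/Tone 2 boundaries that Chen et al. (2017) report for six-year-olds relative to five-year-olds.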


G. Peng and F. Chen

12.5 Future Directions in Mandarin Phonological Acquisition

Through a discussion of previous studies on Mandarin phonological acquisition, this chapter has examined the developmental trajectories (phonological emergence, stabilization, and maturation) of individual phonetic units of speech (e.g., Mandarin consonants, vowels, and tones). However, large discrepancies have been reported in the order and age of acquisition. These conflicting results can be attributed to many factors, including the criteria used, the data-collection approach (cross-sectional or longitudinal), the speech mode (spontaneous production or imitation), the sample size, the age range and place of residence of the participants, the number of transcribers, and so on. Unsurprisingly, these differences, especially in the criteria used, lead to different reported ages and orders of acquisition of speech sounds. In order to establish a more systematic, scientific, reliable, and representative normative data set describing the order and age of Mandarin phonological acquisition, future studies should pay close attention to controlling these factors. Moreover, future normative data should be based on a large, representative sample in order to reflect the true population and thereby minimize the impact of individual differences. Previous studies on Mandarin phonological acquisition have focused mainly on the development of speech production; studies investigating the perceptual development of Mandarin consonants and vowels remain scarce. Moreover, speech perception and production are presumed to be correlated constructs, exemplified by the fact that English speakers with finer-grained discrimination of a phonetic contrast are also likely to produce the same phonemes with a greater degree of acoustic contrast (e.g., Fox, 1982). The extent of transfer between perception and production points to an important area of research in first language development (Singh & Fu, 2016).
A more precise picture of the Mandarin production–perception interface will inform and deepen our understanding of how these two domains are linked during Mandarin phonological acquisition. The concern with generalizations about the order of acquisition leaves little room for considering individual differences in phonological development. Yet any careful comparison of different children learning the same language reveals differences in their individual paths of development. Individual, cultural, and social factors, such as gender, socioeconomic status, sibling status, intelligence, personality, cognitive style, and parenting behaviors (in particular parents' language habits), have been studied in relation to phonological development (for a review, see Winitz, 1969). Discovering the variables that play an important role in children's language acquisition will have implications for how norms should be derived and applied to clinical populations (Dodd et al., 2003). Future studies may specifically evaluate the influences of these individual and social factors on Mandarin phonological acquisition. It would also be worthwhile to investigate the specific influence of an L2 environment (another language or another Chinese dialect) on the age and order of Mandarin sound acquisition.


Moreover, behavioral research often requires overt responses and sustained concentration, which is relatively difficult for young children and infants. The development of neuroscience techniques offers a valuable opportunity to uncover the neural substrates of how young brains process the smallest building blocks of speech (i.e., phonemes). Noninvasive techniques that enable the examination of language processing in infants and young children have advanced rapidly, including electroencephalography (EEG)/event-related potentials (ERPs), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and near-infrared spectroscopy (NIRS). Research suggests that exposure to language in the first year of life begins to set the neural architecture that supports infants' subsequent acquisition of language (Kuhl & Rivera-Gaxiola, 2008). Recently, several ERP studies have investigated brain responses to different Mandarin phonetic units (Cheng et al., 2013; Lee et al., 2012; Liu et al., 2014; Lee & Cheng, Chap. 6 this volume). These studies used mismatch responses (MMRs), consisting of the mismatch negativity (MMN) and the positive mismatch response (p-MMR), as candidate brain indices of the development of Mandarin speech perception. The developmental patterns revealed by such noninvasive techniques are important for understanding the underlying neural and physiological bases of Mandarin speech development. Finally, as suggested by Wang (1978), how children learn a language, that is, its transmission across generations, is clearly one of the vital questions in the study of language change. As in historical sound change, both lexical and phonetic parameters are involved in children's phonological acquisition. This point, missed by most studies, helps connect the microhistory seen in language acquisition to the mesohistory studied by historical linguistics.
Two important studies in this area (Ferguson & Farwell, 1975; Hsieh, 1972) investigated the development of phonological production in relation to the acquisition of words, and showed a primacy of lexical learning during phonological development. Importantly, the child does not progress by acquiring units such as phonemes or allophones, but rather by gradually adding lexical items to his or her repertoire. The same sound appearing in different words may follow different developmental trajectories, and the unity of the phoneme emerges only when the acquisition process is complete. Consequently, the basic unit of acquisition is something like the word, and many valuable conclusions can be derived from observing the lexically gradual nature of sound substitution over the long course of children's language learning.

References

Benoît, C., & Le Goff, B. (1998). Audio-visual speech synthesis from French text: Eight years of models, designs and evaluation at the ICP. Speech Communication, 26(1), 117–129.
Cao, J. X. (2003). One case study of early phonological development in Mandarin-speaking children. In Proceedings of the 6th National Symposium on Modern Phonetics. Tianjin: Tianjin Normal University. [曹井香. 汉族儿童早期语音发展个案研究. 第六届全国现代语音学学术会议论文集. 2003].


Chao, Y. R. (1951). The Cantian idiolect: An analysis of the Chinese spoken by a twenty-eight-months-old child. In W. J. Fischel (Ed.), Semitic and oriental studies (pp. 27–44). Berkeley, CA: University of California Press.
Chen, F., Peng, G., Yan, N., & Wang, L. (2017). The development of categorical perception of Mandarin tones in four- to seven-year-old children. Journal of Child Language, 44(6), 1413–1434.
Chen, F. Y. (2007). The development of voicing contrast in Mandarin: A longitudinal case study (Unpublished MA dissertation). College of Foreign Languages, Hunan University. [陈霏燕. 普通话儿童塞音送气对立获得研究:一个个案研究. 湖南大学硕士学位论文, 2007].
Chen, L. M., & Kent, R. D. (2009). Development of prosodic patterns in Mandarin-learning infants. Journal of Child Language, 36(1), 73–84.
Chen, L. M., & Kent, R. D. (2010). Segmental production in Mandarin-learning infants. Journal of Child Language, 37(2), 341–371.
Cheng, C. C. (1973). A synchronic phonology of Mandarin Chinese. The Hague: Mouton.
Cheng, C. M. (1982). Analysis of present-day Mandarin. Journal of Chinese Linguistics, 10, 281–358.
Cheng, Y. Y., Wu, H. C., Tzeng, Y. L., Yang, M. T., Zhao, L. L., & Lee, C. Y. (2013). The development of mismatch responses to Mandarin lexical tones in early infancy. Developmental Neuropsychology, 38(5), 281–300.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge, MA: MIT Press.
Clumeck, H. V. (1977). Studies in the acquisition of Mandarin phonology (Unpublished doctoral dissertation). Berkeley: University of California.
Clumeck, H. (1980). The acquisition of tone. In G. H. Yeni-Komshian, J. F. Kavanaugh, & C. A. Ferguson (Eds.), Child phonology: Production (Vol. 1, pp. 257–275). New York, NY: Academic Press.
Dodd, B. (1995). Differential diagnosis and treatment of children with speech disorder. London: Whurr Publishers.
Dodd, B., Holm, A., Hua, Z., & Crosbie, S. (2003). Phonological development: A normative study of British English-speaking children. Clinical Linguistics & Phonetics, 17(8), 617–643.
Eckman, F. R. (1977). Markedness and the contrastive analysis hypothesis. Language Learning, 27, 315–330.
Ferguson, C. A., & Farwell, C. B. (1975). Words and sounds in early language acquisition: English initial consonants in the first fifty words. Language, 51, 419–439.
Fifer, W. P., & Moon, C. (1988). Auditory experience in the fetus. In W. P. Smotherman & S. R. Robinson (Eds.), Behavior of the fetus (pp. 175–188). Caldwell, NJ: Telford Press.
Fox, R. A. (1982). Individual variation in the perception of vowels: Implications for a perception-production link. Phonetica, 39(1), 1–22.
Hombert, J. M., Ohala, J. J., & Ewan, W. G. (1979). Phonetic explanations for the development of tones. Language, 55(1), 37–58.
Hsieh, H. I. (1972). Lexical diffusion: Evidence from child language acquisition. Glossa, 6(1), 89–104.
Hsu, J. (1987). A study of the various stages of development and acquisition of Mandarin Chinese by children in Taiwan milieu (Unpublished MA dissertation). College of Foreign Languages, Fu Jen Catholic University.
Hua, Z. (2002). Phonological development in specific contexts: Studies of Chinese-speaking children (Vol. 3). Clevedon: Cromwell.
Hua, Z., & Dodd, B. (2000). The phonological acquisition of Putonghua (modern standard Chinese). Journal of Child Language, 27(1), 3–42.
Jakobson, R. (1941/1968). Child language, aphasia and phonological universals. The Hague: Mouton.
Jeng, H.-H. (1979). The acquisition of Chinese phonology in relation to Jakobson's law of irreversible solidarity. In Proceedings of the 9th International Congress of Phonetic Sciences (Vol. 2, pp. 155–161). Copenhagen: University of Copenhagen.


Kent, R. (1992). The biology of phonological development. In C. A. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, implications (pp. 65–90). Timonium, MD: York Press.
Kuhl, P., & Rivera-Gaxiola, M. (2008). Neural substrates of language acquisition. Annual Review of Neuroscience, 31, 511–534.
Lee, C. Y., Yen, H. L., Yeh, P. W., Lin, W. H., Cheng, Y. Y., Tzeng, Y. L., & Wu, H. C. (2012). Mismatch responses to lexical tone, initial consonant, and vowel in Mandarin-speaking preschoolers. Neuropsychologia, 50(14), 3228–3239.
Li, C. N., & Thompson, S. A. (1977). The acquisition of tone in Mandarin-speaking children. Journal of Child Language, 4(2), 185–199.
Lieberman, P. (1967). Intonation, perception and language. Cambridge, MA: MIT Press.
Lin, Y. H. (1989). Auto-segmental treatment of segmental processes in Chinese phonology (Unpublished Ph.D. dissertation). Austin: University of Texas.
Liu, C. Y. (2007). The development of Mandarin phonology from 18 to 23 months in Shanghai (Unpublished MA dissertation). College of Humanities and Communication, Shanghai Normal University. [刘春燕. 18–23个月儿童普通话的语音发展 (上海地区). 上海师范大学硕士学位论文, 2007].
Liu, H. M., Chen, Y., & Tsao, F. M. (2014). Developmental changes in mismatch responses to Mandarin consonants and lexical tones from early to middle childhood. PLoS ONE, 9(4), e95587.
Liu, I. M., Chuang, C. J., & Wang, S. C. (1975). Frequency count of 40,000 Chinese words. Taipei: Lucky Books Company.
Locke, J. (1983). Phonological acquisition and change. New York, NY: Academic Press.
Ma, J., Chen, X., Wu, Y., & Zhang, L. (2018). Effects of age and sex on voice onset time: Evidence from Mandarin voiceless stops. Logopedics Phoniatrics Vocology, 43(2), 56–62.
Norman, J. (1988). Chinese. Cambridge: Cambridge University Press.
Olmsted, D. (1971). Out of the mouth of babes. The Hague: Mouton.
Prather, E. M., Hedrick, D. L., & Kern, C. A. (1975). Articulation development in children aged two to four years. Journal of Speech and Hearing Disorders, 40(2), 179–191.
Shi, F., & Wen, B. Y. (2007). Vowel development in Mandarin-speaking children. Zhongguo Yuwen, 2007(5), 444–454. [石锋, 温宝莹. 汉语普通话儿童的元音发展. 中国语文, 2007(5), 444–454].
Shiu, H.-S. (1990). The phonological acquisition by Mandarin-speaking children: A longitudinal case study on children from 9 months through three years old (Unpublished MA thesis). Taiwan Normal University.
Si, Y. Y. (2006). Mandarin phonological acquisition: A case study. Contemporary Linguistics, 2006(1), 1–16. [司玉英. 普通话儿童语音习得的个案研究. 当代语言学, 2006(1), 1–16].
Singh, L., & Fu, C. S. (2016). A new view of language development: The acquisition of lexical tone. Child Development, 87(3), 834–854.
Skinner, B. F. (1957). Verbal behavior. Englewood Cliffs, NJ: Prentice-Hall.
Slobin, D. I. (1985). The crosslinguistic study of language acquisition: Theoretical issues (Vol. 2). Hillsdale, NJ: Psychology Press.
Stern, C., & Stern, W. (1907). Monographien über die seelische Entwicklung des Kindes (Vol. 1). Leipzig: J. A. Barth.
Stoel-Gammon, C., & Dunn, C. (1985). Normal and disordered phonology in children. Baltimore, MD: University Park Press.
Stoel-Gammon, C., & Sosa, A. V. (2007). Phonological development. Oxford: Blackwell.
Tsao, F. M. (2008). The effect of acoustical similarity on lexical tone perception of one-year-old Mandarin-learning infants. Chinese Journal of Psychology, 50(2), 111–124.
Wang, W. S. Y. (1973). The Chinese language. Scientific American, 228, 50–60.
Wang, W. S.-Y. (1978). The three scales of diachrony. In B. B. Kachru (Ed.), Linguistics in the seventies (pp. 63–75). Urbana, IL: Department of Linguistics, University of Illinois.
Winitz, H. (1969). Articulatory acquisition and behavior. New York, NY: Appleton-Century-Crofts.
Wong, P. (2012). Acoustic characteristics of three-year-olds' correct and incorrect monosyllabic Mandarin lexical tone productions. Journal of Phonetics, 40(1), 141–151.


Wong, P. (2013). Perceptual evidence for protracted development in monosyllabic Mandarin lexical tone production in preschool children in Taiwan. The Journal of the Acoustical Society of America, 133(1), 434–443.
Wong, P., Schwartz, R. G., & Jenkins, J. J. (2005). Perception and production of lexical tones by 3-year-old, Mandarin-speaking children. Journal of Speech, Language, and Hearing Research, 48(5), 1065–1079.
Wu, T. M., & Xu, Z. Y. (1979). A preliminary analysis of language development in children during the first three years. Acta Psychologica Sinica, 11(2), 153–165. [吴天敏, 许政援. 初生到三岁儿童言语发展记录的初步分析. 心理学报, 11(2), 153–165].
Xi, J., Jiang, W., Zhang, L. J., & Shu, H. (2009). Categorical perception of VOT and lexical tones in Chinese and the developmental course. Acta Psychologica Sinica, 41, 572–579.
Xie, H. (2009). The acquisition of Mandarin initial consonants (Unpublished MA dissertation). College of Foreign Languages, Hunan University. [谢衡. 汉语普通话儿童的声母习得研究. 湖南大学硕士学位论文, 2009].
Xu, Y. (1997). Contextual tonal variations in Mandarin. Journal of Phonetics, 25(1), 61–83.
Yeung, H. H., Chen, K. H., & Werker, J. F. (2013). When does native language input affect phonetic perception? The precocious case of lexical tone. Journal of Memory and Language, 68(2), 123–139.
Zhong, X. (2013). A case study of consonant acquisition in Mandarin-speaking children (Unpublished MA dissertation). College of Foreign Languages, Tianjin Normal University. [钟心. 普通话儿童辅音习得个案研究. 天津师范大学硕士学位论文, 2013].

Chapter 13

Behavioral and Neurophysiological Evidence of Speech Processing in Chinese-Speaking Individuals with Autism Spectrum Disorder: A Review and Future Directions

Yan H. Yu and Valerie L. Shafer

Abstract Autism spectrum disorder (ASD) is a neurodevelopmental disorder that presents with core deficits in language and social communication. Recent decades have witnessed a growing number of studies concerning this population's language and communication skills. However, studies focusing on Chinese-speaking individuals with ASD are rare and have only begun to accumulate. This review focuses on prosody and lexical tone perception and production in Chinese-speaking individuals with ASD. We also briefly review evidence from the general ASD literature for cross-language comparison. Similar to patterns seen in many speakers of non-tonal languages with ASD, Chinese-speaking individuals with ASD generally demonstrate atypical pitch in terms of both the average and the range of values in verbal productions. Behavioral and neurophysiological evidence suggests atypicalities such as enhanced lower-level auditory processing and reduced higher-level linguistic processing in Chinese-speaking individuals with ASD. We also report preliminary neural intervention data on bilingual English–Mandarin-learning children with ASD. Future directions for advancing theory and practice are discussed.

Y. H. Yu (B)
Department of Communication Sciences and Disorders, St. John's University, Queens, NY, USA
e-mail: [email protected]; [email protected]

V. L. Shafer
Speech-Language-Hearing Science, The Graduate Center, City University of New York, New York, NY, USA
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2020
H.-M. Liu et al. (eds.), Speech Perception, Production and Acquisition, Chinese Language Learning Sciences, https://doi.org/10.1007/978-981-15-7606-5_13

13.1 Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental disorder with core deficits in social interaction, language, and communication, as defined by the International Classification of Diseases and Related Health Problems, tenth edition (WHO, 1992),


and the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5; American Psychiatric Association, 2013). As the name suggests, individuals with such a "spectrum disorder" show a wide range of symptoms, from mild to severe. The DSM-5 has abandoned subcategorized diagnoses; however, researchers sometimes use subcategorical terms such as "high-functioning autism," "low-functioning autism," and "Asperger's syndrome," especially in publications predating the DSM-5. High-functioning autism (HFA) is an unofficial term used to describe the milder forms of ASD. Individuals with HFA usually have an intelligence quotient of 70 or above and are generally able to use words to communicate in daily life. In contrast, low-functioning autism (LFA) is an unofficial term referring to individuals on the severe end of the autism spectrum, often with an intelligence quotient below 70 and very limited language production and comprehension. Since the first report by Kanner in 1943, the reported prevalence of autism has been increasing. One of the most recent reports from the US Centers for Disease Control and Prevention (CDC) (2014) estimated that ASD occurs in 1 out of 68 children (14.7 per 1000 8-year-old children). The prevalence of ASD in the Chinese population is at least 1.18 per 1000 in China (Sun et al., 2013; Zhang & Ji, 2005), with at least 1.3–2 million children under 13 years old affected by ASD nationwide (Huang, Jia, & Wheeler, 2013). A brief report by Tao (1987) on four cases of infantile autism marked the first published study of ASD in the Chinese population. Chinese languages, such as Mandarin and Cantonese, are tonal languages. Due to a paucity of cross-language studies, it is currently unclear whether theories and findings based on non-tonal languages such as English can be applied to Chinese-speaking individuals with ASD.
The landscape of research on ASD in the Chinese population is still relatively sparse, despite an exponential increase in the number of studies on English-speaking individuals with ASD over the last 15 years. Chinese (e.g., Mandarin or Cantonese) phonology differs from that of non-tonal languages, such as English, in its use of pitch at the phonemic level. Specifically, the pitch pattern (e.g., high fundamental frequency (F0) vs. low-rising F0) serves to differentiate the meanings of words that share the same phonetic segments (e.g., Tone 1 bi means "force" vs. Tone 2 bi means "nose"). Whether individuals with ASD whose native language is tonal show speech-processing deficits similar to those of speakers of non-tonal native languages (e.g., English or Finnish) is an important experimental and theoretical question because, as reviewed in Tsao's chapter (Chap. 10) and Singh's chapter (Chap. 11) in this book, children with tonal language backgrounds may follow a somewhat divergent developmental trajectory from that of children with non-tonal language backgrounds (see also a recent review by Curtin & Werker, 2018). Our goal in this review is to provide an overview and evaluation of the behavioral and neurological evidence on Chinese prosody and lexical tone processing in individuals with ASD. Research on speech perception and production in Chinese individuals with ASD is a recently emerging area, but significant progress has been made. Systematically summarizing our understanding of prosody and lexical tone processing in tonal language speakers, as well as how these factors


associate with ASD at both behavioral and brain levels, will further the research and clinical practice in these areas. We searched the following databases: Cochrane, ERIC, Google Scholar, NCBI/PubMed, PsycINFO, and Web of Science, using the keywords: {“lexical tone,” “prosody,” “intonation,” “pitch,” or “fundamental frequency”} and {“Mandarin,” “Cantonese,” or “Chinese”} and {“Autism” or “Asperger”} in March and August 2017. We checked the bibliographies of the relevant articles and found only nine relevant studies on Chinese prosody and lexical tone perception and production across the field of behavioral and neurophysiological research. Two of the studies focused on Cantonese and seven focused on Mandarin. In this review, we examine these studies in relation to the extensively researched area of prosody, pitch production, and perception in individuals with ASD from English and other non-tonal language backgrounds. We first discuss findings from the behavioral literature and then describe recent data obtained using brain measures. This review is followed by a case study on the neuroplasticity of children with ASD. Lastly, we discuss the theoretical and clinical significance of the evidence that has accumulated thus far, and we point out gaps and challenges in understanding prosody, lexical tone perception, and production of Chinese-speaking individuals with ASD.

13.2 Prosody in Individuals with ASD: General Background

The terms "prosody" and "intonation" are often used interchangeably in the literature. In this paper, we use "prosody" as a superordinate term to describe changes in pitch, intensity, duration, and voice quality (Cummins et al., 2015; Titze, 1994). In other words, prosody is a suprasegmental feature of speech expressed via variations in pitch (fundamental frequency), loudness (intensity), duration, stress, and rhythm (Cutler & Isard, 1980). Pitch is the perceptual correlate of the frequency of vocal fold vibration (i.e., F0). During vocal production, individuals often modulate their pitch to convey different emotions and pragmatic connotations (e.g., posing a question, delivering a statement, issuing an imperative, expressing surprise). In tonal languages such as Mandarin, Cantonese, and Thai, pitch also serves as a phonemic contrast, in which role it is referred to as "lexical tone." Two of the most widely used diagnostic tools for ASD in English-speaking contexts, the Autism Diagnostic Observation Schedule-2 (ADOS-2; Lord et al., 2012) and the Autism Diagnostic Interview-Revised (ADI-R; Rutter, Le Couteur, & Lord, 2003), include atypical prosody production as a diagnostic criterion. Atypical prosodic production and perception in individuals with ASD have been reported since the onset of research on this group (Asperger, 1944; Kanner, 1943; Simmons & Baltaxe, 1975). Some researchers have proposed that distinctive and atypical vocal characteristics, such as monotonous and robot- or machine-like speech, may serve as


one of the earliest-appearing biological markers of a later ASD diagnosis. However, there is some evidence that children with ASD may have intact prosody perception (Grossman & Tager-Flusberg, 2012; Paul, Augustyn, Klin, & Volkmar, 2005) or even superior pitch processing skills (e.g., Bonnel, Mottron, Peretz, Trudel, Gallun, & Bonnel, 2003; Stanutz, Wapnick, & Burack, 2014). An understanding of prosodic production and perception in ASD is critical, since prosody often serves as a major cue conveying both linguistic and paralinguistic functions (Crystal & Quirk, 1964). Moreover, attention and sensitivity to prosody play a critical role in early language development (Jusczyk, 1997; Mehler et al., 1988). Children with ASD are known to have difficulty detecting vocal prosodic cues that convey irony and sarcasm (Wang, Dapretto, Hariri, Sigman, & Bookheimer, 2004; Wang, Lee, Sigman, & Dapretto, 2006, 2007). Children and adolescents with high-functioning autism (HFA) sometimes have difficulty using vocal cues to infer a speaker's intentions in tasks designed to probe theory of mind (ToM) (e.g., Chevallier, Noveck, Happé, & Wilson, 2011). ToM refers to the ability to understand that other people have thoughts, intentions, and feelings separate from one's own, an ability that individuals with ASD are held to find difficult (Baron-Cohen, Leslie, & Frith, 1985). In an fMRI study, Eigsti and colleagues found that, compared to typically developing peers, high-functioning children with ASD showed broader recruitment of brain areas while processing both affective and grammatical prosodic cues (Eigsti, Schuh, Mencl, Schultz, & Paul, 2012). The authors suggested that, for a fairly simple language-processing task, greater recruitment of areas involved in executive functions and what they refer to as "mind-reading" functions can be interpreted as reduced automaticity in language processing.

13.3 Atypical Prosody Production in ASD

13.3.1 Atypical Prosody Production in ASD with Non-tonal Language Background

The speech of individuals with ASD has been described as both "monotone" and "exaggerated" (Baltaxe & Simmons, 1985). Atypical prosody production has been suggested as a "bellwether" of the cognitive profiles of individuals with ASD, as well as a behavioral indicator of subtypes of ASD (e.g., hypersensitive vs. hyposensitive to auditory input) (Diehl, Berkovits, & Harrison, 2010, p. 167). However, the precise features of prosody in the speech of individuals with ASD are only recently becoming evident. A review of 16 earlier studies on prosody production in ASD by McCann and Peppé (2003) revealed many contradictory findings; these contradictions may be due to inadequate research, small sample sizes, and great variability in methodology across studies. A number of more recent studies with larger sample sizes indicate that higher mean pitch and/or wider pitch range in the speech of participants with ASD is the primary


prosodic difference in comparison with the speech of controls in tasks such as lexical elicitation (e.g., Bonnel et al., 2003), sentence elicitation (Diehl & Paul, 2013), and spontaneous prosodic production (Diehl, Watson, Bennetto, McDonough, & Gunlogson, 2009; but see Quigley et al., 2016 for contrary findings). A recent systematic review and meta-analysis of 34 empirical studies of vocal production in ASD calculated a moderate effect size (Cohen's d of 0.4–0.5), but with a discriminatory accuracy of only about 61–64% (Fusaroli, Bang, Bowler, & Gaigg, 2017). Acoustic measures other than the mean and variance of pitch (e.g., duration and intensity) have also been examined, but they were not found to be stable predictors of speech produced by individuals with ASD (Fusaroli et al., 2017). Efforts to link pitch production with the severity of ASD have yielded highly inconclusive results. A computerized task, the Profiling Elements of Prosodic Systems in Children (PEPS-C), developed to assess prosody perception and production in children aged 4–16 years, has been used in a number of studies. The general finding is that children with ASD have atypical prosodic output (e.g., Diehl et al., 2009; McCann, Peppé, Gibbon, O'Hare, & Rutherford, 2007), but that their prosodic output also shows considerable similarity to that of their typically developing (TD) peers and of children with other types of disorders. This was especially the case in simple tasks, for example when models or prompts were provided and/or no face-to-face spontaneous interaction was required (e.g., Diehl & Paul, 2013). As Fusaroli et al. (2017) pointed out in their review, among the five studies that have examined the correlation between pitch measures and the severity of ASD symptoms, both the strength and the direction of the relation varied across studies and pitch parameters.
For example, Bone and colleagues analyzed speech segments of spontaneous interaction between a child and a psychologist during a standard observational evaluation session using the ADOS and found that autism severity score was negatively correlated with the median pitch slope at the turn end, but not with the pitch center or pitch slope variability (Bone et al., 2014). Nadig and Shaw (2012), analyzing spontaneous conversation samples, found no correlation between pitch range and the behavioral characteristics of the children examined (IQ, language, and autism severity scores). Nakai and colleagues likewise reported no correlation between the pitch coefficient of variation and the total score on the Autism Screening Questionnaire (Nakai, Takashima, Takiguchi, & Takada, 2014). Diehl et al. (2009) tested two groups of children using similar narrative elicitation tasks. There was a positive correlation between clinicians’ judgments of ASD severity and the variance of pitch production in the older children and teenagers (Study 1, 10–18 years old), but no such correlation in the younger children (Study 2, 6–14 years old). It is possible that differences in measurement and methods (e.g., unprompted test conditions) among these studies led to the different outcomes.

So far, pitch measures show promise for distinguishing individuals with and without ASD, but there is no evidence that they reflect the severity of the disorder. We are certainly not ready to use prosodic measures as a tool to measure the severity of ASD. Several steps need to be taken first. In particular, it will be important to develop tasks that consistently and robustly elicit prosodic differences between individuals with
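The pitch measures recurring in these studies (mean F0, pitch range, and the coefficient of variation used by Nakai et al.) are simple summary statistics of an F0 track. A minimal sketch, assuming F0 values have already been extracted from voiced frames by a pitch tracker; the contour values here are invented:

```python
import statistics

def pitch_summary(f0_hz):
    """Summary measures of the kind used across these studies: mean pitch,
    pitch range, and the coefficient of variation (SD / mean) of F0.
    `f0_hz` is a list of F0 samples in Hz from voiced frames only."""
    mean_f0 = statistics.fmean(f0_hz)
    return {
        "mean_f0": mean_f0,
        "range_f0": max(f0_hz) - min(f0_hz),
        "cv_f0": statistics.stdev(f0_hz) / mean_f0,  # Nakai et al.'s measure
    }

# Toy contour (Hz); real work would extract F0 with a tracker such as Praat.
contour = [210, 220, 235, 250, 240, 225, 215]
print(pitch_summary(contour))
```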


Y. H. Yu and V. L. Shafer

ASD and controls. Additional research needs to be undertaken with individuals who have language impairment, as well as ASD.

13.3.2 Atypical Prosody Production in Chinese Speakers with ASD

Tonal languages, such as Mandarin and Cantonese, use pitch both as a phonemic contrast at the lexical level and as a suprasegmental cue for intonation (as in non-tonal languages). The “contour interaction” theory (Thorsen, 1980; Vaissière, 1983) and the “tone sequence” theory (Pierrehumbert, 1980) both posit that lexical tone, stress, and intonation are closely interwoven in the final suprasegmental pitch output. Both prosodic pitch patterns and lexical tone are realized via F0. F0 modulation for lexical tone, however, distinguishes meaning at the level of the morpheme, whereas F0 modulation for prosody conveys meaning at the sentence/discourse level (e.g., pragmatics). F0 can also be used for lexical stress, which signals the prominence of a syllable in a multisyllabic word or phrase; in tonal languages, lexical stress is superimposed on the lexical tone. Sentence-level prosodic patterns (“sentence-level tunes”), including contrastive stress and sentential intonation, can convey modality (i.e., illocutionary force, e.g., assertive vs. interrogative). Chao (1968) described the resulting competition for F0 space between lexical tone and prosody as “small ripples riding on larger waves” (p. 39): the sentence-level tunes are the “larger waves,” while lexical tone at the syllable level and lexical stress of multisyllabic morphemes are the “small ripples.” See Yu, Wang, and Li’s chapter (Chap. 5) of this book for the neural mechanisms of lexical tone processing in healthy adults, Lee and Cheng’s chapter (Chap. 6) for the neural development of lexical tone processing in early childhood, and Tsao and Liu’s chapter (Chap. 10) for a review of lexical tone perception in infancy.
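Chao's metaphor can be visualized with a toy superposition model: a slowly declining sentence-level F0 "wave" plus a faster syllable-level tone "ripple." All parameter values below are invented for illustration and are not fitted to Mandarin or Cantonese data:

```python
import math

def f0_output(t, phrase_start=250.0, phrase_end=180.0, duration=1.0,
              tone_ripple=20.0, tone_rate=4.0):
    """Toy illustration of Chao's 'small ripples riding on larger waves':
    a declining sentence-level F0 contour with a faster syllable-level
    lexical-tone ripple superimposed. All parameters are invented."""
    declination = phrase_start + (phrase_end - phrase_start) * (t / duration)
    ripple = tone_ripple * math.sin(2 * math.pi * tone_rate * t)
    return declination + ripple

samples = [f0_output(i / 100) for i in range(101)]
print(f"F0 spans {min(samples):.0f}-{max(samples):.0f} Hz over the phrase")
```

The combined contour exceeds both the declination line and the ripple alone, which is the point of the metaphor: tone and intonation share, and compete for, the same F0 space.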
The question relevant to this review is how tonal language speakers with ASD convey multiple levels of pitch information, and whether their pitch output at the syllable, morpheme, word, phrase, and sentence levels is similar to that of healthy controls. To the best of our knowledge, Chan and To (2016) is the only study that has examined the acoustic features of prosodic output in Chinese speakers with ASD. Chan and To (2016) focused on the use of sentence-final particles (SFPs) and expressive intonation in Cantonese-speaking adults. Cantonese SFPs are bound morphemes that play a role similar to that of prosodic patterns in other languages; they convey grammatical, pragmatic, and affective meaning. The authors proposed that individuals with high-functioning ASD (HFA) might show difficulties in mastering the use of SFPs and intonation, given their known deficits in decoding pragmatic and affective cues, and that these two skills might interact with each other and work in a compensatory fashion. Speech samples were generated from 38 young adults (HFA group: n = 19; control group: n = 19) using spontaneous story retelling. The pitch variance


was measured using the sentence as the unit of analysis. Higher average pitch and larger pitch variations were found in the HFA group than in the control group. The two groups were comparable in the total frequency of SFP use, but the HFA group produced slightly fewer SFP types on average than the control group (p = 0.072). The correlations between pitch and SFP measures were all nonsignificant, with the exception of a moderate positive correlation between the number of SFP types and pitch variability in the HFA group only. At the individual level, some individuals with HFA showed pitch patterns similar to those of their healthy control counterparts. Overall, the general patterns of atypical prosody production in Cantonese speakers with HFA were very similar to those of non-tonal language speakers with HFA, which suggests that prosody impairment may be language-independent.

13.4 Prosody Perception in Children with ASD

13.4.1 Infant Development

Prosodic cues are critical in helping infants segment running speech input into linguistically meaningful units (e.g., syllables, words, phrases). Language-specific prosody perception and processing develop during early infancy (Bosch & Sebastián-Gallés, 1997; Friedrich, Herold, & Friederici, 2009; Jusczyk & Aslin, 1995; Sambeth, Ruohio, Alku, Fellman, & Huotilainen, 2008; Shafer, Jaeger, & Shucard, 1999; Stefanics et al., 2009; see Chap. 10 by Tsao & Liu in this book for a review of lexical tone development). For example, a language-specific preference for words with trochaic structure, the predominant English stress pattern, was observed in English-learning infants between 6 and 9 months of age (Jusczyk, Cutler, & Redanz, 1993) and in German-learning infants between 4 and 6 months of age, but not in French-learning 6-month-olds (Höhle, Bijeljac-Babic, Herold, Weissenborn, & Nazzi, 2009). In addition, neural responses in English-exposed three-month-old infants showed that they processed Dutch stories, but not Italian stories, in a fashion similar to English stories (Shafer et al., 1999). These findings are presumably due to the fact that Germanic languages (e.g., English, German, and Dutch) are stress-timed with trochaic predominance, whereas Romance languages (e.g., French and Italian) are syllable-timed and favor iambic stress patterns. The infants in Shafer et al. (1999) also showed evidence of distinguishing the greater pitch range of the Dutch stories compared with the English ones, indicating early sensitivity to small differences in the melody of speech. Typically developing infants demonstrate an intrinsic preference for prosodically rich child-directed speech (e.g., Vouloumanos & Werker, 2007).
Infants’ early preference for the higher pitch and exaggerated prosody of child-directed speech assists early socio-communicative learning (Kuhl, Coffey-Corina, Padden, & Dawson, 2005), and facilitates and predicts later language development (Thiessen, Hill, & Saffran,


2005). Furthermore, preference for speech correlated positively with general cognitive ability at 12 months, and a weaker preference for speech over non-speech sounds correlated with more autistic-like behavior in infants who had siblings with diagnosed ASD (Curtin & Vouloumanos, 2013). Multiple studies have reported that young children with ASD show a less robust preference for child-directed speech than their age-matched TD peers (e.g., Paul, Chawarska, Fowler, Cicchetti, & Volkmar, 2007; see Filipe, Watson, Vicenta, & Frota, 2017 for a review). Thus, deviant patterns of prosodic perception and processing in the first few years of life could serve as a risk marker for ASD. More prospective studies of children at risk for ASD will need to be carried out to explore this possibility further.

13.4.2 Prosodic Perception in Older Children and Adults with ASD

Mixed evidence has been reported regarding acoustic tone perception and speech (word, phrase, and sentence) prosodic perception abilities in older children and adults with ASD. The majority of studies on acoustic tone perception have reported intact perception in children with ASD (Grossman & Tager-Flusberg, 2012; Paul et al., 2005). For example, individuals with ASD have shown superior pitch processing skills on a variety of psychophysical measures (e.g., Bonnel et al., 2003, 2010; Stanutz et al., 2014), superior detection of pitch direction over small intervals (Heaton, 2003, 2005; Heaton et al., 1999), and superior melodic contour identification (Järvinen-Pasley, Peppé, King-Smith, & Heaton, 2008). Contradictory findings include inferior pure tone discrimination when the reference tone was varied across trials (e.g., Boets, Verhoeven, Wouters, & Steyaert, 2015); see Haesen, Boets, and Wagemans (2011) for an extensive review. Many studies of intonation, in contrast, revealed that children with ASD have deficits in intonation perception and/or production, especially at the sentence level. For example, Järvinen-Pasley and colleagues reported that children with ASD have unimpaired perception of word-level intonation but deficits in understanding sentence-level intonation (e.g., Järvinen-Pasley et al., 2008). McCann and colleagues reported that most children with HFA have either expressive or receptive prosody deficits as measured by tasks assessing affect-related prosody at the single-word level, phrase-level stress, and sentence intonation (McCann et al., 2007).
Prosodic production deficits have also been reported in other studies of children with HFA (e.g., Diehl & Paul, 2013); furthermore, there is evidence that adults with HFA have difficulties using prosodic cues to extract information about mood and emotion (e.g., Rutherford, Baron-Cohen, & Wheelwright, 2002). A brain imaging study by Wang et al. (2006) showed that, unlike TD controls, school-aged children with ASD were not only less accurate at interpreting prosodic cues for irony, but also showed aberrant brain activation patterns, including absent activity in the medial prefrontal cortex and greater activation in the right inferior frontal gyrus and bilateral superior temporal sulcus (STS) regions.


The evidence so far suggests that both the function of the pitch information (linguistic vs. non-linguistic) and the size of the prosodic unit (e.g., word level vs. sentence level) influence performance in these populations.

13.4.3 Prosody and Lexical Tone Perception in Chinese-Speaking Individuals with ASD: Behavioral Research

Crosslinguistic studies have suggested that extensive experience with a native tonal language attunes the perception of pitch contour in language processing (Burnham & Francis, 1997; Gandour, 1983; Gandour & Harshman, 1978; Stevens, Keller, & Tyler, 2013; Wayland & Guion, 2004; Xu, Krishnan, & Gandour, 2006b) and music processing (Alexander, Bradlow, Ashley, & Wong, 2011; Bidelman, Hutka, & Moreno, 2013; Stevens et al., 2013; see Chap. 8 of this book by Ong and colleagues for a review). For example, Mandarin listeners perceive the subtle acoustic differences among Mandarin tonal categories better than non-Mandarin speakers do (Hallé, Chang, & Best, 2004; Leather, 1983; Lee, Vakoch, & Wurm, 1996). Cantonese speakers can discriminate Mandarin lexical tones better than English speakers (Lee et al., 1996). Native tonal (i.e., Thai) language speakers were faster and more accurate at discriminating pitch contour in both natural speech and musical contour tasks (Stevens et al., 2013). Mandarin listeners outperformed English listeners when discriminating Thai lexical tones after training, under both short and long interstimulus interval conditions (Wayland & Guion, 2004). Furthermore, using a multidimensional scaling method, Gandour and colleagues found a difference in perceptual dimension weighting between Chinese listeners and native English listeners when processing Mandarin tones: native English listeners tended to rely on pitch height, while Chinese (Cantonese and Mandarin) listeners attended to both pitch height and pitch direction (Gandour, 1984; Gandour & Harshman, 1978).
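Gandour's two perceptual dimensions can be crudely operationalized as contour summary features. The sketch below is our own illustration (with invented contour values): "height" is computed as the mean F0 and "direction" as the net onset-to-offset F0 change.

```python
def tone_features(f0_hz):
    """Two features echoing Gandour's perceptual dimensions, computed from an
    F0 contour (Hz): 'height' as the contour mean and 'direction' as the net
    slope from onset to offset. A crude operationalization for illustration."""
    height = sum(f0_hz) / len(f0_hz)
    direction = f0_hz[-1] - f0_hz[0]  # >0 rising, <0 falling, ~0 level
    return height, direction

tone1 = [220, 221, 220, 219, 220]   # high level contour (invented values)
tone2 = [190, 195, 205, 218, 230]   # rising contour (invented values)
for name, contour in [("Tone 1", tone1), ("Tone 2", tone2)]:
    h, d = tone_features(contour)
    print(f"{name}: height={h:.0f} Hz, direction={d:+.0f} Hz")
```

On Gandour's account, English listeners would weight the first feature heavily, while Cantonese and Mandarin listeners would attend to both.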
Whether Chinese individuals with ASD demonstrate the same superior attunement to lexical tone and musical prosody relative to non-tonal language speakers with and without ASD remains an open question, owing to the lack of cross-language research. There are only five behavioral speech perception studies on Chinese children with ASD: one focused on lexical tone, one on intonation cues, one on musical perception, and the remaining two on sentential prosody. These studies revealed differences between children with ASD and children with typical development in processing linguistic and non-linguistic stimuli, and the findings were generally consistent with studies of prosodic perception in children learning non-tonal languages. The highlights of each study follow.

Chen et al. (2016) examined lexical tone (Tone 1 and Tone 2) identification and discrimination of the Mandarin syllable /i/ in eleven 6- to 8-year-old boys with ASD.


The stimuli were drawn from an 11-step lexical tone continuum ranging between a high-level tone (Tone 1) and a low-rising tone (Tone 2). The children with ASD had an average language level of 3 years and 6 months. Compared with age-matched typically developing controls, they exhibited much lower discrimination accuracy and a broader category identification boundary. Furthermore, a strong negative correlation between boundary width and developmental language age was found in the children with ASD.

Li, Law, Lam, and To (2013) examined how Cantonese-speaking children with ASD used sentential prosodic cues and sentence-final particles (SFPs) in ironic stories to judge a speaker’s belief and intent. Ironic expression is usually achieved via a slow speaking rate, larger pitch variation, and greater intensity (Cutler & Bruck, 1974). As discussed above, some believe that SFPs and intonation have a trading relation with each other in conveying sentential connotation (Kwok, 1984; Yau, 1980). Li et al. (2013) tested children with and without ASD between the ages of 8.3 and 12.9 years using 16 ironic stories and 5 complementary stories. The two groups demonstrated similar levels of comprehension when sentences contained neither prosodic nor SFP cues, and participants in both groups answered questions about the factual content of the stories with similar accuracy. A large group difference was observed for sentences with prosodic cues only, sentences with SFP cues only, and sentences with both prosodic and SFP cues: the Cantonese-speaking children with ASD failed to exploit either prosodic or SFP cues, similarly to English-speaking children with ASD (Happé, 1993). Note that the Cantonese-speaking TD children answered the questions with only slightly above-chance accuracy under the prosody-only condition, suggesting that this condition is challenging even for TD children.
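The boundary-width measure in Chen et al. (2016) is typically derived from a logistic identification curve fitted over the continuum: the boundary is the 50% crossover, and the width is the distance between the 25% and 75% points, which shrinks as the curve steepens. A sketch with hypothetical slope values (not the study's actual fits):

```python
import math

def logistic(x, boundary, slope):
    """P(respond 'Tone 2') at continuum step x."""
    return 1.0 / (1.0 + math.exp(-slope * (x - boundary)))

def boundary_width(slope):
    """Distance between the 25% and 75% identification points: 2*ln(3)/slope.
    A broader boundary (the ASD finding) corresponds to a shallower slope."""
    return 2 * math.log(3) / slope

# Hypothetical slopes for an 11-step continuum; steeper = sharper category edge.
for group, slope in [("TD", 2.0), ("ASD", 0.7)]:
    p25 = logistic(6.0 - boundary_width(slope) / 2, boundary=6.0, slope=slope)
    print(f"{group}: width = {boundary_width(slope):.2f} steps "
          f"(p at lower edge = {p25:.2f})")
```

With these invented slopes, the shallower "ASD" curve yields a boundary roughly three times wider than the "TD" one, mirroring the qualitative pattern reported in the study.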
Jiang, Liu, Wan, and Jiang (2015) investigated the discrimination and identification of musical and linguistic pitch contours in 17 Mandarin-speaking individuals (ages 6.0–16.2 years) with high-functioning autism and 17 control children and adolescents matched on age, nonverbal IQ, and years of music training. They used a five-tone sequence for melodic contour discrimination and identification, and disyllabic verb–object constructions as stimuli for the speech intonation discrimination and identification tasks. Participants were asked to match the auditory sequence with a visual display of the melodic contour of the music, or to identify whether the disyllabic verb–object construction was a statement or a question. The ASD group performed worse than the control group on Mandarin intonation discrimination and identification, but better than the control group on the melodic contour identification task; the two groups showed similar performance on the melodic contour discrimination task. Jiang and colleagues suggested that linguistic pitch may not be processed in the same way as musical pitch in Chinese-speaking individuals with ASD.

Mandarin question words such as “what” (什么) and “who” (谁) can convey either statements or questions depending on their prosodic features (rising vs. falling sentence-final intonation) and/or the semantic structure (e.g., adding the Mandarin universal quantifier “dou”/都, meaning “all”). Su, Jin, Wan, Zhang, and Su (2014) tested 28 children (14 four- to eight-year-olds; 14 nine-


to fourteen-year-olds) with high-functioning ASD and 28 age-matched TD controls using a computerized sentence comprehension task in which statements and questions had to be identified from either prosodic cues or semantic cues. They found that the older children with ASD performed on par with their TD peers, using either prosodic cues in ambiguous sentences or semantic cues in unambiguous sentences. However, the younger children with ASD performed more poorly than the TD controls in processing statement sentences under all structure conditions (prosody, semantic, and control structures). This highlights a developmental delay in children with ASD in comprehending statement sentences containing wh-words. The findings of this study, together with several other studies on English-speaking individuals with ASD, support the claim that grammatical prosody is relatively spared in children with ASD compared with affective or pragmatic prosody.

Wang and Tsao (2015) did not aim to explore tone language-specific processing in Mandarin-speaking children with ASD; rather, they aimed to examine “emotional prosody” perception in this group. Because the data were collected from tonal language-learning children, however, they can potentially reflect an interaction between tonal language experience and emotional prosody processing. Moreover, this is the only study so far to have examined emotional prosody processing in tonal language-learning children with ASD. For both reasons, we included the study in this review. Wang and Tsao (2015) used three emotional tones (happy, sad, and angry) and a neutral tone, presented in words and short sentences. Twenty-five boys with high-functioning autism (HFA) and 25 TD boys between 6 and 11 years of age were tested using an emotional prosody identification task.
The study found that children with HFA performed more poorly than TD children in identifying prosodic patterns associated with “happy,” but that they did not differ in identifying prosodic patterns associated with “sad” or “angry.” This was true regardless of whether the semantic condition of the stimulus (words or sentences) was neutral or emotionally relevant. Correlation analyses revealed a strong positive association between perception accuracy of happy prosody and the pragmatic language skills and social adaptation skills of children with ASD.

13.4.4 Neural Indices of Lexical Tone Processing in Non-tonal ASD

Event-related potentials (ERPs), recorded using electroencephalography (EEG), and event-related fields (ERFs), recorded via magnetoencephalography (MEG), are the measures most often used to examine auditory processing of pure tones and speech. Many such ERP/ERF studies have adopted an oddball paradigm in which repetitions of one sound pattern are interspersed with an infrequent sound pattern. The long-latency obligatory sequence of ERP/ERF peaks, P1-N1-P2, is thought to reflect the brain’s response to the physical features of the stimuli at the scalp level, with multiple neural generators in the primary and secondary auditory cortex (Näätänen & Picton, 1987; Ponton,


Eggermont, Khosla, Kwong, & Don, 2002; Scherg & Von Cramon, 1986). The P1 to auditory events is observed at frontocentral sites in early childhood, and P1 latency shifts earlier as the brain matures (Čeponienė, Rinne, & Näätänen, 2002; Choudhury & Benasich, 2011; Kushnerenko, Čeponienė, Balan, Fellman, & Näätänen, 2002; Morr, Shafer, Kreuzer, & Kurtzberg, 2002; Shafer, Yu, & Datta, 2010; see Sharma, Glick, Deeves, & Duncan, 2015 for a review). In young children, N1 and P2 are not always apparent for auditory stimuli presented at rates faster than about one per second (Ponton, Eggermont, Kwong, & Don, 2000), and the P1-N1-P2 complex does not reach full maturity until the later adolescent years (Ponton et al., 2000). Bishop, Hardiman, Uwer, and von Suchodoletz (2007) reported that P1 amplitude appears to be larger, and to peak earlier, for speech than for pure tones in children under 11 years of age.

Mismatch negativity (MMN) serves as an index of automatic, preattentive cortical discrimination of an auditory contrast. MMN is largest at frontal sites and is best seen by subtracting the response to a frequent stimulus/pattern from the response to an infrequent stimulus/pattern. MMN is larger for greater physical (acoustic) differences between two stimuli and is often larger for speech sounds that cross a phoneme boundary than for speech sounds that fall within the same phoneme category (Näätänen, Paavilainen, Rinne, & Alho, 2007). Significant maturational changes have also been documented in the presence, amplitude, and latency of the MMN in TD children with non-tonal language backgrounds (Friederici, Friedrich, & Weber, 2002; Kushnerenko et al., 2007; Leppänen et al., 2002; Morr et al., 2002; Shafer et al., 2010; Shafer, Yu, & Datta, 2011) and tonal language backgrounds (Cheng et al., 2013, 2015; Liu, Chen, & Tsao, 2014). In particular, infants and young children often show a positive mismatch response (pMMR) rather than, or in addition to, the MMN (Shafer et al., 2010).
The pMMR may reflect greater recovery from refractoriness (because the deviant stimulus is less frequent). The P3a response is an index of an involuntary attention switch elicited by a salient or rare stimulus change. Its latency is later than that of the MMN and shifts progressively earlier from early toddlerhood, stabilizing at around 12 years of age (Fuchigami et al., 1995); however, its scalp distribution does not match that of adults until late adolescence (Määttä et al., 2005).

The first few ERP studies to examine auditory processing in children with ASD suggested superior processing of non-speech stimuli compared with typically developing controls. Oades, Walker, Geffen, and Stern (1988) tested seven children with ASD and found shorter N1 latency and larger N1 amplitude to a pure tone contrast in the ASD group than in the TD controls. Ferri et al. (2003) also found that children with low-functioning autism (LFA) showed earlier and enhanced N1 peaks to pure tone contrasts (1000 Hz vs. 1300 Hz). Enhanced MMN and/or P3a have also been evident in children with ASD for auditory tones (Ferri et al., 2003; Kujala et al., 2007). Gomot et al. (2008), using a combination of behavioral discrimination and fMRI measures, found that children with Asperger’s syndrome were hyperreactive to sound, as indicated by faster discrimination reaction times with similar response accuracy, and showed stronger activation in the prefrontal and inferior parietal cortices during complex tone discrimination.
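The MMN difference-wave computation described above (the averaged response to the infrequent deviant minus the averaged response to the frequent standard) can be sketched on synthetic data. Every amplitude, latency, and trial count below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 500                            # sampling rate (Hz)
t = np.arange(-0.1, 0.5, 1 / fs)    # epoch: -100 to 500 ms

def simulate_erp(n_trials, mmn_amp):
    """Averaged ERP from simulated trials: a negativity peaking near 200 ms
    plus Gaussian noise. Purely synthetic values for illustration."""
    component = mmn_amp * np.exp(-((t - 0.2) ** 2) / (2 * 0.03 ** 2))
    trials = component + rng.normal(0, 2.0, (n_trials, t.size))
    return trials.mean(axis=0)

standard = simulate_erp(800, mmn_amp=0.0)
deviant = simulate_erp(150, mmn_amp=-3.0)   # deviants are rare in an oddball
mmn = deviant - standard                    # the difference wave
peak_idx = mmn.argmin()
print(f"MMN peak: {mmn[peak_idx]:.1f} uV at {t[peak_idx] * 1000:.0f} ms")
```

Averaging over many more standards than deviants, as here, is why the deviant average is noisier; real pipelines also filter, baseline-correct, and reject artifacts before subtracting.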


Čeponienė et al. (2003) found that pitch encoding and discrimination were similar in children with ASD and a TD group, as measured using the P1-N2-P2-N4 peaks, MMN, and P3a, although they did observe a marginally smaller P1 amplitude in the ASD group. In contrast, other studies have found delayed and diminished cortical responses to auditory and speech stimuli in children with ASD. School-aged children with ASD demonstrated delayed N1 latency to pure tones compared with age-matched healthy controls (longer N1c latencies; Bruneau, Bonnet-Brilhault, Gomot, Adrien, & Barthélémy, 2003; Bruneau, Roux, Adrien, & Barthélémy, 1999). Another study revealed diminished and delayed P1 responses to pure tone contrasts (Jansson-Verkasalo et al., 2003). Diminished P1, N2, P3, and N4 have also been reported for vowel contrasts in children with ASD (Whitehouse & Bishop, 2008). A number of studies have reported that individuals with ASD have delayed and/or diminished MMN and P3a responses to phonemic change (Lepistö et al., 2005, 2006). We have also observed a delayed P1 peak to auditory words in a picture–word priming paradigm in minimally verbal 3- to 7-year-old children with ASD compared with age-matched controls (Cantiani et al., 2016). The conflicting findings across studies may be due to experimental factors, such as the stimuli and tasks used, and to the heterogeneous nature of the ASD population in general.

13.4.5 Neurophysiological Measures of Pitch Processing in Chinese-Speaking Individuals with ASD

The pursuit of understanding how the autistic brain processes native lexical tone has only just begun: currently, only three studies, all from the same research group, have been undertaken (Huang et al., 2017; Wang, Wang, Fan, Huang, & Zhang, 2017; Yu et al., 2015). These studies suggest that children with ASD have greater difficulty processing speech than non-speech information, but that certain types of cues (e.g., duration) may be spared. The highlights of the three studies follow.

Yu et al. (2015) were the first to examine whether enhanced lower-level perceptual processing of features such as pitch variation would hinder the processing of higher-level phonemic units, i.e., the lexical tone categories. In their first oddball experiment, three types of stimuli were used: a simple pure tone contrast (standard 216 Hz vs. deviant 299 Hz), a lexical tone contrast (standard /bai2/ vs. deviant /bai4/), and a nonword condition (standard /rai2/ vs. deviant /rai4/). The ASD group had larger MMN amplitudes than the TD control group at the vertex site (Cz) and smaller MMNs than the TD group at the frontal site (Fz) for the lexical tone contrast /bai2/-/bai4/. MMN was present in the TD group for all contrast types, but in the ASD group MMN was absent for the nonword /rai2/-/rai4/ contrast at both Fz and Cz. In contrast to their TD peers, the ASD group showed enhanced P3a amplitude for the pure tone contrast. To further probe the influence of lexicality, the children were also tested using hummed speech. Results demonstrated that the ASD group had larger MMN amplitude than the controls at Cz but not at


Fz. The ASD group also showed larger P3a amplitude at Cz, and the TD group showed a tendency toward shorter P3a peak latencies at Fz compared with the ASD group. The authors speculated that the reduced neural sensitivity in lexical tone processing was probably due to inadequate suppression of irrelevant within-category pitch differences.

The account of speech-specific deficits in lexical tone processing in autism proposed by Yu et al. (2015) was further supported by the findings of Wang et al. (2017), who used rigorously controlled synthetic speech and non-speech contrasts. The speech condition consisted of three acoustically equidistant stimuli from a nine-point continuum /ba2/ (step 1)-/ba4/ (step 5)-/ba4/ (step 9), yielding a between-category contrast (steps 1 and 5) and a within-category contrast (steps 5 and 9). The non-speech condition consisted of three complex stimuli that matched the speech stimuli on all acoustic parameters except harmonic composition. Using a passive listening oddball paradigm, the study found that the TD controls showed larger MMN for between-category than for within-category contrasts, whereas the ASD group had equal MMNs for the two comparisons, indicating a lack of categorical perception in the children with ASD. No significant P3a was observed under any condition for either group. Results from time–frequency analysis provided further evidence of group differences: the two groups demonstrated similar phase locking to the harmonic (non-speech) stimuli, but in the lexical tone condition only the TD group showed a significant inter-trial phase coherence (ITPC) difference in the theta band between the MMNs for within- versus cross-category contrasts. This evidence further suggests that children with ASD do not perceive lexical tone categorically.

Duration of speech segments can also serve to distinguish meaning in some languages (e.g., Finnish and Japanese).
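Inter-trial phase coherence, the time–frequency measure used by Wang et al. (2017), quantifies how consistently the phase of an oscillation lines up across trials at a given frequency: 1.0 means perfect phase locking, values near 0 mean random phase. A self-contained sketch on synthetic trials (all signal parameters invented; real analyses would use wavelet or Hilbert decompositions of EEG epochs):

```python
import numpy as np

def itpc(trials, fs, freq):
    """Inter-trial phase coherence at one frequency: project each trial onto
    a complex sinusoid, keep only the phase, and average the unit phasors
    across trials."""
    n = trials.shape[1]
    t = np.arange(n) / fs
    kernel = np.exp(-2j * np.pi * freq * t)
    coeffs = trials @ kernel                 # one complex value per trial
    phasors = coeffs / np.abs(coeffs)        # discard amplitude, keep phase
    return np.abs(phasors.mean())

rng = np.random.default_rng(1)
fs, f_theta = 250, 5.0                       # a theta-band frequency (Hz)
t = np.arange(0, 1, 1 / fs)
# Phase-locked trials: same theta phase every trial, plus noise.
locked = np.sin(2 * np.pi * f_theta * t) + rng.normal(0, 1, (60, t.size))
# Jittered trials: random theta phase on every trial, plus noise.
jittered = np.stack([np.sin(2 * np.pi * f_theta * t + p)
                     for p in rng.uniform(0, 2 * np.pi, 60)])
jittered = jittered + rng.normal(0, 1, (60, t.size))
print(f"phase-locked ITPC = {itpc(locked, fs, f_theta):.2f}, "
      f"jittered ITPC = {itpc(jittered, fs, f_theta):.2f}")
```

The within- versus cross-category contrast in Wang et al. (2017) amounts to comparing such theta-band ITPC values between conditions, which only the TD group differentiated.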
Behavioral and neurological evidence indicates that individuals with ASD have deficits in processing small durational differences in auditory and speech contrasts (Brodeur, Gordon Green, Flores, & Burack, 2014; Falter, Noreika, Wearden, & Bailey, 2012; Lambrechts, Falter-Wagner, & van Wassenhove, 2017; Lepistö et al., 2005; Maister & Plaisted-Grant, 2011; Szelag, Kowalska, Galkowski, & Pöppel, 2004). To ask whether individuals with tonal language backgrounds also show deficits in temporal processing, Huang et al. (2017) used both pure tones (295 Hz) and nonsense syllables (/tý/) of two durations (250 ms vs. 350 ms) in a passive oddball paradigm, and compared the neurophysiological responses to these duration changes in school-aged children with ASD and their TD peers. A delayed and diminished MMN peak was found in the ASD group relative to the TD control group for the pure tone condition only; in contrast, a delayed and diminished P3a peak was evident in the ASD group for the speech condition only. It is not entirely clear how to interpret this finding, considering that the results for the pure tones differed from those for the syllables. Clearly, additional research is necessary to fully understand how native language experience modulates auditory processing in children with ASD.


13.5 Treatment Study of Children with ASD Using Transcranial Direct Current Stimulation: A Feasibility Pilot Study

Transcranial direct current stimulation (tDCS) is a noninvasive technique in which constant low-intensity electrical current is applied to the scalp. It is well established in animal models that this type of stimulation can alter the threshold, rate, and balance of excitation and inhibition of neurons, and can therefore modulate brain function both in vivo and in vitro (Bikson et al., 2004; Bindman, Lippold, & Redfearn, 1964; Chan, Hounsgaard, & Nicholson, 1988; Purpura & McMurtry, 1965; Rahman, Toshev, & Bikson, 2014; see Reato, Rahman, Bikson, & Parra, 2013 for a review). In the past decade, numerous studies have reported that noninvasive brain stimulation (NIBS) techniques such as transcranial magnetic stimulation (TMS) and tDCS can facilitate language recovery in patients with aphasia (see Norise & Hamilton, 2016 for a review). Whether language-related plasticity in the brains of children with ASD can be modulated by noninvasive brain stimulation is an emerging area of research. The available literature on the use of tDCS in ASD is preliminary, consisting of studies with methodological limitations (see Jacobson, Koslowsky, & Lavidor, 2012 for a review), but some of the results are promising. A treatment of 20 NIBS sessions was found to improve social and behavioral scales in children with ASD, with effects lasting six months (Gómez et al., 2017). A single session of anodal tDCS has been found to increase peak alpha frequency at the stimulation site and to decrease autistic-like behavioral symptoms (1 mA, 20 min; Amatachaya et al., 2015), and to increase syntax comprehension in children with ASD (2 mA, 30 min; Schneider & Hopp, 2011).
In a randomized controlled trial, a single session of tDCS (1.5 mA, 15 min) increased inhibition accuracy in children with attention deficit hyperactivity disorder (Soltaninejad, Nejati, & Ekhtiari, 2015). As reviewed above, Mandarin-learning children with ASD show atypical cortical oscillations, and they may have deficits in inhibiting neural sensitivity to within-category lexical tone variation (Wang et al., 2017). We are interested in whether the sensory and cognitive functions associated with aberrant excitatory and inhibitory neuronal activity can be modulated by tDCS. If tDCS can enhance the inhibitory neural activity associated with within-category speech processing, then this treatment could enhance the categorization of speech that varies at the within-category level. The following pilot study was designed to test this hypothesis.


Y. H. Yu and V. L. Shafer

13.5.1 Materials and Methods

Participants

Data were obtained from two children with ASD (both male; 8.1 and 9.8 years old) and two typically developing (TD) control children (both male; 9.11 and 10.7 years old). A 10.6-year-old nonverbal child with ASD was recruited but could not be tested due to lack of adequate compliance and was therefore excluded from the study. All four children were from the same neighborhood in the metropolitan New York City area and from families in which both parents were native Mandarin speakers. Language background questionnaires completed by the parents indicated that all four children had had consistent Mandarin exposure from both parents since birth. According to parental reports and the background questionnaires, all four children understood daily conversations in Mandarin and could carry out simple conversations about daily routines in Mandarin. All four also had consistent English exposure via school, since English was the language of instruction for all of them. According to parental reports, the two children with ASD were more dominant in Mandarin, while the two TD children were more dominant in English at the time of testing. Both children with ASD were diagnosed by certified developmental psychologists around 3 years of age, had individual educational plans (IEPs), and were receiving special education via the public school system due to their ASD diagnoses. Their diagnoses were also validated by the school's special education teacher, as well as by an experienced speech-language pathologist (the first author). Informed consent from each parent and verbal assent from each child were obtained following the protocol approved by the local institutional review board.

Event-Related Potential Procedures

Stimuli Disyllabic nonword stimuli contrasting Tone 2 and Tone 3 were used in an oddball paradigm.
The frequent/standard stimuli were three tokens of /gu3pa1/, and the infrequent/deviant stimuli were two tokens of /gu2pa1/. Multiple tokens of the same lexical category were used to encourage between-category rather than within-category processing: the tokens varied in irrelevant acoustic detail, so only the relevant tone difference could be used to correctly categorize the stimuli. These stimuli were used in Yu, Shafer, and Sussman (2017), in which native Mandarin-speaking adults showed larger MMN responses to these lexical tone differences than English-speaking adults did. A total of 165 deviant (20%) and 645 standard (80%) stimuli were presented in 15 blocks with an average interstimulus interval of 675 ms (645–709 ms). Blocks were separated by 10-s breaks.

ERP recording The electroencephalogram (EEG) was time-locked to stimulus onset and recorded using a 65-channel sensor net at a sampling rate of 500 Hz with a band-pass filter of 0.1–100 Hz. Two sessions of ERP recordings were collected, one before the tDCS procedure and another shortly after it during the same laboratory visit. The data were filtered with a band-pass filter of 0.3–15 Hz and segmented from 200 ms before stimulus onset to 700 ms poststimulus. Artifact rejection, baseline correction, and average re-referencing were performed in BESA 6.1. All children contributed at least 100 trials in the deviant condition.
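As a rough illustration of the stimulus schedule just described (165 deviants among 810 trials, 15 blocks, interstimulus interval jittered between 645 and 709 ms), a presentation sequence could be generated as in the sketch below. This is not the authors' actual presentation script; the shuffling scheme and seed are assumptions for illustration.

```python
import random

def make_oddball_sequence(n_deviants=165, n_standards=645, n_blocks=15, seed=1):
    """Build a pseudorandom oddball sequence (~20% deviants) split into
    equal-sized blocks, plus a jittered inter-stimulus interval per trial."""
    rng = random.Random(seed)
    trials = ["deviant"] * n_deviants + ["standard"] * n_standards
    rng.shuffle(trials)
    per_block = len(trials) // n_blocks          # 810 / 15 = 54 trials per block
    blocks = [trials[i * per_block:(i + 1) * per_block]
              for i in range(n_blocks)]
    # ISI jittered uniformly over the reported 645-709 ms range (in seconds).
    isis = [rng.uniform(0.645, 0.709) for _ in trials]
    return blocks, isis

blocks, isis = make_oddball_sequence()
```

A real experiment would typically add further constraints (e.g., no deviant as the first trial of a block, or a minimum number of standards between deviants), which are omitted here.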


The amplitudes of the subtraction waves (deviant minus standard) were compared across the four participants.

High-Definition tDCS Procedure We applied high-definition (HD) tDCS using the Soterix 1 × 1 tDCS Low-Intensity Stimulator with a Soterix 4 × 1 adaptor (Soterix Medical, New York, NY). The stimulating ring electrodes were placed around the frontocentral scalp region (C3, C4, F3, and F4 as cathodes and FCz as the anode). The sintered Ag/AgCl ring electrodes were held in place by an EEG cap with HD-tDCS electrode holders (Soterix Medical, New York, NY). The impedance quality values of all five electrodes were in the range of 0.50–0.80. The anode (FCz) delivered a total current of 1 mA, and the four return electrodes shared the same total current of 1 mA (0.25 mA each). The duration of stimulation was 10 min. We chose an intensity of 1 mA because HD-tDCS delivers more focal stimulation (Edwards, Cortes, Datta, Minhas, Wassermann, & Bikson, 2013) and because a lower intensity reduces the tingling sensation that many children with ASD might not tolerate. Furthermore, a single 20-min session of conventional tDCS at 1 mA has been found to elicit significant behavioral improvement in children with ASD (Amatachaya et al., 2015).
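The epoching and subtraction-wave computation described above can be sketched in NumPy as follows. The synthetic data and array shapes are illustrative assumptions built from the reported parameters (500 Hz sampling, epochs from 200 ms prestimulus to 700 ms poststimulus, 65 channels); this is not the BESA pipeline actually used, and artifact rejection is omitted.

```python
import numpy as np

FS = 500                            # sampling rate (Hz)
PRE, POST = 0.2, 0.7                # epoch window: -200 ms to +700 ms
N_SAMP = round((PRE + POST) * FS)   # 450 samples per epoch
N_BASE = round(PRE * FS)            # 100 prestimulus samples for baseline

def baseline_correct(epochs):
    """Subtract each epoch's mean over the prestimulus window.
    epochs: array of shape (n_trials, n_channels, n_samples)."""
    baseline = epochs[:, :, :N_BASE].mean(axis=2, keepdims=True)
    return epochs - baseline

def difference_wave(deviant_epochs, standard_epochs):
    """Deviant-minus-standard subtraction wave, as used to estimate the MMN."""
    return (baseline_correct(deviant_epochs).mean(axis=0)
            - baseline_correct(standard_epochs).mean(axis=0))

# Toy data: 100 deviant and 400 standard trials over 65 channels.
rng = np.random.default_rng(0)
deviants = rng.normal(size=(100, 65, N_SAMP))
standards = rng.normal(size=(400, 65, N_SAMP))
mmn = difference_wave(deviants, standards)   # shape: (65, N_SAMP)
```

The resulting (channels × samples) difference wave is what the topographical maps in Fig. 13.2 summarize across the scalp at each time point.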

13.5.2 Preliminary Results

P1-N1-P2 Results Figure 13.1 shows the responses to the standard stimuli for all four participants before and after tDCS stimulation. The two TD children showed a P1 (80–100 ms), followed by an N1 (around 150 ms) and a P2 (220 ms). The N1 was attenuated compared to adults but was clearly emerging for these 10-year-old children (note that the N1 is attenuated at short interstimulus intervals for children under 10 years of age). The two children with ASD showed only a broad P1 peak followed by an N2 around 250 ms, a pattern typically observed in younger TD children. After tDCS, there was a negative shift in the response at Fz after about 100 ms for all participants, and the response at the left mastoid (LM) showed the inverse pattern. The children with ASD showed a somewhat different pattern: ASD01 showed a pattern similar to that of the TD children from about 200 to 350 ms, whereas ASD02 appeared to show an increased response (i.e., greater Fz negativity and greater LM positivity) from 200 to 600 ms.

MMN Results Figure 13.2 shows the topography of the subtraction (deviant minus standard) waves. Before tDCS stimulation, the two TD participants showed a negative response (blue) between 150 and 250 ms at the frontocentral scalp region. Within the same time window, a predominantly positive response (red) was evidenced in both participants with ASD. A robust negativity at the superior central scalp region was


Fig. 13.1 Grand average ERPs to the standard stimulus waveforms for the Fz and left mastoid. The top panel shows the two control participants, and the bottom panel shows the two ASD participants

seen in both children with ASD between 300 and 400 ms. Post-tDCS, the peak of the negativity shifted earlier in both TD children and in one of the children with ASD (Participant D in Fig. 13.2). Due to the small sample size, we can only speculate about what these differences indicate. The first, superior negative peak was possibly the MMN; it was observed in both the children with ASD and the TD children, but it appeared in a much later time window, both before and after HD-tDCS, for the children with ASD than for the TD children. The latency of the negativity in the two children with ASD was very similar to that of the four-to-seven-year-old TD children in Shafer et al. (2010). Younger children (infancy to seven years of age) often show a positive mismatch response, sometimes alone (infants) and sometimes preceding the MMN (Morr et al., 2002; Shafer, Yu, & Garrido-Nag, 2012). In addition, the negativity in the two children with ASD may correspond in timing to the late negativity (LN) observed in children with specific language impairment in Shafer, Morr, Datta, Kurtzberg, and Schwartz (2005). This pattern might indicate a developmental delay in processing complex speech sounds. The ERP amplitude changes post-HD-tDCS suggest that both the children with ASD and the TD children responded to the HD-tDCS treatment. In addition, these preliminary data suggest that this is a promising approach for examining the effect of HD-tDCS treatment on lexical tone processing.


Fig. 13.2 Topographical voltage maps of the subtraction wave (deviant minus standard) between 150 and 400 ms after stimulus onset (reference-free; 0.10 μV/step). Red shows positivity, and blue shows negativity. A and B are typical control children, and C and D are participants with ASD

13.6 General Discussion and Future Directions

The literature on auditory processing in Chinese/tonal language-speaking individuals with ASD is sparse: only a few studies have examined perception or production, and very few have attempted to address both non-linguistic basic auditory pitch processing and higher-level linguistic pitch and prosody processing. Further research needs first to replicate the findings of these few studies before the evidence can be synthesized. Even so, some preliminary comparisons can be drawn between the general literature on auditory processing in children with ASD and these studies of Chinese-speaking children with ASD.

13.6.1 Developmental Prosody and Lexical Tone Processing in Children with ASD

Phonetic features that distinguish lexical tone categories vary along several spectral and temporal dimensions (e.g., onset F0, offset F0, F0 contour, and contour duration). TD children make few production errors for Mandarin lexical tones in picture naming tasks as early as 1.6 years of age (Hua & Dodd, 2000). However, based on results from perceptual judgment tasks, 6-year-old Mandarin-speaking children could not yet produce adultlike lexical tones, nor had they reached adultlike perception levels (Wong, 2013; Wong, Schwartz, & Jenkins, 2005). MMN responses to lexical tone change are likewise not adultlike in school-aged children (Liu et al., 2014). Developmental changes in lexical tone perception in Chinese-learning children with ASD have been addressed in only a few studies. Certain prosodic cues (e.g., prosodic cues for irony) and lexical tone contrasts (e.g., the difference between Tone 2 and Tone 3 in Mandarin) are intrinsically more


challenging to learn, even for TD children. As shown in Li et al. (2013), TD children between 8 and 12 years of age do not demonstrate high comprehension accuracy for prosodic cues encoding irony. In Su et al. (2014), the older children with HFA performed at or near ceiling, as well as the TD controls, for all types of sentence processing; performance gaps were evident only between the younger children with and without ASD. Such findings point to a developmental delay rather than a persistent deficit in children with ASD when processing statement sentences that contain wh-words (e.g., "what," or "shenme" in Mandarin). Future studies are needed to chart the developmental trajectories of various prosodic cues and lexical categories in tonal language-learning children, both with and without ASD. This step is necessary before determining how to use this information to enhance clinical practice.

13.6.2 Lexical Tone and Music Processing in ASD

Mounting evidence suggests that the brain networks governing language processing can be adapted for musical processing (Koelsch et al., 2002; Maess, Koelsch, Gunter, & Friederici, 2001; Patel, Gibson, Ratner, Besson, & Holcomb, 1998). We suggest that future studies explore this relationship in children from tonal language backgrounds. Frameworks such as the shared syntactic integration resource hypothesis (SSIRH) propose that the neural resources shared between music and language processing are located in frontal brain regions and are recruited "when structural integration of incoming elements in a sequence is costly"; that is, these shared networks are the "processing regions" for structural integration in linguistic syntax and tonal harmony (Patel, 2003, 2014). For healthy individuals, language experience fine-tunes the production and processing of critical auditory elements in both speech and non-speech domains. For example, Mandarin speakers are better at imitating and discriminating musical pitch than English speakers (Pfordresher & Brown, 2009), and they are more sensitive to both lexical pitch and non-speech (harmonic) pitch category boundaries (Xu, Gandour, & Francis, 2006a). Mutual enhancement between language and music, as seen in musicians and tonal language speakers, has been widely reported (Bidelman et al., 2013). Theoretically, such mutual enhancement should also benefit music and speech processing in individuals with ASD and other communication disorders. Accumulating evidence suggests that individuals with ASD from tonal language backgrounds show the same profile of superior melodic contour processing and inferior linguistic prosody processing observed in individuals with ASD from non-tonal language backgrounds (Järvinen-Pasley & Heaton, 2007; Jiang et al., 2015; Heaton, 2005).
The disparity between music and speech-processing skills leads researchers to believe that linguistic pitch and musical pitch are processed differently in individuals with ASD, and that tonal language experience does not compensate for the linguistic prosody processing deficit in individuals with ASD


(Jiang et al., 2015). Currently, there is no direct cross-language comparison; further cross-language studies are needed to test directly whether individuals with ASD from tonal language backgrounds have an advantage over their counterparts from non-tonal language backgrounds in musical and linguistic prosody processing. Positive transfer from music training to speech processing has been evidenced in TD children with non-tonal language backgrounds (Moreno et al., 2009). The OPERA model proposed by Patel (2011) posits that acoustic features shared by speech and music, such as F0 changes over time, are processed in an anatomically overlapping network. Patel further argued that music perception "place[s] higher demands on the encoding of certain acoustic features than does speech perception" for adequate communication. However, there is no evidence on whether this claim applies to all types of languages, especially tonal languages, in which high precision of pitch perception and production is a constant demand. Further studies should examine whether there is positive transfer from lexical tone learning to music processing, or from music training to lexical tone learning, in individuals with ASD. Answers to such questions will provide evidence for testing theory and for designing therapeutic frameworks for treating tonal language speakers with ASD.

13.6.3 Effects of Experimental Variables on Chinese Individuals with ASD

So far, all nine studies on Chinese-speaking individuals with ASD have recruited individuals with HFA; no study has yet focused on minimally verbal individuals with ASD, who account for about one-third of the ASD population. Our own work, one of the few existing studies, suggests that some English-exposed children with ASD and minimal verbal skills have intact (although slightly delayed) lower-level visual and auditory processing skills (Cantiani et al., 2016). Unfortunately, the tasks employed in studies examining prosody and lexical tone production and processing often require functional communication skills (e.g., answering questions) as well as competency in social interaction (e.g., initiating conversation). Such demands preclude the participation of children with LFA and poor verbal skills. Neurophysiological methods, including EEG/ERP methods, can use tasks that do not require a response (e.g., the passive oddball paradigm used by Zhang's research group, reviewed above, provides a good alternative for testing individuals with LFA). However, individuals with LFA often have sensory sensitivities, and some cannot tolerate wearing the sensor net; on the other hand, it is possible to desensitize some of these children so that they can tolerate electrodes (see Roesler et al., 2013). Multiple studies have conducted correlational analyses of the relationship between prosodic characteristics and severity of autistic symptoms; some did not find a correlation between prosody and the severity of ASD, but given that few studies have included children with LFA, it remains unclear whether a relationship exists. This shortcoming is particularly important given that the definition of autism


does not differentiate HFA from LFA under the Diagnostic and Statistical Manual of Mental Disorders, 5th edition (DSM-5). Recently, Eigsti and Fein (2013) compared pitch discrimination sensitivity among teenagers with optimal outcomes, teenagers with HFA, and age-matched TD controls, and found that superior pitch perceptual skill correlated with ASD symptomatology. Specifically, teenagers who had been diagnosed with ASD before 5 years of age but who no longer had autistic symptoms at the time of testing performed the same as the controls, whereas teenagers who retained the ASD diagnosis continued to demonstrate heightened pitch discrimination. The heterogeneous nature of this disorder calls for further studies examining the features of subgroups on the spectrum. As the world becomes more multilingual, the proportion of bilingual individuals with ASD will also increase. Language development in bilingual children exposed to two languages from birth follows a different developmental trajectory starting in the first year of life (Bosch & Sebastián-Gallés, 2003; Shafer et al., 2012). There is evidence of both positive and negative transfer between first and second language phonologies (Hambly, Wren, McLeod, & Roulstone, 2013). A wider pitch range and higher average pitch have been found in English-speaking, German-speaking, and Mandarin-speaking monolingual individuals with ASD and in Hindi–English bilinguals with ASD, but the opposite pattern was evidenced in Japanese-speaking children with ASD (Baltaxe & Simmons, 1985; Chan & To, 2016; Green & Tobin, 2009; Sharda et al., 2010; Nakai et al., 2014). It will be of theoretical and clinical interest to compare the pitch perception and lexical tone production of bilingual children with those of monolingual Chinese-speaking individuals, as in Chan and To (2016). The different language backgrounds of bilingual children with ASD (e.g., two tonal languages vs.
one tonal language plus one non-tonal language; balanced bilinguals vs. bilinguals dominant in one language) would presumably lead to different hypotheses regarding the associations among pitch perception, music perception, and lexical tone production. Stimulus complexity and task demands are both known to influence auditory processing. An acoustically less salient lexical tone pair such as Tone 2 versus Tone 3 can be more challenging for non-native listeners than an acoustically more salient pair such as Tone 1 versus Tone 3 (Chandrasekaran, Krishnan, & Gandour, 2007; Yu et al., 2017). Tone sandhi (e.g., when two Tone 3 syllables occur in a row, the first is produced as Tone 2) further complicates the production and perception of the Tone 2 versus Tone 3 contrast (see Chap. 7 of this book by Chang & Kuo). Speech versus non-speech comparisons are routinely used to separate domain-general from language-specific auditory processing. The lexical status of the stimuli (e.g., word vs. nonword) seems to play a subtle role in the nature of the cortical response to such contrasts in individuals with ASD (Wang et al., 2017; Yu et al., 2015). Stimulus complexity often interacts with attention and/or memory demands in children with ASD and other learning disorders (Čeponienė et al., 2003; Whitehouse & Bishop, 2008). Systematic investigation using carefully controlled tasks and varying stimulus complexity will have important implications for theory and clinical application.
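The Tone 3 sandhi rule mentioned above can be illustrated with a deliberately simplified sketch; real sandhi application also depends on prosodic and syntactic structure (see Chap. 7 of this book), so this is only the basic adjacent-syllable case.

```python
def apply_tone3_sandhi(tones):
    """Simplified Mandarin Tone 3 sandhi: a Tone 3 immediately followed by
    another Tone 3 (in the underlying sequence) surfaces as Tone 2.
    `tones` is a list of tone numbers, one per syllable."""
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface

apply_tone3_sandhi([3, 3])     # Tone 3 + Tone 3 surfaces as Tone 2 + Tone 3
apply_tone3_sandhi([3, 1, 3])  # no adjacent Tone 3s, so no change
```

This surface neutralization is exactly why Tone 2 versus Tone 3 is a difficult contrast for perception studies: the same underlying Tone 3 can be realized as either tone depending on context.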


13.6.4 Theoretical Implications

Higher perception accuracy, faster response times, and larger brain responses to relatively simple, domain-general perceptual contrasts in ASD than in controls are taken as evidence for enhanced lower-level perceptual processing (Mottron, Dawson, Soulieres, Hubert, & Burack, 2006; Mottron et al., 2013). Conversely, less robust brain responses and less accurate behavioral responses to domain-specific complex stimuli, such as vocal pitch and phonemic/semantic discrimination, in some individuals with ASD are taken as evidence for deficits in information integration for global processing, or for deficits in higher-level linguistic processing (Cantiani et al., 2016). The neural complexity hypothesis suggests that superior performance for simple tone processing in the primary auditory cortex, alongside impaired complex perceptual performance in the associative cortex, is autism-specific (Bertone, Mottron, Jelenic, & Faubert, 2005). Happé and Frith (2006) proposed the weak central coherence (WCC) theory, referring to the detail-focused processing bias in individuals with ASD: a bias toward local, detail-focused, lower-level information over global, meaning-focused, higher-level information may negatively affect higher-level integrated processing. Research is needed to determine whether such a bias in fact has this impact. A deficit in theory of mind (ToM) is a hallmark of ASD, as mentioned above. A meta-analysis of ToM development showed a two-year or greater difference in the developmental timing of false belief performance between Chinese-speaking children and children in North American culture (Liu, Wellman, Tardif, & Sabbagh, 2008).
This timing difference suggests that learning a tonal language such as Cantonese or Mandarin, or growing up in Chinese culture, enhances the development of ToM. However, there is no direct evidence on whether learning a tonal language enhances the development of ToM in children with ASD; future studies need to test ToM in Chinese-speaking individuals with ASD directly. The three neurophysiological studies on Mandarin-speaking children with ASD are consistent with the WCC theory mentioned above. You and colleagues proposed that, unlike children with other common developmental disorders such as dyslexia and developmental language disorder (DLD), children with ASD do not have a categorical perception (CP) deficit but instead have a categorical precision (CPR) deficit, which permits the perception of allophonic differences (You, Serniclaes, Rider, & Chabane, 2017). This proposal was based on evidence that children with ASD made more categorical judgments than TD controls on a natural vowel continuum, yet did not show categorical perception deficits for vowels and consonants. One important issue is whether the vowel versus consonant difference in categorical precision extends to lexical tone, since lexical tone is largely superimposed on the vowel portion of the syllable. You et al. (2017) used a four-parameter logistic model including perceptual boundary, slope, and asymptotes of the identification


function to assess categorical precision. Due to differences in analysis, it is unclear whether the children with ASD in Chen et al. (2016) also had a CPR deficit for lexical tone processing. It is clear, however, that children both with and without ASD were near chance when discriminating within-category lexical tone contrasts (Fig. 4 in Chen et al., 2016), and that children with ASD had shallower identification slopes, suggesting continuous perception rather than CP. Recently, there has been a lively discussion of spectral versus temporal auditory processing in the ASD literature (e.g., Huang et al., 2017; Kasai et al., 2005; Lambrechts et al., 2017; Lepistö et al., 2006; see Haesen et al., 2011 for a review). Spectral information is generally processed in the right hemisphere, whereas the rapid temporal dynamics of speech are processed predominantly in the left hemisphere (Zatorre, Evans, Meyer, & Gjedde, 1992). However, the acoustic properties of auditory cues interact with the function of the cues in complex ways. As Zatorre and Gandour (2008) pointed out, the right hemisphere dominance for pitch processing may shift to the left when pitch is linguistically relevant, as in lexical tone. Indeed, Mandarin speakers have shown larger left hemisphere responses to between-category lexical tone contrasts and marginally larger right hemisphere responses to within-category lexical tone contrasts (e.g., Xi, Zhang, Shu, Zhang, & Li, 2010). Frequency differences, carried by the fundamental frequency (F0), are the primary cues distinguishing the four lexical tones in Mandarin; they differ in this respect from the spectral differences that distinguish vowels and consonants. There are also intrinsic durational differences among the four lexical tones.
For example, as reported in Shen (1990) and other studies, Tone 3 is consistently longer than Tone 2, while Tone 4 is the shortest of the four (Lin, 1965; Shen, 1990). Future studies should examine how these intrinsic durational differences among lexical tone categories in Chinese influence the development of speech perception and production in individuals with ASD. Studies comparing the neural responses to spectral versus durational cues for lexical categories would provide further evidence about the neural specialization related to atypical lexical tone processing in Chinese-speaking individuals with ASD. We currently have no data on neuroplasticity in response to intervention, although children diagnosed with ASD usually receive language and behavioral intervention. Two days of perceptual training on Thai lexical tones altered the neural responses to the trained tones both in native speakers of a tonal language (Mandarin Chinese) and in native speakers of a non-tonal language (English) (Kaan et al., 2008). Furthermore, research shows an association between the range of brain activation and the degree of lexical tone learning in speech and word training paradigms (Wong, Perrachione, & Parrish, 2007). Novel neural stimulation methods, such as the HD-tDCS method presented in the pilot study above, can also provide a valuable way to investigate intervention-related neuroplasticity in Chinese-speaking individuals with ASD.
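As a sketch of the kind of analysis discussed in this section, a four-parameter logistic identification function (lower and upper asymptotes, category boundary, and slope) can be fit to identification proportions along a tone continuum. The synthetic data below are illustrative assumptions, not data from You et al. (2017) or Chen et al. (2016).

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic4(x, lower, upper, boundary, slope):
    """Four-parameter logistic identification function.
    `boundary` is the category boundary; `slope` indexes how categorical
    identification is (a shallower slope suggests more continuous perception)."""
    return lower + (upper - lower) / (1.0 + np.exp(-slope * (x - boundary)))

# Synthetic identification proportions along an 11-step continuum.
steps = np.arange(1, 12, dtype=float)
rng = np.random.default_rng(42)
observed = np.clip(logistic4(steps, 0.05, 0.95, 6.0, 1.5)
                   + rng.normal(0.0, 0.02, size=steps.size), 0.0, 1.0)

params, _ = curve_fit(logistic4, steps, observed,
                      p0=[0.0, 1.0, 6.0, 1.0], maxfev=10000)
lower, upper, boundary, slope = params
```

Comparing the fitted slopes (and how far the asymptotes sit from 0 and 1) across groups is one concrete way to quantify a categorical precision difference without requiring a separate discrimination task.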


13.7 Conclusion

The behavioral and electrophysiological evidence accumulated thus far on ASD in Chinese/tonal languages suggests that brain activation patterns and behavioral measures of lexical tone perception are atypical in Chinese-speaking individuals with ASD, and that these individuals also show atypical pitch production similar to that of non-tonal language speakers with ASD. Future studies need to systematically examine stimulus effects, the relationship between lexical tone and music, and brain plasticity in response to lexical tone learning in order to fill the vast gaps in the literature on Chinese-speaking individuals with ASD.

Appendix 1. Behavioral Studies on Prosody and Lexical Tone in Chinese ASD

Chan and To (2016)
Participants: 19 Cantonese adults with HFA; 19 controls
Method: Stimuli: Frog books. Task: Narrative production
Results: Larger pitch variations in HFA; no difference in the amount and diversity of SFPs

Chen et al. (2016)
Participants: 11 boys with ASD and 14 TD, 6–8 years old, Mandarin-speaking
Method: Stimuli: /i/ in Tone 1 and Tone 2, 11 steps from 230 to 290 Hz. Tasks: Pitch contour identification and discrimination
Results: Wider ID boundary and poor between-category discrimination accuracy in children with ASD

Jiang et al. (2015)
Participants: 17 Mandarin-speaking individuals (age: 6.0–16.2) with HFA and 17 age-matched controls
Method: Five music tone sequences (DISC and ID); disyllabic verb–object linguistic pitch contours (DISC and ID)
Results: Melodic contour ID: ASD > TD. Melodic contour DISC: ASD = TD. Intonation DISC: ASD < TD. Intonation ID: ASD < TD

Su et al. (2014)
Participants: Younger group: 14 ASD, 14 TD. Older group: 14 ASD, 14 TD
Method: Computer-based question/statement task; 4 conditions: 2 (prosody/semantics) × 2 (question/statement)
Results: Statement reading with wh-words: ASD-young < TD; ASD-old = TD-old; TD-old = TD-young. Question reading with wh-words: no group differences

Li et al. (2013)
Participants: 13 Cantonese children with ASD and 13 age-matched TD (8.3–12.9)
Method: 16 stories covering 4 conditions: prosody-only, SFP-only, both, and neither
Results: Judgment of speaker's belief: ASD = TD. Judgment of speaker's intention: TD well above chance in the "both" and "SFP-only" conditions, slightly above chance for "prosody-only," below chance for "neither"; ASD below chance in all conditions

Wang and Tsao (2015)
Participants: 25 TD, 25 HFA ASD; 6–11 years
Method: Emotional prosody identification; pragmatic and social adaptive abilities; language abilities
Results: Neutral semantic condition: word context, ASD = TD; sentence context, ASD < TD in identifying happy prosody. Emotionally relevant semantic condition: word context, ASD < TD in identifying happy prosody; sentence context, ASD = TD. Positive correlation between prosody identification skills and pragmatic function


Appendix 2. Neurophysiological Studies on Mandarin Lexical Tone Processing in Children with ASD

Study: Yu et al. (2015)

Participants: EXP1: 17 ASD (6.9–12.4 years), 15 TD (7.7–11.8 years); EXP2: 16 ASD (7.9–12 years), 18 TD (6.9–12.4 years)

Method: EXP1: pure tone (216–299 Hz); real word /bai2/–/bai4/; nonword /rai2/–/rai4/. EXP2: hummed speech /bai2/–/bai4/

Results: EXP1 MMN: pure tone, enhanced at Cz but not Fz, with marginally shorter latency; speech (real word), diminished; speech (nonword), MMN only for TD at Fz. EXP1 P3a: pure tone, enhanced at Cz; real word, enhanced at Cz and delayed. EXP2 MMN: enhanced at Cz. EXP2 P3a: reduced at Cz and marginally delayed.

Study: Wang et al. (2017)

Participants: 16 ASD, 15 TD; 8–13 years

Method: /ba2/–/ba4/ continuum; steps 1, 5, and 9

Results: Speech MMN: TD, between-category > within-category; ASD, similar MMN latency and amplitude for between- and within-category. Harmonic MMN: amplitude ASD > TD. P3a: no group or stimulus-condition differences. Inter-trial phase coherence (ITPC): theta-band speech, ITPC ASD > TD, with TD showing between-category > within-category and ASD showing between-category = within-category; theta-band non-speech, ITPC ASD > TD; beta-band speech, ITPC ASD > TD; beta-band non-speech, ITPC ASD > TD.

Study: Huang et al. (2017)

Participants: Pure tone: 22 ASD (9.6 years), 20 TD (9.4 years); vowel: 18 ASD (9.8 years), 17 TD (9.4 years)

Method: Pure-tone duration contrast (350 ms vs. 250 ms); vowel duration contrast

Results: MMN: pure tone, diminished and delayed; vowel, no group difference. P3a: pure tone, no group difference; vowel, diminished.
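Inter-trial phase coherence (ITPC), reported in the Wang et al. (2017) entry above, measures how consistently the EEG phase in a frequency band lines up across trials: at a given frequency it is the magnitude of the trial-averaged unit phase vector, ranging from 0 (random phases) to 1 (perfect alignment). The sketch below is a minimal, purely illustrative simulation of this standard definition with synthetic single-frequency "trials"; it is not the authors' analysis pipeline, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 500                      # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)   # 1-s epochs
n_trials = 40

def itpc_at(trials, freq, fs):
    """ITPC at one frequency: |mean over trials of exp(i * phase)|, in [0, 1]."""
    n = trials.shape[1]
    probe = np.exp(-2j * np.pi * freq * np.arange(n) / fs)
    phases = np.angle(trials @ probe)          # one phase estimate per trial
    return np.abs(np.mean(np.exp(1j * phases)))

# Phase-locked condition: a 5 Hz (theta-band) component with small phase jitter plus noise
locked = np.stack([np.sin(2 * np.pi * 5 * t + rng.normal(0, 0.3))
                   + 0.5 * rng.standard_normal(t.size) for _ in range(n_trials)])

# Unlocked condition: the same component with uniformly random phase on each trial
unlocked = np.stack([np.sin(2 * np.pi * 5 * t + rng.uniform(0, 2 * np.pi))
                     + 0.5 * rng.standard_normal(t.size) for _ in range(n_trials)])

itpc_locked = itpc_at(locked, 5, fs)     # near 1: phases align across trials
itpc_unlocked = itpc_at(unlocked, 5, fs) # near 1/sqrt(n_trials): no alignment
```

Because ITPC depends only on phase, not amplitude, a group can show larger ITPC (as reported for the ASD group) without larger evoked amplitudes.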

270

Y. H. Yu and V. L. Shafer

References

Alexander, J. A., Bradlow, A. R., Ashley, R. D., & Wong, P. (2011). Music-melody perception in tone-language and non-tone-language speakers. In Proceedings of the Psycholinguistic Representation of Tone Conference (pp. 1–4), Hong Kong.
Amatachaya, A., Jensen, M. P., Patjanasoontorn, N., Auvichayapat, N., Suphakunpinyo, C., Janjarasjitt, S., … Auvichayapat, P. (2015). The short-term effects of transcranial direct current stimulation on electroencephalography in children with autism: A randomized crossover controlled trial. Behavioural Neurology, 2015, 928631.
American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). https://doi.org/10.1176/appi.books.9780890425596.
Asperger, H. (1944). Autistic psychopathy in childhood. In U. Frith (Ed.), Autism and Asperger syndrome (pp. 37–92). Cambridge, UK: Cambridge University Press.
Baltaxe, C., & Simmons, J. (1985). Prosodic development in normal and autistic children. In E. Schopler & G. Mesibov (Eds.), Communication problems in autism (pp. 95–125). New York: Plenum.
Baron-Cohen, S., Leslie, A. M., & Frith, U. (1985). Does the autistic child have a "theory of mind"? Cognition, 21(1), 37–46.
Bertone, A., Mottron, L., Jelenic, P., & Faubert, J. (2005). Enhanced and diminished visuo-spatial information processing in autism depends on stimulus complexity. Brain, 128(10), 2430–2441. https://doi.org/10.1093/brain/awh561.
Bidelman, G. M., Hutka, S., & Moreno, S. (2013). Tone language speakers and musicians share enhanced perceptual and cognitive abilities for musical pitch: Evidence for bidirectionality between the domains of language and music. PLoS ONE, 8(4), e60676. https://doi.org/10.1371/journal.pone.0060676.
Bikson, M., Inoue, M., Akiyama, H., Deans, J. K., Fox, J. E., Miyakawa, H., & Jefferys, J. G. R. (2004). Effects of uniform extracellular DC electric fields on excitability in rat hippocampal slices in vitro. The Journal of Physiology, 557(Pt 1), 175–190.
Bindman, L. J., Lippold, O. C., & Redfearn, J. W. (1964). The action of brief polarizing currents on the cerebral cortex of the rat (1) during current flow and (2) in the production of long-lasting after-effects. The Journal of Physiology, 172, 369–382.
Bishop, D. V. M., Hardiman, M., Uwer, R., & von Suchodoletz, W. (2007). Maturation of the long-latency auditory ERP: Step function changes at start and end of adolescence. Developmental Science, 10(5), 565–575.
Boets, B., Verhoeven, J., Wouters, J., & Steyaert, J. (2015). Fragile spectral and temporal auditory processing in adolescents with autism spectrum disorder and early language delay. Journal of Autism and Developmental Disorders, 45(6), 1845–1857.
Bone, D., Lee, C.-C., Black, M. P., Williams, M. E., Lee, S., Levitt, P., & Narayanan, S. (2014). The psychologist as an interlocutor in autism spectrum disorder assessment: Insights from a study of spontaneous prosody. Journal of Speech, Language, and Hearing Research, 57(4), 1162–1177.
Bonnel, A., Mottron, L., Peretz, I., Trudel, M., Gallun, E., & Bonnel, A. M. (2003). Enhanced pitch sensitivity in individuals with autism: A signal detection analysis. Journal of Cognitive Neuroscience, 15(2), 226–235. https://doi.org/10.1162/089892903321208169.
Bonnel, A., McAdams, S., Smith, B., Berthiaume, C., Bertone, A., Ciocca, V., et al. (2010). Enhanced pure-tone pitch discrimination among persons with autism but not Asperger syndrome. Neuropsychologia, 48, 2465–2475.
Bosch, L., & Sebastián-Gallés, N. (1997). Native-language recognition abilities in 4-month-old infants from monolingual and bilingual environments. Cognition, 65(1), 33–69.
Bosch, L., & Sebastián-Gallés, N. (2003). Simultaneous bilingualism and the perception of a language-specific vowel contrast in the first year of life. Language and Speech, 46(Pt 2–3), 217–243. https://doi.org/10.1177/00238309030460020801.
Brodeur, D. A., Gordon Green, C., Flores, H., & Burack, J. A. (2014). Time estimation among low-functioning individuals with autism spectrum disorders: Evidence of poor sensitivity to variability of short durations. Autism Research, 7(2), 237–244.
Bruneau, N., Bonnet-Brilhault, F., Gomot, M., Adrien, J.-L., & Barthélémy, C. (2003). Cortical auditory processing and communication in children with autism: Electrophysiological/behavioral relations. International Journal of Psychophysiology, 51(1), 17–25.
Bruneau, N., Roux, S., Adrien, J. L., & Barthélémy, C. (1999). Auditory associative cortex dysfunction in children with autism: Evidence from late auditory evoked potentials (N1 wave-T complex). Clinical Neurophysiology, 110(11), 1927–1934.
Burnham, D., & Francis, E. (1997). The role of linguistic experience in the perception of Thai tones. In A. Abramson (Ed.). Bangkok: Chulalongkorn University Press.
Cantiani, C., Choudhury, N. A., Yu, Y. H., Shafer, V. L., Schwartz, R. G., & Benasich, A. A. (2016). From sensory perception to lexical-semantic processing: An ERP study in non-verbal children with autism. PLoS ONE, 11(8), e0161637. https://doi.org/10.1371/journal.pone.0161637.
Čeponienė, R., Lepistö, T., Shestakova, A., Vanhala, R., Alku, P., Näätänen, R., & Yaguchi, K. (2003). Speech-sound-selective auditory impairment in children with autism: They can perceive but do not attend. Proceedings of the National Academy of Sciences of the United States of America, 100(9), 5567–5572. https://doi.org/10.1073/pnas.0835631100.
Čeponienė, R., Rinne, T., & Näätänen, R. (2002). Maturation of cortical sound processing as indexed by event-related potentials. Clinical Neurophysiology, 113(6), 870–882.
Chan, C. Y., Hounsgaard, J., & Nicholson, C. (1988). Effects of electric fields on transmembrane potential and excitability of turtle cerebellar Purkinje cells in vitro. The Journal of Physiology, 402, 751–771.
Chan, K. K. L., & To, C. K. S. (2016). Do individuals with high-functioning autism who speak a tone language show intonation deficits? Journal of Autism and Developmental Disorders, 46(5), 1784–1792. https://doi.org/10.1007/s10803-016-2709-5.
Chandrasekaran, B., Krishnan, A., & Gandour, J. T. (2007). Mismatch negativity to pitch contours is influenced by language experience. Brain Research, 1128(1), 148–156. https://doi.org/10.1016/j.brainres.2006.10.064.
Chang, C. H. C., & Kuo, W.-J. (2020). Neural processing of tone sandhi in production and perception: The case of Mandarin tone 3 sandhi. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese language. Berlin: Springer.
Chao, Y. R. (1968). A grammar of spoken Chinese. Berkeley, CA: University of California Press.
Chen, F., Yan, N., Pan, X., Yang, F., Ji, Z., Wang, L., & Peng, G. (2016). Impaired categorical perception of Mandarin tones and its relationship to language ability in autism spectrum disorders. In Interspeech (pp. 233–237).
Cheng, Y.-Y., Wu, H.-C., Tzeng, Y.-L., Yang, M.-T., Zhao, L.-L., & Lee, C.-Y. (2013). The development of mismatch responses to Mandarin lexical tones in early infancy. Developmental Neuropsychology, 38(5), 281–300.
Cheng, Y.-Y., Wu, H.-C., Tzeng, Y.-L., Yang, M.-T., Zhao, L.-L., & Lee, C.-Y. (2015). Feature-specific transition from positive mismatch response to mismatch negativity in early infancy: Mismatch responses to vowels and initial consonants. International Journal of Psychophysiology, 96(2), 84–94.
Chevallier, C., Noveck, I., Happé, F., & Wilson, D. (2011). What's in a voice? Prosody as a test case for the theory of mind account of autism. Neuropsychologia, 49(3), 507–517. https://doi.org/10.1016/j.neuropsychologia.2010.11.042.
Choudhury, N., & Benasich, A. A. (2011). Maturation of auditory evoked potentials from 6 to 48 months: Prediction to 3 and 4 year language and cognitive abilities. Clinical Neurophysiology, 122(2), 320–338.
Crystal, D., & Quirk, R. (1964). Systems of prosodic and paralinguistic features in English (pp. 10–12). Mouton.


Cummins, N., Scherer, S., Krajewski, J., Schnieder, S., Epps, J., & Quatieri, T. F. (2015). A review of depression and suicide risk assessment using speech analysis. Speech Communication, 71, 10–49.
Curtin, S., & Vouloumanos, A. (2013). Speech preference is associated with autistic-like behavior in 18-month-olds at risk for autism spectrum disorder. Journal of Autism and Developmental Disorders, 43(9), 2114–2120.
Curtin, S., & Werker, J. F. (2018). PRIMIR on tone. Frontiers in Psychology, 9, 1007. https://doi.org/10.3389/fpsyg.2018.01007.
Cutler, A., & Bruck, A. (1974). On saying what you mean without meaning what you say. In M. Galy & R. Fox (Eds.) (pp. 117–127). Chicago, Illinois.
Cutler, A., & Isard, S. D. (1980). The production of prosody (Vol. 1, pp. 245–269). London: Academic Press.
Diehl, J. J., Berkovits, L., & Harrison, A. (2010). Is prosody a diagnostic and cognitive bellwether of autism spectrum disorders? In Speech disorders: Causes, treatments, and social effects (pp. 159–176). New York: Nova Science.
Diehl, J. J., & Paul, R. (2013). Acoustic and perceptual measurements of prosody production on the profiling elements of prosodic systems in children by children with autism spectrum disorders. Applied Psycholinguistics, 34(1), 135–161.
Diehl, J. J., Watson, D. G., Bennetto, L., McDonough, J., & Gunlogson, C. (2009). An acoustic analysis of prosody in high-functioning autism. Applied Psycholinguistics, 30, 385–404.
Edwards, D., Cortes, M., Datta, A., Minhas, P., Wassermann, E. M., & Bikson, M. (2013). Physiological and modeling evidence for focal transcranial electrical brain stimulation in humans: A basis for high-definition tDCS. NeuroImage, 74, 266–275. https://doi.org/10.1016/j.neuroimage.2013.01.042.
Eigsti, I. M., & Fein, D. A. (2013). More is less: Pitch discrimination and language delays in children with optimal outcomes from autism. Autism Research, 6(6), 605–613. https://doi.org/10.1002/aur.1324.
Eigsti, I. M., Schuh, J., Mencl, E., Schultz, R. T., & Paul, R. (2012). The neural underpinnings of prosody in autism. Child Neuropsychology, 18(6), 600–617. https://doi.org/10.1080/09297049.2011.639757.
Falter, C. M., Noreika, V., Wearden, J. H., & Bailey, A. J. (2012). More consistent, yet less sensitive: Interval timing in autism spectrum disorders. Quarterly Journal of Experimental Psychology, 65(11), 2093–2107. https://doi.org/10.1080/17470218.2012.690770.
Ferri, R., Elia, M., Agarwal, N., Lanuzza, B., Musumeci, S. A., & Pennisi, G. (2003). The mismatch negativity and the P3a components of the auditory event-related potentials in autistic low-functioning subjects. Clinical Neurophysiology, 114(9), 1671–1680.
Filipe, M. G., Watson, L., Vicente, S. G., & Frota, S. (2017). Atypical preference for infant-directed speech as an early marker of autism spectrum disorders? A literature review and directions for further research. Clinical Linguistics & Phonetics, 1–19.
Friederici, A. D., Friedrich, M., & Weber, C. (2002). Neural manifestation of cognitive and precognitive mismatch detection in early infancy. NeuroReport, 13(10), 1251–1254.
Friedrich, M., Herold, B., & Friederici, A. D. (2009). ERP correlates of processing native and non-native language word stress in infants with different language outcomes. Cortex, 45(5), 662–676.
Fuchigami, T., Okubo, O., Ejiri, K., Fujita, Y., Kohira, R., Noguchi, Y., … Harada, K. (1995). Developmental changes in P300 wave elicited during two different experimental conditions. Pediatric Neurology, 13(1), 25–28.
Fusaroli, R., Lambrechts, A., Bang, D., Bowler, D. M., & Gaigg, S. B. (2017). Is voice a marker for autism spectrum disorder? A systematic review and meta-analysis. Autism Research, 10(3), 384–407.
Gandour, J. T. (1983). Tone perception in Far Eastern languages. Journal of Phonetics, 11, 149–175.


Gandour, J. T. (1984). Tone dissimilarity judgments by Chinese listeners. Journal of Chinese Linguistics, 12, 235–261.
Gandour, J. T., & Harshman, R. A. (1978). Cross-language differences in tone perception: A multidimensional scaling investigation. Language and Speech, 21, 1–33.
Gómez, L., Vidal, B., Maragoto, C., Morales, L. M., Berrillo, S., Vera Cuesta, H., … Robinson, M. (2017). Non-invasive brain stimulation for children with autism spectrum disorders: A short-term outcome study. Behavioral Sciences (Basel, Switzerland), 7(3).
Gomot, M., Belmonte, M. K., Bullmore, E. T., Bernard, F. A., & Baron-Cohen, S. (2008). Brain hyper-reactivity to auditory novel targets in children with high-functioning autism. Brain, 131(Pt 9), 2479–2488. https://doi.org/10.1093/brain/awn172.
Green, H., & Tobin, Y. (2009). Prosodic analysis is difficult … but worth it: A study in high functioning autism. International Journal of Speech-Language Pathology, 11(4), 308–315.
Grossman, R. B., & Tager-Flusberg, H. (2012). "Who said that?" Matching of low- and high-intensity emotional prosody to facial expressions by adolescents with ASD. Journal of Autism and Developmental Disorders, 42(12), 2546–2557.
Haesen, B., Boets, B., & Wagemans, J. (2011). A review of behavioural and electrophysiological studies on auditory processing and speech perception in autism spectrum disorders. Research in Autism Spectrum Disorders, 5(2), 701–714.
Hallé, P. A., Chang, Y.-C., & Best, C. T. (2004). Identification and discrimination of Mandarin Chinese tones by Mandarin Chinese vs. French listeners. Journal of Phonetics, 32(3), 395–421.
Hambly, H., Wren, Y., McLeod, S., & Roulstone, S. (2013). The influence of bilingualism on speech production: A systematic review. International Journal of Language & Communication Disorders, 48(1), 1–24.
Happé, F., & Frith, U. (2006). The weak coherence account: Detail-focused cognitive style in autism spectrum disorders. Journal of Autism and Developmental Disorders, 36(1), 5–25. https://doi.org/10.1007/s10803-005-0039-0.
Happé, F. G. (1993). Communicative competence and theory of mind in autism: A test of relevance theory. Cognition, 48, 101–119.
Heaton, P. (2003). Pitch memory, labeling and disembedding in autism. Journal of Child Psychology and Psychiatry, 44(4), 543–551.
Heaton, P. (2005). Interval and contour processing in autism. Journal of Autism and Developmental Disorders, 35, 787–793.
Heaton, P., Hermelin, B., & Pring, L. (1999). Can children with autistic spectrum disorders perceive affect in music? An experimental investigation. Psychological Medicine, 29(6), 1405–1410.
Höhle, B., Bijeljac-Babic, R., Herold, B., Weissenborn, J., & Nazzi, T. (2009). Language specific prosodic preferences during the first half year of life: Evidence from German and French infants. Infant Behavior & Development, 32(3), 262–274.
Hua, Z., & Dodd, B. (2000). The phonological acquisition of Putonghua (Modern Standard Chinese). Journal of Child Language, 27(1), 3–42.
Huang, A. X., Jia, M., & Wheeler, J. J. (2013). Children with autism in the People's Republic of China: Diagnosis, legal issues, and educational services. Journal of Autism and Developmental Disorders, 43(9), 1991–2001. https://doi.org/10.1007/s10803-012-1722-6.
Huang, D., Yu, L., Wang, X., Fan, Y., Wang, S., & Zhang, Y. (2017). Distinct patterns of discrimination and orienting for temporal processing of speech and nonspeech in Chinese children with autism: An event-related potential study. The European Journal of Neuroscience. https://doi.org/10.1111/ejn.13657.
Jacobson, L., Koslowsky, M., & Lavidor, M. (2012). tDCS polarity effects in motor and cognitive domains: A meta-analytical review. Experimental Brain Research, 216(1), 1–10. https://doi.org/10.1007/s00221-011-2891-9.
Jansson-Verkasalo, E., Čeponienė, R., Kielinen, M., Suominen, K., Jäntti, V., Linna, S. L., … Näätänen, R. (2003). Deficient auditory processing in children with Asperger syndrome, as indexed by event-related potentials. Neuroscience Letters, 338(3), 197–200.


Järvinen-Pasley, A., & Heaton, P. (2007). Evidence for reduced domain-specificity in auditory processing in autism. Developmental Science, 10(6), 786–793.
Järvinen-Pasley, A., Peppé, S., King-Smith, G., & Heaton, P. (2008). The relationship between form and function level receptive prosodic abilities in autism. Journal of Autism and Developmental Disorders, 38, 1328–1340.
Jiang, J., Liu, F., Wan, X., & Jiang, C. (2015). Perception of melodic contour and intonation in autism spectrum disorder: Evidence from Mandarin speakers. Journal of Autism and Developmental Disorders, 45(7), 2067–2075. https://doi.org/10.1007/s10803-015-2370-4.
Jusczyk, P. W. (1997). The discovery of spoken language. Cambridge: MIT Press.
Jusczyk, P. W., & Aslin, R. N. (1995). Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29(1), 1–23. https://doi.org/10.1006/cogp.1995.1010.
Jusczyk, P. W., Cutler, A., & Redanz, N. J. (1993). Infants' preference for the predominant stress patterns of English words. Child Development, 64(3), 675–687.
Kaan, E., Barkley, C. M., Bao, M., & Wayland, R. (2008). Thai lexical tone perception in native speakers of Thai, English and Mandarin Chinese: An event-related potentials training study. BMC Neuroscience, 9, 53. https://doi.org/10.1186/1471-2202-9-53.
Kanner, L. (1943). Autistic disturbances of affective contact. Nervous Child, 2, 217–250.
Kasai, K., Hashimoto, O., Kawakubo, Y., Yumoto, M., Kamio, S., Itoh, K., Koshida, I., Iwanami, A., Nakagome, K., Fukuda, M., Yamasue, H., Yamada, H., Abe, O., Aoki, S., & Kato, N. (2005). Delayed automatic detection of change in speech sounds in adults with autism: A magnetoencephalographic study. Clinical Neurophysiology, 116(7), 1655–1664. https://doi.org/10.1016/j.clinph.2005.03.007.
Koelsch, S., Gunter, T. C., Cramon, D. Y., Zysset, S., Lohmann, G., & Friederici, A. D. (2002). Bach speaks: A cortical "language-network" serves the processing of music. NeuroImage, 17(2), 956–966.
Kuhl, P. K., Coffey-Corina, S., Padden, D., & Dawson, G. (2005). Links between social and linguistic processing of speech in preschool children with autism: Behavioral and electrophysiological measures. Developmental Science, 8(1), F1–F12.
Kujala, T., Aho, E., Lepistö, T., Jansson-Verkasalo, E., Nieminen-von Wendt, T., von Wendt, L., & Näätänen, R. (2007). Atypical pattern of discriminating sound features in adults with Asperger syndrome as reflected by the mismatch negativity. Biological Psychology, 75(1), 109–114.
Kushnerenko, E., Čeponienė, R., Balan, P., Fellman, V., & Näätänen, R. (2002). Maturation of the auditory change detection response in infants: A longitudinal ERP study. NeuroReport, 13(15), 1843–1848.
Kushnerenko, E., Winkler, I., Horváth, J., Näätänen, R., Pavlov, I., Fellman, V., & Huotilainen, M. (2007). Processing acoustic change and novelty in newborn infants. The European Journal of Neuroscience, 26(1), 265–274. https://doi.org/10.1111/j.1460-9568.2007.05628.x.
Kwok, H. (1984). Sentence particles in Cantonese. Hong Kong: Centre of Asian Studies, The University of Hong Kong.
Lambrechts, A., Falter-Wagner, C. M., & van Wassenhove, V. (2017). Diminished neural resources allocation to time processing in autism spectrum disorders. NeuroImage: Clinical. https://doi.org/10.1016/j.nicl.2017.09.023.
Leather, J. (1983). Speaker normalization in perception of lexical tones. Journal of Phonetics, 11, 373–382.
Lee, C.-Y., & Cheng, Y.-Y. (2020). Neurophysiological studies of Mandarin lexical tone acquisition in early childhood. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese language. Berlin: Springer.
Lee, Y. S., Vakoch, D., & Wurm, L. H. (1996). Tone perception in Cantonese and Mandarin: A cross-linguistic comparison. Journal of Psycholinguistic Research, 25(5), 527–542.
Lepistö, T., Kujala, T., Vanhala, R., Alku, P., Huotilainen, M., & Näätänen, R. (2005). The discrimination of and orienting to speech and non-speech sounds in children with autism. Brain Research, 1066(1–2), 147–157.


Lepistö, T., Silokallio, S., Nieminen-von Wendt, T., Alku, P., Näätänen, R., & Kujala, T. (2006). Auditory perception and attention as reflected by the brain event-related potentials in children with Asperger syndrome. Clinical Neurophysiology, 117(10), 2161–2171. https://doi.org/10.1016/j.clinph.2006.06.709.
Leppänen, P. H. T., Richardson, U., Pihko, E., Eklund, K. M., Guttorm, T. K., Aro, M., & Lyytinen, H. (2002). Brain responses to changes in speech sound durations differ between infants with and without familial risk for dyslexia. Developmental Neuropsychology, 22(1), 407–422. https://doi.org/10.1207/S15326942dn2201_4.
Li, J. P. W., Law, T., Lam, G. Y. H., & To, C. K. S. (2013). Role of sentence-final particles and prosody in irony comprehension in Cantonese-speaking children with and without autism spectrum disorders. Clinical Linguistics & Phonetics, 27(1), 18–32. https://doi.org/10.3109/02699206.2012.734893.
Lin, M. C. (1965). The pitch indicator and the pitch characteristics of tones in Standard Chinese. Acta Acoustica (China), 2, 8–15.
Liu, D., Wellman, H. M., Tardif, T., & Sabbagh, M. A. (2008). Theory of mind development in Chinese children: A meta-analysis of false-belief understanding across cultures and languages. Developmental Psychology, 44(2), 523–531. https://doi.org/10.1037/0012-1649.44.2.523.
Liu, H.-M., Chen, Y., & Tsao, F.-M. (2014). Developmental changes in mismatch responses to Mandarin consonants and lexical tones from early to middle childhood. PLoS ONE, 9(4), e95587. https://doi.org/10.1371/journal.pone.0095587.
Lord, C., Rutter, M., DiLavore, P. C., Risi, S., Gotham, K., & Bishop, S. L. (2012). Autism diagnostic observation schedule (2nd ed.). Torrance, CA: Western Psychological Services.
Määttä, S., Saavalainen, P., Könönen, M., Pääkkönen, A., Muraja-Murro, A., & Partanen, J. (2005). Processing of highly novel auditory events in children and adults: An event-related potential study. NeuroReport, 16(13), 1443–1446.
Maess, B., Koelsch, S., Gunter, T. C., & Friederici, A. D. (2001). Musical syntax is processed in Broca's area: An MEG study. Nature Neuroscience, 4(5), 540–545.
Maister, L., & Plaisted-Grant, K. C. (2011). Time perception and its relationship to memory in autism spectrum conditions. Developmental Science, 14(6), 1311–1322.
McCann, J., & Peppé, S. (2003). Prosody in autism spectrum disorders: A critical review. International Journal of Language and Communication Disorders, 38, 325–350.
McCann, J., Peppé, S., Gibbon, F. E., O'Hare, A., & Rutherford, M. (2007). Prosody and its relationship to language in school-aged children with high-functioning autism. International Journal of Language & Communication Disorders, 42(6), 682–702.
Mehler, J., Jusczyk, P., Lambertz, G., Halsted, N., Bertoncini, J., & Amiel-Tison, C. (1988). A precursor of language acquisition in young infants. Cognition, 29(2), 143–178.
Moreno, S., Marques, C., Santos, A., Santos, M., Castro, S. L., & Besson, M. (2009). Musical training influences linguistic abilities in 8-year-old children: More evidence for brain plasticity. Cerebral Cortex, 19(3), 712–723. https://doi.org/10.1093/cercor/bhn120.
Morr, M. L., Shafer, V. L., Kreuzer, J. A., & Kurtzberg, D. (2002). Maturation of mismatch negativity in typically developing infants and preschool children. Ear and Hearing, 23(2), 118–136.
Mottron, L., Bouvet, L., Bonnel, A., Samson, F., Burack, J. A., Dawson, M., & Heaton, P. (2013). Veridical mapping in the development of exceptional autistic abilities. Neuroscience and Biobehavioral Reviews, 37(2), 209–228. https://doi.org/10.1016/j.neubiorev.2012.11.016.
Mottron, L., Dawson, M., Soulières, I., Hubert, B., & Burack, J. (2006). Enhanced perceptual functioning in autism: An update, and eight principles of autistic perception. Journal of Autism and Developmental Disorders, 36(1), 27–43. https://doi.org/10.1007/s10803-005-0040-7.
Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology, 118(12), 2544–2590.
Näätänen, R., & Picton, T. (1987). The N1 wave of the human electric and magnetic response to sound: A review and an analysis of the component structure. Psychophysiology, 24(4), 375–425.


Nadig, A., & Shaw, H. (2012). Acoustic and perceptual measurement of expressive prosody in high-functioning autism: Increased pitch range and what it means to listeners. Journal of Autism and Developmental Disorders, 42(4), 499–511. https://doi.org/10.1007/s10803-011-1264-3.
Nakai, Y., Takashima, R., Takiguchi, T., & Takada, S. (2014). Speech intonation in children with autism spectrum disorder. Brain & Development, 36(6), 516–522. https://doi.org/10.1016/j.braindev.2013.07.006.
Norise, C., & Hamilton, R. H. (2016). Non-invasive brain stimulation in the treatment of post-stroke and neurodegenerative aphasia: Parallels, differences, and lessons learned. Frontiers in Human Neuroscience, 10, 675. https://doi.org/10.3389/fnhum.2016.00675.
Oades, R. D., Walker, M. K., Geffen, L. B., & Stern, L. M. (1988). Event-related potentials in autistic and healthy children on an auditory choice reaction time task. International Journal of Psychophysiology, 6(1), 25–37.
Ong, J. H., Tan, S. H., Chan, A. H. D., & Wong, F. C. K. (2020). The effect of musical experience and congenital amusia on lexical tone perception, production, and learning: A review. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese language. Berlin: Springer.
Patel, A. D. (2003). Language, music, syntax and the brain. Nature Neuroscience, 6(7), 674–681. https://doi.org/10.1038/nn1082.
Patel, A. D. (2011). Why would musical training benefit the neural encoding of speech? The OPERA hypothesis. Frontiers in Psychology, 2, 142. https://doi.org/10.3389/fpsyg.2011.00142.
Patel, A. D. (2014). Can nonlinguistic musical training change the way the brain processes speech? The expanded OPERA hypothesis. Hearing Research, 308, 98–108. https://doi.org/10.1016/j.heares.2013.08.011.
Patel, A. D., Gibson, E., Ratner, J., Besson, M., & Holcomb, P. J. (1998). Processing syntactic relations in language and music: An event-related potential study. Journal of Cognitive Neuroscience, 10(6), 717–733.
Paul, R., Augustyn, A., Klin, A., & Volkmar, F. R. (2005). Perception and production of prosody by speakers with autism spectrum disorders. Journal of Autism and Developmental Disorders, 35(2), 205–220.
Paul, R., Chawarska, K., Fowler, C., Cicchetti, D., & Volkmar, F. (2007). Listen to my children and you shall hear: Auditory preferences in toddlers with autism spectrum disorders. Journal of Speech, Language and Hearing Research, 50, 1350–1364.
Pfordresher, P. Q., & Brown, S. (2009). Enhanced production and perception of musical pitch in tone language speakers. Attention, Perception & Psychophysics, 71(6), 1385–1398. https://doi.org/10.3758/APP.71.6.1385.
Pierrehumbert, J. B. (1980). The phonology and phonetics of English intonation (MIT dissertation).
Ponton, C., Eggermont, J. J., Khosla, D., Kwong, B., & Don, M. (2002). Maturation of human central auditory system activity: Separating auditory evoked potentials by dipole source modeling. Clinical Neurophysiology, 113(3), 407–420.
Ponton, C. W., Eggermont, J. J., Kwong, B., & Don, M. (2000). Maturation of human central auditory system activity: Evidence from multi-channel evoked potentials. Clinical Neurophysiology, 111(2), 220–236.
Purpura, D. P., & McMurtry, J. G. (1965). Intracellular activities and evoked potential changes during polarization of motor cortex. Journal of Neurophysiology, 28, 166–185.
Quigley, J., McNally, S., & Lawson, S. (2016). Prosodic patterns in interaction of low-risk and at-risk-of-autism spectrum disorders infants and their mothers at 12 and 18 months. Language Learning and Development, 12(3), 295–310.
Rahman, A., Toshev, P. K., & Bikson, M. (2014). Polarizing cerebellar neurons with transcranial direct current stimulation. Clinical Neurophysiology, 125(3), 435–438. https://doi.org/10.1016/j.clinph.2013.10.003.


Reato, D., Rahman, A., Bikson, M., & Parra, L. C. (2013). Effects of weak transcranial alternating current stimulation on brain activity—A review of known mechanisms from animal studies. Frontiers in Human Neuroscience, 7, 687.
Roesler, C. P., Flax, J., MacRoy-Higgins, M., Fermano, Z., Morgan-Byrne, J., & Benasich, A. A. (2013). Sensory desensitization training for successful net application and EEG/ERP acquisition in difficult to test children. Communication Disorders Quarterly, 35(1), 14–20.
Rutherford, M. D., Baron-Cohen, S., & Wheelwright, S. (2002). Reading the mind in the voice: A study with normal adults and adults with Asperger syndrome and high functioning autism. Journal of Autism & Developmental Disorders, 32, 189–194.
Rutter, M., Le Couteur, A., & Lord, C. (2003). ADI-R. Autism diagnostic interview revised. Manual. Los Angeles: Western Psychological Services.
Sambeth, A., Ruohio, K., Alku, P., Fellman, V., & Huotilainen, M. (2008). Sleeping newborns extract prosody from continuous speech. Clinical Neurophysiology: Official Journal of the International Federation of Clinical Neurophysiology, 119(2), 332–341. https://doi.org/10.1016/j.clinph.2007.09.144.
Scherg, M., & Von Cramon, D. (1986). Evoked dipole source potentials of the human auditory cortex. Electroencephalography and Clinical Neurophysiology, 65(5), 344–360.
Schneider, H. D., & Hopp, J. P. (2011). The use of the Bilingual Aphasia Test for assessment and transcranial direct current stimulation to modulate language acquisition in minimally verbal children with autism. Clinical Linguistics & Phonetics, 25(6–7), 640–654. https://doi.org/10.3109/02699206.2011.570852.
Shafer, V. L., Morr, M. L., Kreuzer, J. A., & Kurtzberg, D. (2000). Maturation of mismatch negativity in school-age children. Ear and Hearing, 21(3), 242–251.
Shafer, V., Morr, M., Datta, H., Kurtzberg, D., & Schwartz, R. G. (2005). Neurophysiological indexes of speech processing deficits in children with specific language impairment. Journal of Cognitive Neuroscience, 17, 1168–1180. https://doi.org/10.1162/0898929054475217.
Shafer, V. L., Shucard, D. W., & Jaeger, J. J. (1999). Electrophysiological indices of cerebral specialization and the role of prosody in language acquisition in 3-month-old infants. Developmental Neuropsychology, 15(1), 73–109.
Shafer, V. L., Yu, Y. H., & Datta, H. (2010). Maturation of speech discrimination in 4- to 7-yr-old children as indexed by event-related potential mismatch responses. Ear and Hearing, 31(6), 735–745. https://doi.org/10.1097/AUD.0b013e3181e5d1a7.
Shafer, V. L., Yu, Y. H., & Datta, H. (2011). The development of English vowel perception in monolingual and bilingual infants: Neurophysiological correlates. Journal of Phonetics, 39(4), 527–545. https://doi.org/10.1016/j.wocn.2010.11.010.
Shafer, V. L., Yu, Y. H., & Garrido-Nag, K. (2012). Neural mismatch indices of vowel discrimination in monolingually and bilingually exposed infants: Does attention matter? Neuroscience Letters, 526(1), 10–14. https://doi.org/10.1016/j.neulet.2012.07.064.
Sharda, M., Subhadra, T. P., Sahay, S., Nagaraja, C., Singh, L., Mishra, R., … Singh, N. C. (2010). Sounds of melody—Pitch patterns of speech in autism. Neuroscience Letters, 478(1), 42–45. https://doi.org/10.1016/j.neulet.2010.04.066.
Sharma, A., Glick, H., Deeves, E., & Duncan, E. (2015). The P1 biomarker for assessing cortical maturation in pediatric hearing loss: A review. Otorinolaringologia, 65(4), 103–114.
Shen, X. S. (1990). The prosody of Mandarin Chinese (Vol. 118). Berkeley: University of California Press.
Simmons, J. Q., & Baltaxe, C. (1975). Language patterns of adolescent autistics. Journal of Autism and Childhood Schizophrenia, 5(4), 333–351.
Singh, L. (2020). Early word recognition and word learning in Mandarin learning children. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese languages. Berlin: Springer.
Soltaninejad, Z., Nejati, V., & Ekhtiari, H. (2015). Effect of anodal and cathodal transcranial direct current stimulation on DLPFC on modulation of inhibitory control in ADHD. Journal of Attention Disorders. https://doi.org/10.1177/1087054715618792.


Y. H. Yu and V. L. Shafer

Stanutz, S., Wapnick, J., & Burack, J. A. (2014). Pitch discrimination and melodic memory in children with autism spectrum disorders. Autism: The International Journal of Research and Practice, 18(2), 137–147. https://doi.org/10.1177/1362361312462905.
Stefanics, G., Háden, G. P., Sziller, I., Balázs, L., Beke, A., & Winkler, I. (2009). Newborn infants process pitch intervals. Clinical Neurophysiology: Official Journal of the International Federation of Clinical Neurophysiology, 120(2), 304–308. https://doi.org/10.1016/j.clinph.2008.11.020.
Stevens, C. J., Keller, P. E., & Tyler, M. D. (2013). Tonal language background and detecting pitch contour in spoken and musical items. Psychology of Music, 41(1), 59–74. https://doi.org/10.1177/0305735611415749.
Su, Y. (Esther), Jin, Y., Wan, G.-B., Zhang, J.-S., & Su, L.-Y. (2014). Interpretation of wh-words in Mandarin-speaking high-functioning children with autism spectrum disorders. Research in Autism Spectrum Disorders, 8(10), 1364–1372.
Sun, X., Allison, C., Matthews, F. E., Sharp, S. J., Auyeung, B., Baron-Cohen, S., & Brayne, C. (2013). Prevalence of autism in mainland China, Hong Kong and Taiwan: A systematic review and meta-analysis. Molecular Autism, 4(1), 7.
Szelag, E., Kowalska, J., Galkowski, T., & Pöppel, E. (2004). Temporal processing deficits in high-functioning children with autism. British Journal of Psychology (London, England: 1953), 95(Pt 3), 269–282. https://doi.org/10.1348/0007126041528167.
Tao, K. T. (1987). Brief report: Infantile autism in China. Journal of Autism and Developmental Disorders, 17(2), 289–296.
Thiessen, E. D., Hill, E. A., & Saffran, J. R. (2005). Infant-directed speech facilitates word segmentation. Infancy, 7, 53–71.
Thorsen, N. G. (1980). A study of perception of sentence intonation—Evidence from Danish. The Journal of the Acoustical Society of America, 67(3), 1014–1030.
Titze, I. R. (1994). Principles of voice production. Englewood Cliffs, N.J.: Prentice Hall.
Tsao, F.-M., & Liu, H.-M. (2020). Lexical tone perception development in infancy. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese languages. Berlin: Springer.
Vaissière, J. (1983). Language-independent prosodic features. In Prosody: Models and measurements (pp. 53–66). https://doi.org/10.1007/978-3-642-69103-4_5.
Vouloumanos, A., & Werker, J. F. (2007). Listening to language at birth: Evidence for a bias for speech in neonates. Developmental Science, 10(2), 159–164.
Wang, A. T., Dapretto, M., Hariri, A. R., Sigman, M., & Bookheimer, S. Y. (2004). Neural correlates of facial affect processing in children and adolescents with autism spectrum disorder. Journal of the American Academy of Child and Adolescent Psychiatry, 43(4), 481–490. https://doi.org/10.1097/00004583-200404000-00015.
Wang, A. T., Lee, S. S., Sigman, M., & Dapretto, M. (2006). Neural basis of irony comprehension in children with autism: The role of prosody and context. Brain, 129(4), 932–943.
Wang, A. T., Lee, S. S., Sigman, M., & Dapretto, M. (2007). Reading affect in the face and voice: Neural correlates of interpreting communicative intent in children and adolescents with autism spectrum disorders. Archives of General Psychiatry, 64(6), 698–708. https://doi.org/10.1001/archpsyc.64.6.698.
Wang, J. E., & Tsao, F. M. (2015). Emotional prosody perception and its association with pragmatic language in school-aged children with high-function autism. Research in Developmental Disabilities, 37, 162–170. https://doi.org/10.1016/j.ridd.2014.11.013.
Wang, X., Wang, S., Fan, Y., Huang, D., & Zhang, Y. (2017). Speech-specific categorical perception deficit in autism: An event-related potential study of lexical tone processing in Mandarin-speaking children. Scientific Reports, 7, 43254. https://doi.org/10.1038/srep43254.
Wayland, R. P., & Guion, S. G. (2004). Training English and Chinese listeners to perceive Thai tones: A preliminary report. Language Learning, 54(4), 681–712.
Whitehouse, A. J. O., & Bishop, D. V. M. (2008). Do children with autism “switch off” to speech sounds? An investigation using event-related potentials. Developmental Science, 11(4), 516–524. https://doi.org/10.1111/j.1467-7687.2008.00697.x.

13 Behavioral and Neurophysiological Evidence of Speech …


Wong, P. C. M., Perrachione, T. K., & Parrish, T. B. (2007). Neural characteristics of successful and less successful speech and word learning in adults. Human Brain Mapping, 28(10), 995–1006. https://doi.org/10.1002/hbm.20330.
Wong, P. (2013). Perceptual evidence for protracted development in monosyllabic Mandarin lexical tone production in preschool children in Taiwan. The Journal of the Acoustical Society of America, 133(1), 434–443.
Wong, P., Schwartz, R. G., & Jenkins, J. J. (2005). Perception and production of lexical tones by 3-year-old, Mandarin-speaking children. Journal of Speech, Language, and Hearing Research: JSLHR, 48(5), 1065–1079. https://doi.org/10.1044/1092-4388(2005/074).
World Health Organization. (2004). ICD-10: International statistical classification of diseases and related health problems: Tenth revision, 2nd ed. World Health Organization. https://apps.who.int/iris/handle/10665/42980.
Xi, J., Zhang, L., Shu, H., Zhang, Y., & Li, P. (2010). Categorical perception of lexical tones in Chinese revealed by mismatch negativity. Neuroscience, 170(1), 223–231. https://doi.org/10.1016/j.neuroscience.2010.06.077.
Xu, Y., Gandour, J. T., & Francis, A. L. (2006a). Effects of language experience and stimulus complexity on the categorical perception of pitch direction. The Journal of the Acoustical Society of America, 120(2), 1063–1074.
Xu, Y., Krishnan, A., & Gandour, J. T. (2006b). Specificity of experience-dependent pitch representation in the brainstem. NeuroReport, 17(15), 1601–1605. https://doi.org/10.1097/01.wnr.0000236865.31705.3a.
Yau, S. C. (1980). Sentential connotations in Cantonese. Fangyan, 1, 35–52.
You, R. S., Serniclaes, W., Rider, D., & Chabane, N. (2017). On the nature of the speech perception deficits in children with autism spectrum disorders. Research in Developmental Disabilities, 61, 158–171. https://doi.org/10.1016/j.ridd.2016.12.009.
Yu, K., Wang, R., & Li, P. (2020). Native and nonnative processing of acoustic and phonological information of lexical tones in Chinese: Behavioral and neural correlates. In H. M. Liu, F. M. Tsao, & P. Li (Eds.), Speech perception, production and acquisition—Multidisciplinary approaches in Chinese languages. Berlin: Springer.
Yu, L., Fan, Y., Deng, Z., Huang, D., Wang, S., & Zhang, Y. (2015). Pitch processing in tonal-language-speaking children with autism: An event-related potential study. Journal of Autism and Developmental Disorders, 45(11), 3656–3667. https://doi.org/10.1007/s10803-015-2510-x.
Yu, Y. H., Shafer, V. L., & Sussman, E. S. (2017). Neurophysiological and behavioral responses of Mandarin lexical tone processing. Frontiers in Neuroscience, 11, 95. https://doi.org/10.3389/fnins.2017.00095.
Zatorre, R. J., Evans, A. C., Meyer, E., & Gjedde, A. (1992). Lateralization of phonetic and pitch discrimination in speech processing. Science (New York, N.Y.), 256(5058), 846–849.
Zatorre, R. J., & Gandour, J. T. (2008). Neural specializations for speech and pitch: Moving beyond the dichotomies. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 363(1493), 1087–1104. https://doi.org/10.1098/rstb.2007.2161.
Zhang, X., & Ji, C. (2005). Autism and mental retardation of young children in China. Biomedical and Environmental Sciences, 18(5), 334.