The Handbook of Speech Perception

Blackwell Handbooks in Linguistics

This outstanding multivolume series covers all the major subdisciplines within linguistics today and, when complete, will offer a comprehensive survey of linguistics as a whole. The most recent publications in the series can be found below. To see the full list of titles available in the series, please visit www.wiley.com/go/linguistics‐handbooks

The Handbook of Clinical Linguistics Edited by Martin J. Ball, Michael R. Perkins, Nicole Müller, & Sara Howard

The Handbook of English Pronunciation Edited by Marnie Reed & John M. Levis

The Handbook of Pidgin and Creole Studies Edited by Silvia Kouwenberg & John Victor Singler

The Handbook of Discourse Analysis, Second Edition Edited by Deborah Tannen, Heidi E. Hamilton, & Deborah Schiffrin

The Handbook of Language Teaching Edited by Michael H. Long & Catherine J. Doughty

The Handbook of Bilingual and Multilingual Education Edited by Wayne E. Wright, Sovicheth Boun, & Ofelia Garcia

The Handbook of Computational Linguistics and Natural Language Processing Edited by Alexander Clark, Chris Fox, & Shalom Lappin

The Handbook of Portuguese Linguistics Edited by W. Leo Wetzels, Joao Costa, & Sergio Menuzzi

The Handbook of Language and Globalization Edited by Nikolas Coupland

The Handbook of Hispanic Sociolinguistics Edited by Manuel Diaz‐Campos

The Handbook of Language Socialization Edited by Alessandro Duranti, Elinor Ochs, & Bambi B. Schieffelin

The Handbook of Intercultural Discourse and Communication Edited by Christina Bratt Paulston, Scott F. Kiesling, & Elizabeth S. Rangel

The Handbook of Historical Sociolinguistics Edited by Juan Manuel Hernandez‐Campoy & Juan Camilo Conde‐Silvestre

The Handbook of Hispanic Linguistics Edited by Jose Ignacio Hualde, Antxon Olarrea, & Erin O'Rourke

The Handbook of Conversation Analysis Edited by Jack Sidnell & Tanya Stivers

The Handbook of English for Specific Purposes Edited by Brian Paltridge & Sue Starfield

The Handbook of Spanish Second Language Acquisition Edited by Kimberly L. Geeslin

The Handbook of Chinese Linguistics Edited by C.‐T. James Huang, Y.‐H. Audrey Li, & Andrew Simpson

The Handbook of Language Emergence Edited by Brian MacWhinney & William O'Grady

The Handbook of Korean Linguistics Edited by Lucien Brown & Jaehoon Yeon

The Handbook of Speech Production Edited by Melissa A. Redford

The Handbook of Contemporary Semantic Theory, Second Edition Edited by Shalom Lappin & Chris Fox

The Handbook of Classroom Discourse and Interaction Edited by Numa Markee

The Handbook of Narrative Analysis Edited by Anna De Fina & Alexandra Georgakopoulou

The Handbook of Translation and Cognition Edited by John W. Schwieter & Aline Ferreira

The Handbook of Linguistics, Second Edition Edited by Mark Aronoff & Janie Rees‐Miller

The Handbook of Technology and Second Language Teaching and Learning Edited by Carol A. Chapelle & Shannon Sauro

The Handbook of Psycholinguistics Edited by Eva M. Fernandez & Helen Smith Cairns

The Handbook of Dialectology Edited by Charles Boberg, John Nerbonne, & Dominic Watt

The Handbook of the Neuroscience of Multilingualism Edited by John W. Schwieter

The Handbook of English Linguistics, Second Edition Edited by Bas Aarts, April McMahon, & Lars Hinrichs

The Handbook of Language Contact, Second Edition Edited by Raymond Hickey

The Handbook of Informal Language Learning Edited by Mark Dressman & Randall William Sadler

The Handbook of World Englishes, Second Edition Edited by Braj B. Kachru, Yamuna Kachru, & Cecil L. Nelson

The Handbook of TESOL in K‐12 Edited by Luciana C. de Oliveira

The Handbook of Asian Englishes Edited by Kingsley Bolton, Werner Botha, & Andy Kirkpatrick

The Handbook of Historical Linguistics Edited by Richard D. Janda, Brian D. Joseph, & Barbara S. Vance

The Handbook of Advanced Proficiency in Second Language Acquisition Edited by Paul A. Malovrh & Alessandro G. Benati

The Handbook of Language and Speech Disorders, Second Edition Edited by Jack S. Damico, Nicole Müller, & Martin J. Ball

The Handbook of Speech Perception, Second Edition Edited by Jennifer S. Pardo, Lynne C. Nygaard, Robert E. Remez, & David B. Pisoni

The Handbook of Speech Perception Second Edition

Edited by

Jennifer S. Pardo, Lynne C. Nygaard, Robert E. Remez, and David B. Pisoni

This second edition first published 2021
© 2021 John Wiley & Sons, Inc.

Edition History
Blackwell Publishing Ltd (1e, 2005)

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Jennifer S. Pardo, Lynne C. Nygaard, Robert E. Remez, and David B. Pisoni to be identified as the authors of the editorial material in this work has been asserted in accordance with law.

Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging‐in‐Publication Data
Names: Pardo, Jennifer S., editor. | Nygaard, Lynne C., editor. | Remez, Robert Ellis, editor. | Pisoni, David B., editor.
Title: The handbook of speech perception / edited by Jennifer S. Pardo, Lynne C. Nygaard, Robert E. Remez, and David B. Pisoni.
Description: Second edition. | Hoboken, NJ : Wiley-Blackwell, 2021. | Series: Blackwell handbooks in linguistics | Includes bibliographical references and index.
Identifiers: LCCN 2021008797 (print) | LCCN 2021008798 (ebook) | ISBN 9781119184089 (cloth) | ISBN 9781119184072 (adobe pdf) | ISBN 9781119184102 (epub)
Subjects: LCSH: Speech perception.
Classification: LCC P37.5.S68 H36 2021 (print) | LCC P37.5.S68 (ebook) | DDC 401/.95–dc23
LC record available at https://lccn.loc.gov/2021008797
LC ebook record available at https://lccn.loc.gov/2021008798

Cover Design: Wiley
Cover Image: Aleksandra Ekster, Non-objective composition/Wikiart

Set in 10/12pt Palatino by SPi Global, Pondicherry, India

10 9 8 7 6 5 4 3 2 1

Contents

List of Contributors
Foreword to the Second Edition
Foreword to the First Edition
Preface

Part I Sensing Speech
1 Perceptual Organization of Speech
  Robert E. Remez
2 Primacy of Multimodal Speech Perception for the Brain and Science
  Lawrence D. Rosenblum and Josh Dorsi
3 How Does the Brain Represent Speech?
  Oiwi Parker Jones and Jan W. H. Schnupp
4 Perceptual Control of Speech
  K. G. Munhall, Anja‐Xiaoxing Cui, Ellen O'Donoghue, Steven Lamontagne, and David Lutes

Part II Perception of Linguistic Properties
5 Features in Speech Perception and Lexical Access
  Sheila E. Blumstein
6 Speaker Normalization in Speech Perception
  Keith Johnson and Matthias J. Sjerps
7 Clear Speech Perception: Linguistic and Cognitive Benefits
  Rajka Smiljanic
8 A Comprehensive Approach to Specificity Effects in Spoken‐Word Recognition
  Conor T. McLennan and Sara Incera
9 Word Stress in Speech Perception
  Anne Cutler and Alexandra Jesse
10 Slips of the Ear
  Z. S. Bond
11 Phonotactics in Spoken‐Word Recognition
  Michael S. Vitevitch and Faisal M. Aljasser
12 Perception of Formulaic Speech: Structural and Prosodic Characteristics of Formulaic Expressions
  Diana Van Lancker Sidtis and Seung Yun Yang

Part III Perception of Indexical Properties
13 Perception of Dialect Variation
  Cynthia G. Clopper
14 Who We Are: Signaling Personal Identity in Speech
  Diana Van Lancker Sidtis and Romi Zäske
15 Perceptual Integration of Linguistic and Non‐Linguistic Properties of Speech
  Lynne C. Nygaard and Christina Y. Tzeng
16 Perceptual Learning of Accented Speech
  Tessa Bent and Melissa Baese‐Berk
17 Perception of Indexical Properties of Speech by Children
  Susannah V. Levi

Part IV Speech Perception by Special Listeners
18 Speech Perception by Children: The Structural Refinement and Differentiation Model
  Susan Nittrouer
19 Santa Claus, the Tooth Fairy, and Auditory‐Visual Integration: Three Phenomena in Search of Empirical Support
  Mitchell S. Sommers
20 Some Neuromyths and Challenging Questions about Cochlear Implants
  Cynthia R. Hunter and David B. Pisoni
21 Speech Perception Following Focal Brain Injury
  Emily B. Myers

Part V Theoretical Perspectives
22 Acoustic Cues to the Perception of Segmental Phonemes
  Lawrence J. Raphael
23 On the Relation between Speech Perception and Speech Production
  Jennifer S. Pardo and Robert E. Remez
24 Speech Perception and Reading Ability: What Has Been Learned from Studies of Categorical Perception, Nonword Repetition, and Speech in Noise?
  Susan Brady and Axelle Calcus
25 Cognitive Audiology: An Emerging Landscape in Speech Perception
  David B. Pisoni

Index

List of Contributors

Faisal M. Aljasser is an Associate Professor of Applied Linguistics in the Department of English Language and Translation, College of Arabic Language and Social Studies at Qassim University. He received his Ph.D. in Applied Linguistics from Newcastle University, UK, in 2008. His research centers on the production and perception of Arabic as a native language and as a second language.

Melissa Baese‐Berk is the David M. and Nancy L. Petrone Faculty Scholar and Associate Professor of Linguistics at the University of Oregon, where she directs the Speech Perception and Production Laboratory. She earned her PhD from Northwestern University in 2010. Her research focuses on speech perception and production, with special attention to speakers and listeners who do not share a native language with their interlocutor. Her work has been supported by grants from the National Science Foundation. Recent publications have appeared in Cognition, Journal of the Acoustical Society of America, Journal of Phonetics, and Attention, Perception, and Psychophysics.

Tessa Bent is Professor of Speech, Language and Hearing Sciences and Director of the Speech Perception Laboratory at Indiana University. She received her Ph.D. in Linguistics from Northwestern University in 2005. Her research focuses on children's and adults' perception and representation of variable speech signals, with a focus on regional dialects and non‐native accents. This research is currently supported by a grant from the National Science Foundation and has previously been supported by the National Institutes of Health. She is a Fellow of the Acoustical Society of America.

Sheila E. Blumstein is the Albert D. Mead Professor Emerita of Cognitive, Linguistic, and Psychological Sciences at Brown University. She received her Ph.D. in Linguistics from Harvard University in 1970. She spent her entire professional career from 1970 until 2018 at Brown University. Her research focuses on the neural basis of speech and language and the processes and mechanisms underlying speaking and understanding. She received a Claude Pepper Award from the National Institutes of Health and the Silver Medal in Speech Communication from the Acoustical Society of America, and was elected a Fellow of a number of professional societies.

Z. S. Bond, Professor Emerita, Ohio University, earned a Ph.D. in linguistics, with psychology and hearing and speech sciences as concentrations, from the Ohio State University. She has worked at the University of Alberta, Ohio University, Ohio State University, and the Aerospace Medical Research Laboratory at Wright‐Patterson Air Force Base. Her research areas include phonetics, psychology of language, speech perception, and language contact. Currently she is analyzing the pronunciation of Latvian in recordings from World War I. She has published papers in various journals including Journal of the Acoustical Society of America, Language and Speech, Perception and Psychophysics, and Journal of Phonetics. She is a member of the Acoustical Society of America, the Association for the Advancement of Baltic Studies, and the Linguistic Society of America, and a foreign member of the Latvian Academy of Science.

Ann R. Bradlow is the Abraham Harris Professor of Linguistics and Associate Dean for Graduate Studies at Northwestern University. She received her PhD in Linguistics from Cornell University in 1993, and then completed postdoctoral fellowships in Psychology at Indiana University (1993‐1996) and Hearing Science at Northwestern University (1996‐1998). Over the past three decades, Bradlow has pursued an interdisciplinary research program in acoustic phonetics and speech perception with a focus on speech intelligibility under conditions of talker‐, listener‐, and situation‐related variability. Her work has been supported by grants from the National Institutes of Health. Recent publications have appeared in the Journal of the Acoustical Society of America, Journal of the Association for Laboratory Phonology, International Journal of Audiology, Applied Psycholinguistics, Journal of Phonetics, Language & Speech, and Bilingualism, Language, & Cognition.

Susan Brady received her Ph.D. in Cognitive Psychology at the University of Connecticut in 1975 and presently is an Emerita Professor of Psychology at the University of Rhode Island. She has held additional positions at the University of Sussex, St. Andrews University, and Haskins Laboratories. Concentrating on topics in the field of reading, her research has focused primarily on the roles of speech perception and verbal working memory in individual differences in reading ability. Likewise, she has endeavored to translate the implications of the larger body of reading research for practice, and has conducted professional development projects for educators.

Axelle Calcus is an Assistant Professor of Psychology at Université Libre de Bruxelles (Brussels, Belgium). She received her PhD in Psychology in 2015, and worked as a postdoctoral researcher at Boston University (MA, United States), University College London (London, UK), and Ecole Normale Supérieure (Paris, France). Her main research interest focuses on the development of perception of speech in noise in children with and without hearing difficulties. Her work has been supported by awards from H2020 (European Commission). She is a member of the board of the Belgian association for audiology, and of the executive committee of the Belgian association for psychological sciences. Her recent work has been published in Developmental Science and eLife.

Cynthia G. Clopper is Professor of Linguistics at Ohio State University and a Fellow of the Acoustical Society of America. She received her Ph.D. in Linguistics and Cognitive Science from Indiana University and held post‐doctoral positions in Psychology at Indiana University and in Linguistics at Northwestern University before joining the faculty at Ohio State. Her major areas of expertise are phonetics, speech perception, sociophonetics, and laboratory phonology. Dr. Clopper's current research projects examine the effects of geographic mobility and linguistic experience on cross‐dialect lexical processing, the relationships between linguistic and indexical sources of variation in speech processing, and regional prosodic variation in American English.

Anja‐Xiaoxing Cui is a postdoctoral fellow at the University of British Columbia and visiting professor of systematic musicology at Osnabrück University. She studied psychology and piano performance before receiving her Ph.D. in Cognitive Neuroscience from Queen's University in 2019. Her research centers on auditory processing and the interactions of music and learning, and has been supported by NSERC and the Social Sciences and Humanities Research Council of Canada. Anja received additional support through the German Academic Exchange Service and the German Academic Scholarship Foundation.

Anne Cutler is Distinguished Professor at the MARCS Institute, Western Sydney University, Australia. She studied languages and psychology in Melbourne, Berlin, and Bonn, took a PhD in psycholinguistics at the University of Texas, held positions at MIT, Sussex University, and the MRC Applied Psychology Unit (Cambridge, UK), and then from 1993 to 2013 was Director and Comprehension Group head at the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands. Her research concerns listening to spoken language, and in particular how native language experience tailors speech decoding processes. She is an elected member of national academies in Europe, the US, and Australia.

Josh Dorsi is a Postdoctoral Scholar in the Neurology Department of the Penn State College of Medicine. He received his Ph.D. in Psychology from the University of California, Riverside in 2019. His research investigates the role of multisensory and lexical information in supporting speech perception, as well as the role of crossmodal correspondences in speech and language pathologies. Some recent publications of this work have appeared in Attention, Perception, & Psychophysics; The Quarterly Journal of Experimental Psychology; and the Journal of Cognitive Psychology.

Cynthia R. Hunter is Assistant Professor of Speech‐Language‐Hearing and Director of the Speech Perception, Cognition, and Hearing Laboratory at the University of Kansas. She earned her Ph.D. in Psychology from the State University of New York at Buffalo in 2016, and completed postdoctoral fellowships at Indiana University and the Indiana Clinical and Translational Sciences Institute. Her research centers on the neural and cognitive factors that allow individuals with and without hearing loss to understand speech in adverse listening conditions. Her recent work has appeared in Ear and Hearing, Journal of Speech, Language, and Hearing Research, Neuropsychologia, and Brain and Language.

Sara Incera is an Assistant Professor of Psychology and Director of the Multilingual Laboratory at Eastern Kentucky University. She received her Ph.D. in Psychology from Cleveland State University in 2016; Conor McLennan was her Ph.D. advisor. Her research interests include foreign accents, bilingualism, and language development across the lifespan. Her most recent work has focused on the relationships between language and emotion. Her articles have been published in Cognition & Emotion, Journal of Psycholinguistic Research, Mind & Language, Aging, Neuropsychology, & Cognition, International Journal of Bilingualism, Acta Psychologica, and Bilingualism: Language & Cognition.

Alexandra Jesse is an Associate Professor of Psychological and Brain Sciences and Director of the Language, Intersensory Perception, and Speech Laboratory at the University of Massachusetts Amherst. After receiving her Ph.D. in Psychology from the University of California Santa Cruz in 2005, she held a research position at the Max Planck Institute for Psycholinguistics until 2010. Her research focuses on speech perception, particularly on audiovisual speech and aging, and has been supported by the National Institutes of Health, the German Research Foundation, and the Netherlands Organization for Scientific Research. Some recent publications have appeared in Cognition, Journal of Experimental Psychology: Learning, Memory and Cognition, and Biological Psychology.

Keith Johnson is Professor of Linguistics and Chair of the Department of Linguistics at UC Berkeley. He received his PhD in Linguistics from Ohio State University in 1988 and held research positions in Psychology at Indiana University, in Linguistics at UCLA, and in Speech and Hearing Science at the University of Alabama, Birmingham, and academic positions at Ohio State University and Berkeley. His research is on perceptual processes involved in compensating for phonetic talker differences.

Steven Lamontagne is a PhD candidate at Queen's University and a visiting doctoral scholar at Harvard's McLean Hospital. In 2017, he received his MSc in Cognitive Neuroscience at Queen's University, where he studied interactions between reward and stress circuitry using animal models. His current research, which is supported by NSERC, centers on the neurophysiological correlates of reward learning and cognitive control in people with treatment‐resistant major depressive disorder. Some recent publications of his research have appeared in Psychopharmacology, Physiology & Behavior, and Behavioural Brain Research.

Susannah V. Levi is an Associate Professor of Communicative Sciences and Disorders and Director of the Acoustic Phonetics and Perception Lab at New York University. She received her Ph.D. in Linguistics from the University of Washington in 2004. She completed a postdoctoral fellowship in the Speech Research Lab at Indiana University with David Pisoni. Her research focuses on the relationship between linguistic and speaker information during spoken language processing. Her research has been supported by grants from the National Science Foundation and the National Institutes of Health. Some recent publications have appeared in the Journal of Speech, Language, and Hearing Research, Journal of the Acoustical Society of America, and Cognitive Science.

David Lutes received his M.Sc. in Cognitive Neuroscience at Queen's University in 2019, where he used virtual reality devices to study the impact that various image characteristics have on the brain's ability to effectively fuse separate images in binocular vision. To further his interest in the applications of virtual reality, David is continuing his education in video game development, as well as in public health and neuroscience.

Conor T. McLennan is a Professor, Chair of the Department of Psychology, and Director of the Language Research Laboratory at Cleveland State University. He received his Ph.D. from the University at Buffalo in 2003. His research interests include language perception, bilingualism, cognitive aging, and other topics in language, memory, and perception. His research has been supported by the National Science Foundation and the National Institutes of Health, and has been published in a variety of journals, including Aging, Neuropsychology, & Cognition, Attention, Perception, & Psychophysics, Cognition & Emotion, Journal of Experimental Psychology: Learning, Memory, & Cognition, and Language & Speech.

K. G. Munhall is a professor in the Department of Psychology at Queen's University. He received his Ph.D. in psychology from McGill University in 1984. His research focuses on sensorimotor processing in speech production, audiovisual speech perception, and perceptual and cognitive factors in conversational interaction. His work has been supported by grants from the National Institute on Deafness and Other Communication Disorders, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Canadian Institutes of Health Research (CIHR). Some recent publications of his work have appeared in Journal of the Acoustical Society of America, Experimental Brain Research, Multisensory Research, and Attention, Perception, & Psychophysics.

Emily B. Myers is an Associate Professor of Speech, Language, and Hearing Sciences and Psychological Sciences at the University of Connecticut. She received her PhD from Brown University in 2005. Her work focuses on the processes that allow a listener to map the speech signal to meaning, how these processes are instantiated in the brain, and how the system breaks down in cases of language disorder. Her work has been funded by the National Institutes of Health and the National Science Foundation.

Susan Nittrouer received her PhD from the City University of New York in Speech and Hearing Science. After a post‐doctoral fellowship at Haskins Laboratories she worked at Boys Town National Research Hospital, Utah State University, and the Ohio State University. Currently she is Professor of Speech, Language, and Hearing Sciences at the University of Florida. Her research focuses on the intersection between auditory and language development, and on the challenges encountered by children with risk factors for developmental language delays, including hearing loss, poverty, or conditions leading to dyslexia. Susan's goal is to develop more effective interventions for these children.

Lynne C. Nygaard is Professor of Psychology and Director of the Center for Mind, Brain, and Culture, and the Speech and Language Communication Laboratory at Emory University, USA. Her research on the perceptual, cognitive, biological, and social underpinnings of human spoken communication has appeared in many journals, including Psychological Science, Brain and Language, and Cognitive Science.

Ellen O'Donoghue is a Ph.D. Candidate at The University of Iowa, in the Department of Psychological and Brain Sciences. She received her M.Sc. in Cognitive Psychology from Queen's University in 2018. Her research concerns the fundamental mechanisms that support learning and categorization across species, with particular emphasis on humans and pigeons.

Jennifer S. Pardo is Professor of Psychology and Director of the Speech Communication Laboratory at Montclair State University. She received her Ph.D. in Cognitive Psychology from Yale University in 2000, and has held academic positions at Barnard College, Wesleyan University, and The New School for Social Research. Her research centers on the production and perception of spoken language in conversational interaction and on understanding variation and convergence in phonetic form, and has been supported by grants from the National Science Foundation and the National Institutes of Health. Some recent publications of this work have appeared in Journal of Memory & Language, Journal of Phonetics, Language & Speech, and Attention, Perception, & Psychophysics.

Oiwi Parker Jones is a Hugh Price Fellow at Jesus College, University of Oxford. He did his doctoral research in Oxford on NLP with a focus on the application of machine learning to endangered languages. From there he trained as an imaging and computational neuroscientist at University College London and Oxford. His primary interest is in the development of a neural speech prosthetic. This includes basic research on speech and language in the brain, including work on clinical populations. His papers have been published in journals like Science and Brain and at machine learning conferences like NeurIPS, ICLR, and ICML.

David B. Pisoni is Distinguished Professor of Psychological and Brain Sciences and Chancellor's Professor of Cognitive Science at Indiana University, Bloomington, USA, and Professor in the Department of Otolaryngology at Indiana University School of Medicine, Indianapolis, USA. He has made significant contributions in basic, applied, and clinical research in areas of speech perception, production, synthesis, and spoken language processing.

Lawrence J. Raphael is Professor Emeritus of both the Graduate School of CUNY and Adelphi University. He was a research associate at Haskins Laboratories for 26 years. His research interests include speech perception, speech acoustics, and the physiology of the speech mechanism. His research has been published in a variety of scholarly journals. He is a co‐author of Speech Science Primer, 6th edition, and co‐editor of The Biographical Dictionary of the Phonetic Sciences, Language and Cognition, and Producing Speech. Professor Raphael is a Fellow of the New York Academy of Sciences.

Robert E. Remez is Professor of Psychology at Barnard College, Columbia University, USA, and Chair of the Columbia University Seminar on Language and Cognition. His research has been published in many scientific and technical journals, including American Psychologist, Developmental Psychology, Ear and Hearing, Experimental Aging Research, Journal of Cognitive Neuroscience, and Journal of Experimental Psychology.

Lawrence D. Rosenblum is a Professor of Psychology at the University of California, Riverside. He studies multisensory speech and talker perception as well as ecological acoustics. His research has been supported by grants from the National Science Foundation, National Institutes of Health, and the National Federation of the Blind. He is the author of numerous publications including the book See What I'm Saying: The Extraordinary Powers of our Five Senses. His research has been featured in Scientific American, The New York Times, and The Economist.

Jan W. H. Schnupp is a sensory neuroscientist with a long‐standing interest in the processing of auditory information by the central nervous system. He received his DPhil from the University of Oxford in 1996, and he held visiting and faculty positions at the University of Wisconsin, the Italian Institute of Technology, and the University of Oxford before taking up a professorship at the City University of Hong Kong. His research interests range widely, from central representations of auditory space to pitch and timbre, temporal predictive coding, and auditory pattern learning. His work has been funded by the Wellcome Trust, BBSRC, MRC, and the UGC and HMRF of Hong Kong. He has published over 80 papers in numerous neuroscience and general science journals and he coauthored the textbook "Auditory Neuroscience".

Diana Van Lancker Sidtis (formerly Van Lancker) is Professor Emeritus of Communicative Sciences and Disorders at New York University, where she served as Chair from 1999‐2002; Associate Director of the Brain and Behavior Laboratory at the Nathan Kline Institute, Orangeburg, NY; and a certified and licensed speech‐language pathologist (from Cal State LA). Her education includes an MA from the University of Chicago, a PhD from Brown University, and an NIH Postdoctoral Fellowship at Northwestern University. Dr. Sidtis has continued to mentor students and perform research in speech science, voice studies, and neurolinguistics. She is author of over 100 scientific papers and review chapters, and coauthor, with Jody Kreiman, of Foundations of Voice Studies, Wiley‐Blackwell. Her second book, Foundations of Familiar Language, is scheduled to appear in 2021.

Matthias J. Sjerps received his Ph.D. in Cognitive Psychology at the Max Planck Institute for Psycholinguistics in Nijmegen, The Netherlands. He has held post‐doc positions at the Max Planck Institute, the Radboud University of Nijmegen, and at the University of California at Berkeley. His main research line has been centered on the perception of speech sounds, with a specific focus on how listeners resolve variability in speech sounds. His work has been supported by grants from the European Commission (Marie Curie grant) and the Max Planck Gesellschaft. Some recent publications of this work have appeared in Nature Communications, Journal of Phonetics, and Journal of Experimental Psychology: Human Perception and Performance. Since 2019 he has been working as a researcher for the Dutch Inspectorate of Education, focusing on methods of risk assessment of schools and school boards.

Rajka Smiljanic is Professor of Linguistics and Director of UT Sound Lab at the University of Texas at Austin. She received her Ph.D. from the Linguistics Department at the University of Illinois Urbana‐Champaign, after which she worked as a Research Associate in the Linguistics Department at Northwestern University. Her work is concentrated in the areas of experimental phonetics, cross‐language and second language speech production and perception, clear speech, and intelligibility variation. Her recent work appeared in the Journal of the Acoustical Society of America, Journal of Speech, Language, and Hearing Research, and Journal of Phonetics. She was elected Fellow of the Acoustical Society of America in 2018 and is currently serving as Chair of the Speech Communication Technical Committee.

Mitchell S. Sommers is Professor of Psychological and Brain Sciences at Washington University in St. Louis. He received his PhD in Psychology from the University of Michigan and worked as a postdoctoral Fellow at Indiana University. His work focuses on changes in hearing and speech perception in older adults and individuals with dementia of the Alzheimer's type. His work has been published in Ear & Hearing, Journal of the Acoustical Society of America, and Journal of Memory and Language, among others. He received a career development award from the Brookdale Foundation and his work has been supported by NIH, NSF, and the Pfeifer Foundation.

Christina Y. Tzeng is Assistant Professor of Psychology at San José State University. She received her Ph.D. in Psychology from Emory University in 2016. Her research explores the cognitive mechanisms that underlie perceptual learning of variation in spoken language and has been supported by the American Psychological Association. She has published her research in journals such as Cognitive Science, Journal of Experimental Psychology: Human Perception and Performance, and Psychonomic Bulletin and Review.

Michael S. Vitevitch is Professor of Psychology and Director of the Spoken Language Laboratory at the University of Kansas. He received his Ph.D. in Cognitive Psychology from the University at Buffalo in 1997, and was an NIH post‐doctoral trainee at Indiana University before taking an academic position at the University of Kansas in 2001. His research uses speech errors, auditory illusions, and the mathematical tools of network science to examine the processes and representations that are involved in the perception and production of spoken language. His work has been supported by grants from the National Institutes of Health, and has been published in psychology journals such as Journal of Experimental Psychology: General, Cognitive Psychology, and Psychological Science, as well as journals in other disciplines such as Journal of Speech, Language, and Hearing Research, and Entropy.

Seung Yun Yang, Ph.D., CCC‐SLP, is an Assistant Professor in the Department of Communication, Arts, Sciences, and Disorders. She is also a member of the Brain and Behavior Laboratory at the Nathan Kline Institute for Psychiatric Research in Orangeburg, New York. She received her doctorate from the Department of Communicative Sciences and Disorders at New York University. Her research primarily focuses on understanding the neural bases of nonliteral language and on understanding how prosody is conveyed and understood in the context of spoken language. Her research has been published in peer‐reviewed journals such as Journal of American Speech, Language, and Hearing Research and Clinical Linguistics & Phonetics.

Romi Zäske is a researcher at the University Hospital Jena and the Friedrich Schiller University of Jena, Germany. She received her Ph.D. from the Friedrich Schiller University of Jena in 2010, and has conducted research projects at Glasgow University, UK, and at the University of Bern, Switzerland. Her research centers on the cognitive and neuronal mechanisms subserving human voice perception and memory, including individual differences, and has been supported by grants of the Deutsche Forschungsgemeinschaft (DFG). Some recent publications of her work have appeared in Royal Society Open Science, Behavior Research Methods, Attention, Perception, & Psychophysics, Cortex, and Journal of Neuroscience.

Foreword to the Second Edition

Two remarkable developments have taken hold since the publication of the first edition of The Handbook of Speech Perception in 2006. The first is directly connected to the study of speech perception and stands as a testament to the maturity and vitality of this relatively new field of research. The second, though removed from the study of speech perception, provides a timely pointer to the central theme of this book. Both of these developments are so overbearing that they simply cannot go without notice as I write this preface in the last quarter of 2020. They also help us to see how the complex landscape of speech perception research intersects with some of the most challenging and exciting scientific frontiers of our time.

The first of these developments is the appearance of virtual assistants such as Apple's Siri, Amazon's Alexa, and Microsoft's Cortana. While it is a well‐worn cliché to mark time by technological developments, the rapid adoption of these speech technologies over the past decade is hard to ignore when thinking about speech communication. The domain of speech perception now includes both humans and machines as both talkers and listeners. What exactly does machine speech recognition have to do with the body of research presented in the chapters of this handbook, all of which address human speech perception? These speaker‐hearer machines certainly do not perceive speech in a human‐like way; Siri, Alexa, and Cortana do not sense speech as do human ears and eyes, their machine learning algorithms do not result in neurocognitive representations of linguistic properties, and they are not participants in the relationships and social meanings encoded in the indexical properties of speech.

In his preface to the first edition of this handbook, Michael Studdert‐Kennedy noted that “alphabetic writing and reading have no independent biological base; they are, at least in origin, parasitic on spoken language.” Studdert‐Kennedy went on to suggest that, “speech production and perception, writing and reading, form an intricate biocultural nexus” (my italics). With the invention of virtual assistants, spoken language once again participates in a symbiotic relationship with a new medium of verbal communication. Within this complex and evolving ecology of spoken–written–digital language, the study of human speech perception continues to reveal, in increasing detail, the contours of this biocultural nexus. Immersion into this field of inquiry, made so accessible by the carefully selected and recently updated collection of chapters in this handbook, is so stimulating precisely because it illuminates the milestones that mark the path to, through, and beyond this nexus.

The second development that cannot go without mention as this updated handbook goes to the printing press in 2020 is the startling spread of the Covid‐19 virus through human communities across the globe. Speech sounds, words, ideas, and (unfortunately) viruses are all transmitted from person to person through the air that we breathe during social interactions. With the covering of visible speech gestures by virus‐blocking masks, the awkward turn‐taking of video conferencing tools with single‐track audio channels, and the social distancing that protects us from the Covid‐19 virus, this pandemic is a constant reminder of the multimodal nature of speech perception and of the centrality of in‐person social interaction for seamless speech communication.

The arrangement of the first three major sections of this handbook – I: Sensing Speech, II: Perception of Linguistic Properties, and III: Perception of Indexical Properties – provides the scaffold for an understanding of speech perception as far more than perception of a particular auditory signal. Instead, the chapters in these sections, along with the applications and theories covered in the remaining two sections – IV: Speech Perception by Special Listeners, and V: Theoretical Perspectives – develop the overarching argument that the observation, measurement, and modeling of speech perception must be conducted from a vantage point that encompasses its broad cognitive and social context. This central point is brought home in the final chapter by David Pisoni, one of the founders of the field and editors of this handbook: “. . . hearing and speech perception do not function as independent autonomous streams of information or discrete processing operations that take place in isolation from the structure and functioning of the whole information‐processing system. While it is clear that the early stages of speech recognition in listeners with normal hearing are heavily dependent on the initial encoding and registration of highly detailed sensory information, audibility and the sensory processing of speech is only half of the story”.

The chapters in this handbook provide a superbly sign‐posted map of the full story.

Any compendium of knowledge on a particular topic represents a body of knowledge that developed in a specific time and place. The contributors to this handbook cover several generations of researchers spread over many academic disciplines working primarily on both sides of the North Atlantic Ocean. Yet, the scientific study of speech perception as presented in this outstanding handbook is still relatively young and localized. Perhaps one of the lasting lessons of the current pandemic is that we are all even more connected than we thought. New ideas and new ways of knowing can circulate as extensively, though maybe not quite as quickly, as a virus. This bodes well for the future of speech perception research.

Ann R. Bradlow
Northwestern University

Foreword to the First Edition

Historically, the study of audition has lagged behind the study of vision, partly, no doubt, because seeing is our first sense, hearing our second. But beyond this, and perhaps more importantly, instruments for acoustic control and analysis demand a more advanced technology than their optic counterparts: having a sustained natural source of light, but not of sound, we had lenses and prisms long before we had sound generators and oscilloscopes. For speech, moreover, early work revealed that its key perceptual dimensions are not those of the waveform as it impinges on the ear (amplitude, time), but those of its time‐varying Fourier transform, as it might appear at the output of the cochlea (frequency, amplitude, time). So it was only with the invention of instruments for analysis and synthesis of running speech that the systematic study of speech perception could begin: the sound spectrograph of R. K. Potter and his colleagues at Bell Telephone Laboratories in New Jersey during World War II, the Pattern Playback of Franklin Cooper at Haskins Laboratories in New York, a few years later. With these devices and their successors, speech research could finally address the first task of all perceptual study: definition of the stimulus, that is, of the physical conditions under which perception occurs.

Yet, a reader unfamiliar with the byways of modern cognitive psychology who chances on this volume may be surprised that speech perception, as a distinct field of study, even exists. Is the topic not subsumed under general auditory perception? Is speech not one of many complex acoustic signals to which we are exposed, and do we not, after all, simply hear it? It is, of course, and we do. But due partly to the peculiar structure of the speech signal and the way it is produced, partly to the peculiar equivalence relation between speaker and hearer, we also do very much more.

To get a sense of how odd speech is, consider writing and reading. Speech is unique among systems of animal communication in being amenable to transduction into an alternative perceptuomotor modality. The more or less continuously varying acoustic signal of an utterance in any spoken language can be transcribed as a visual string of discrete alphabetic symbols, and can then be reproduced from that string by a reader. How we effect the transforms from analog signal to discrete message, and back again, and the nature of the percept that mediates these transforms are central problems of speech research. Notice that without the alphabet as a means of notation, linguistics itself, as a field of study, would not exist. But the alphabet is not merely a convenient means of representing language; it is also the primary objective evidence for our intuition that we speak (and language achieves its productivity) by combining a few dozen discrete phonetic elements to form an infinite variety of words and sentences. Thus, the alphabet, recent though it is in human history, is not a secondary, purely cultural aspect of language. The inventors of the alphabet brought into consciousness previously unexploited segmental properties of speech and language, much as, say, the inventors of the bicycle discovered previously unexploited cyclic properties of human locomotion. The biological nature and evolutionary origins of the discrete phonetic categories represented by the alphabet are among many questions on which the study of speech perception may throw light.

To perceive speech is not merely to recognize the holistic auditory patterns of isolated words or phrases, as a bonobo or some other clever animal might do; it is to parse words from a spoken stream, and segments from a spoken word, at a rate of several scores of words per minute. Notice that this is not a matter of picking up information about an objective environment, about banging doors, passing cars, or even crying infants; it is a matter of hearers recognizing sound patterns coded by a conspecific speaker into an acoustic signal according to the rules of a natural language. Speech perception, unlike general auditory perception, is intrinsically and ineradicably intersubjective, mediated by the shared code of speaker and hearer.

Curiously, however, the discrete linguistic events that we hear (segments, syllables, words) cannot be reliably traced in either an oscillogram or a spectrogram. In a general way, their absence has been understood for many years as due to their manner of production: extensive temporal and spectral overlap, even across word boundaries, among the gestures that form neighboring phonetic segments. Yet, how a hearer separates the more or less continuous flow into discrete elements is still far from understood. The lack of an adequate perceptual model of the process may be one reason why automatic speech recognition, despite half a century of research, is still well below human levels of performance. The ear's natural ease with the dynamic spectro‐temporal patterns of speech contrasts with the eye's difficulties: oscillograms are impossible, spectrograms formidably hard, to read – unless one already knows what they say. On the other hand, the eye's ease with the static linear string of alphabetic symbols contrasts with the ear's difficulties: the ear has limited powers of temporal resolution, and no one has ever devised an acoustic alphabet more efficient than Morse code, for which professional rates of perception are less than a tenth of either normal speech or normal reading. Thus, properties of speech that lend themselves to hearing (exactly what they are, we still do not know) are obstacles to the eye, while properties of writing that lend themselves to sight are obstacles to the ear. Beyond the immediate sensory qualities of speech, a transcript omits much else that is essential to the full message.
Most obvious is prosody, the systematic variations in pitch, loudness, duration, tempo, and rhythm across words, phrases, and sentences that convey a speaker's intentions, attitudes, and feelings. What a transcript leaves out, readers put back in, as best they can. Some readers are so good at this that they become professional actors. Certain prosodic qualities may be peculiar to a speaker's dialect or idiolect, of which the peculiar segmental properties are also omitted from a standard transcript. What role, if any, these and other indexical properties (specifying a speaker's sex, age, social status, person, and so on) may play in the perception of linguistic structure remains to be seen. I note only that, despite their unbounded diversity within a given language, all dialects and idiolects converge on a single phonology and writing system. Moreover, and remarkably, all normal speakers of a language can, in principle if not in fact, understand language through the artificial medium of print as quickly and efficiently as through the natural medium of speech.

Alphabetic writing and reading have no independent biological base; they are, at least in origin, parasitic on spoken language. I have dwelt on them here because the human capacity for literacy throws the biological oddity of speech into relief. Speech production and perception, writing and reading, form an intricate biocultural nexus at the heart of modern western culture. Thanks to over 50 years of research, superbly reviewed in all its diversity in this substantial handbook, speech perception offers the student and researcher a ready path into this nexus.

Michael Studdert‐Kennedy
Haskins Laboratories
New Haven, Connecticut

Preface

The Second Edition of the Handbook of Speech Perception presents a collection of essays on the research and theory that have guided our understanding of human speech perception. From their origins in psychoacoustic assessment of phonetics for telecommunication systems, the concerns of research have broadened with the growth of cognitive science and neuroscience. Now truly interdisciplinary in span, studies of speech perception include basic research on the perception of linguistic form while encompassing investigations of multisensory speech perception, speech perception with sensory prostheses, speech perception across the life span, speech perception in neuropathological disorders, as well as the study of the interchange of linguistic, paralinguistic, and indexical attributes of speech. Empirical practice has often turned to speech as a way to assess the potential of a new idea, making speech perception an intellectual crossroad for the subfields that compose contemporary behavioral neuroscience. This intellectual and scientific convergence is also reflected in the topics, large and small, that are represented here. The Second Edition, specifically, showcases new concerns, presents new understanding of lines of classic investigation, and offers a critical assay of technical and theoretical developments across the field of research.

Editors face many decisions in composing a handbook, one that can be useful for student and researcher alike. Early in our discussions, we understood that we would not be creating a comprehensive review of method and theory in research on speech perception. For one reason, technical methods and technical problems evolve rapidly as researchers explore one or another opportunity. For another, the Annual Reviews already exist and can satisfactorily offer a snapshot of a field at a particular instant. Aiming higher, we asked each of the contributors to articulate a point of view to introduce the reader to the major issues and findings in the field. The result is a broad‐ranging and authoritative collection of essays offering perspectives on exactly the critical questions that are likely to move a rapidly changing field of research.

The twenty‐five chapters are organized into five sections. Each essay provides an informed and critical exposition of a topic central to understanding, including: (1) a synthesis of current research and debate; (2) a narrative comprising clear examples and findings from the research literature and the author's own research program; and (3) a forward look toward anticipated developments in the field.

In Part I, Sensing Speech, four chapters cover a wide range of foundational issues in the field. Robert Remez discusses the perceptual organization of speech and how it differs from other auditory signals; Lawrence Rosenblum and Josh Dorsi present an argument and evidence for the primacy of multimodal speech perception; Jan Schnupp and Oiwi Parker Jones describe the representation of speech in the brain; and Kevin Munhall and colleagues explain the role of perception in controlling speech production.

In Part II, Perception of Linguistic Properties, eight chapters survey major topics in human speech perception. Sheila Blumstein describes the role of linguistic features in speech perception and lexical access; Keith Johnson and Matthias Sjerps discuss perceptual accommodation of differences between individual talkers; Rajka Smiljanic examines the differences between casual and clear speech; Conor McLennan and Sara Incera Burkert present a critical appraisal of specificity effects in spoken word identification; Anne Cutler and Alexandra Jesse discuss the role of lexical stress in the perception of spoken words; Zinny Bond considers speech misperception in an essay on slips of the ear; Michael Vitevitch and Faisal Aljasser assess the contribution of phonotactic knowledge to speech perception; and Diana Van Lancker Sidtis and Seung Yun Yang discuss the implications of the use of formulaic speech.

Part III is devoted to the Perception of Indexical Properties, those aspects of the speech of individual talkers that make them identifiable. The five chapters in this section include a discussion of the perception of dialectal variation, by Cynthia Clopper; the resolution of the spoken signals of individual identity, by Diana Van Lancker Sidtis and Romi Zäske; the integration of linguistic and non‐linguistic properties of speech, by Lynne Nygaard and Christina Tzeng; an essay on perceptual learning of accented speech, by Tessa Bent and Melissa Baese‐Berk; and an appraisal of the ability of children to notice indexical properties of speech, by Susannah Levi.

In Part IV, the handbook considers Speech Perception by Special Listeners. Susan Nittrouer examines speech perception by children; Mitchell Sommers describes accounts of audiovisual speech perception in older adults; Cynthia Hunter and David Pisoni consider speech perception in prelingually deaf children when a cochlear implant is used; and Emily Myers examines the perception of speech following focal brain injury.

Part V includes four essays each offering a Theoretical Perspective on a new or classic concern of the field. Lawrence Raphael provides a detailed retrospective on the acoustic cues to segmental phonetic perception; Jennifer Pardo and Robert Remez offer a review of the influential idea that perception of speech relies on the dynamics of the production of speech; Susan Brady and Axelle Calcus consider the relation between reading and speech perception; and David Pisoni provides a review of the emerging field of cognitive audiology.

The scope of the topics encompassed in the Handbook of Speech Perception reflects the wide‐ranging research community that studies speech perception. This includes neighboring fields: audiology, speech and hearing sciences, behavioral neuroscience, cognitive science, computer science and electrical engineering, linguistics, physiology and biophysics, otology, and experimental psychology. The chapters are accessible to nonspecialists while also engaging to specialists. While the Handbook of Speech Perception takes a place among the many excellent companion volumes in the Wiley Blackwell series on language and linguistics, the collection is unique in its emphasis on the specific concerns of the perception of spoken language. If the advent of a handbook can be viewed as a sign of growth and maturity of a discipline, the appearance of this Second Edition is evidence of the longevity of research interest in spoken language. This new edition of the Handbook of Speech Perception brings the diverse field together for the researcher who, while focusing on a specific aspect of speech perception, might desire a clearer understanding of the aims, methods, and prospects for advances across the field. In addition to the critical survey of developments across a wide range of research on human speech perception, we also anticipate the Handbook facilitating the development of multidisciplinary research on speech perception.

We cannot conclude without acknowledging the many individuals on whose creativity, knowledge, and cooperation this endeavor depended, namely, the authors whose essays compose the Handbook of Speech Perception. A venture of this scope cannot succeed without the conscientious care of a publisher to protect the project, and we have received the benefit of this attention from many people at Wiley, originating with Tanya McMullin who was instrumental at the start of the project, Angela Cohen, Rachel Greenberg, and Clelia Petracca.

With our sincere thanks,

Jennifer S. Pardo, Bedford, New York
Lynne C. Nygaard, Atlanta, Georgia
Robert E. Remez, New York, New York
David B. Pisoni, Bloomington, Indiana

Part I

Sensing Speech

1  Perceptual Organization of Speech
ROBERT E. REMEZ
Barnard College, Columbia University, United States

How does a perceiver resolve the linguistic properties of an utterance? This question has motivated many investigations within the study of speech perception and a great variety of explanations. In a retrospective summary over 30 years ago, Klatt (1989) reviewed a large sample of theoretical descriptions of the perceiver’s ability to project the sensory effects of speech, exhibiting inexhaustible variety, into a finite and small number of linguistically defined attributes, whether features, phones, phonemes, syllables, or words. While he noted many distinctions between the accounts, with few exceptions they exhibited a common feature. Each presumed that perception begins with a speech signal, well composed and fit to analyze. This common premise shared by otherwise divergent explanations of perception obliges the models to admit severe and unintended constraints on their applicability. To exist within the limits set by this simplifying assumption, the models apply implicitly to a world in which speech is the only sound; moreover, only a single talker ever speaks at once. Although this designation is easily met in laboratory samples, it is safe to say that it is rare in vivo. Moreover, in their exclusive devotion to the perception of speech the models are tacitly modular (Fodor, 1983), even those that deny it. Despite the consequences of this dedication of perceptual models to speech and speech alone, there has been a plausible and convenient way to persist in invoking the simplifying assumption. This fundamental premise survives intact if a preliminary process of perceptual organization finds a speech signal, follows its patterned variation amid the effects of other sound sources, and delivers it whole and ready to analyze for linguistic properties. The indifference to the conditions imposed by the common perspective reflects an apparent consensus at the time that the perceptual organization of speech is simple, automatic, and accomplished by generic means. However, despite the rapidly established perceptual coherence of the


constituents of a speech signal, the perceptual organization of speech cannot be reduced to the available and well‐established principles of auditory perceptual organization.

Perceptual organization and the gestalt legacy A generic auditory model of organization The dominant contemporary account of auditory perceptual organization has been auditory scene analysis (Bregman,  1990). This theory of the resolution of auditory sensation into streams, each issuing from a distinct source, developed empirically in the cognitive era, though its intellectual roots run deep. The gestalt psychologist Wertheimer (1923/1938) established the basic premises of the account in a legendary article, the contents of which are roughly known to all students of introductory psychology. In visible and audible examples, Wertheimer described the coalescence of elementary figures into groups and contours, arguing that sensory experience is organized in patterns, and is not registered as a mere spatter of individual receptor states. By considering a series of hypothetical cases, and without knowing the sensory physiology that would not be described for decades (Mountcastle, 1998), he justified organizing principles of similarity, proximity, closure, symmetry, common fate, continuity, set, and habit. Hindsight suggests that Wertheimer framed the problem astutely, or so it now seems given our contemporary understanding of the functions of the sensory periphery that integrate the action of visual and auditory receptors (Hochberg, 1974). Setting the indefinitely elastic principle of habit aside, the simple gestalt‐derived criteria of grouping are arguably reducible to two functions: (1) to compose an inventory of sensory elements; and (2) to create contours or groups on the principle that like binds to like. Whether groups occur due to the spectral composition of auditory elements, their common onset or offset, proximity in frequency, symmetry of rate of change in an auditory dimension, harmonic relationship, the interpolation of brief gaps, and so on, each is readily understood as a case in which similarity between a set of auditory sensory elements promotes grouping automatically. A group composed according to these functions forms a sensory contour or perceptual stream. It is a small but necessary extrapolation to assert that an auditory contour consists of elements originating from a single source of sound, and therefore that perceptual organization parses sensory experience into concurrent streams, each issuing from a different sound‐producing event (Bregman & Pinker, 1978). In a series of ongoing experiments, researchers adopted Wertheimer’s auditory conjectures, and calibrated the resolution of auditory streams by virtue of the historic principles and their derived corollaries. For example, Bregman and Campbell (1971) reported that auditory streams formed when a sequence of 100  ms tones differing in frequency was presented to listeners. According to a procedure that has become standard, the series of brief tones was presented repetitively to

Perceptual Organization of Speech  5 listeners, who were asked to report the order of tones in the series. Instead of hearing a sequence of high and low pitches, though, listeners grouped tones into two streams each composed of similar elements, one of high pitch and the other of low pitch (see Figure 1.1). Critically, the perception of the order of elements was veridical within streams, but perception of the intercalation order across the streams was erroneous. In another example, Bregman, Ahad, and Van Loon (2001) reported that a sequence of 65  ms bursts of band‐limited noise were grouped together or split into separate perceptual streams as a function of the similarity in center frequency of the noise bursts. A sizable literature of empirical tests of this kind spans 50 years, and calibrates the sensory conditions of grouping by one or another variant of similarity. A compilation of the literature is offered by Bregman (1990), and the theoretical yield of this research is summarized by Darwin (2008). Typically, studies of auditory‐perceptual organization have reported that listeners are sensitive to quite subtle properties in the formation of auditory groups. It is useful to consider an exemplary case, for the detailed findings of auditory amalgamation and segregation define the characteristics of the model and ultimately determine its applicability to speech. In a study of concurrent grouping of harmonically related tones by virtue of coincident onset, a variant of similarity in a temporal dimension, Dannenbring and Bregman (1978) reported that synchronized tones were grouped together, but a discrepancy as brief as 35 ms in lead or lag in one component was sufficient to disrupt coherence with other sensory constituents, and to split it into a separate stream. There are many similar cases documenting the exquisite sensitivity of the auditory sensory channel in segregating streams on the basis of slight departures from similarity: in frequency (Bregman & Campbell, 1971), in frequency change (Bregman & Doehring, 1984), in fundamental frequency (Steiger & Bregman, 1982), in common modulation (Bregman

[Figure 1.1 plots frequency against time: a repeating sequence of 100 ms tones, three high (2500, 2000, and 1600 Hz) and three low (550, 430, and 350 Hz).]

Figure 1.1  This sequence of tones presented to listeners by Bregman and Campbell (1971) was reported as two segregated streams, one of high and another of low tones. Critically, the intercalation of the high and low streams (that is, the sequence: high, high, low, low, high, low) was poorly resolved. Source: Based on Bregman & Campbell, 1971.

6  Sensing Speech et al., 1985), in spectrum (Dannenbring & Bregman, 1976; Warren et al., 1969), due to brief interruptions (Miller & Licklider, 1950), in common onset/offset (Bregman & Pinker, 1978), in frequency continuity (Bregman & Dannenbring, 1973, 1977), and in melody and meter (Jones & Boltz,  1989); these are reviewed by Bregman (1990), Remez et al. (1994), and Remez & Thomas (2013).
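To make the streaming paradigm concrete, the following sketch (in Python, chosen here only for illustration) generates a repeating sequence of 100 ms tones in the order depicted in Figure 1.1. It is a minimal illustration rather than a reconstruction of the published materials: the tone frequencies and their order follow the figure, but the sample rate, onset and offset ramps, and number of repetitions are assumptions.

# Minimal sketch of a Bregman and Campbell (1971)-style tone sequence.
# Only the frequencies and their order follow Figure 1.1; sample rate,
# ramps, and repetition count are assumed for the purpose of illustration.
import numpy as np
import wave

SAMPLE_RATE = 44100                               # Hz (assumed)
TONE_DUR = 0.100                                  # 100 ms per tone
SEQUENCE = [2500, 2000, 550, 430, 1600, 350]      # Hz: high, high, low, low, high, low

def tone(freq, dur=TONE_DUR, sr=SAMPLE_RATE, ramp=0.005):
    """A pure tone with brief onset/offset ramps to avoid clicks."""
    t = np.arange(int(dur * sr)) / sr
    y = np.sin(2 * np.pi * freq * t)
    n = int(ramp * sr)
    env = np.ones_like(y)
    env[:n] = np.linspace(0.0, 1.0, n)
    env[-n:] = np.linspace(1.0, 0.0, n)
    return y * env

def repeating_sequence(repetitions=10):
    """Concatenate one cycle of the six tones and repeat it."""
    cycle = np.concatenate([tone(f) for f in SEQUENCE])
    return np.tile(cycle, repetitions)

def write_wav(path, signal, sr=SAMPLE_RATE):
    pcm = np.int16(signal / np.max(np.abs(signal)) * 32767)
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes(pcm.tobytes())

if __name__ == "__main__":
    write_wav("streaming_demo.wav", repeating_sequence())

Presented at this rate, such a sequence tends to be heard as two concurrent streams of pitches rather than as a single intercalated melody, which is the phenomenon the studies reviewed above calibrate.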

Gestalt principles of organization applied to speech. Because explanations of speech perception have depended on an unspecified account of perceptual organization, it has been natural for auditory scene analysis to be a theory of first resort for understanding the perceptual solution to the cocktail party problem (Cherry, 1953; McDermott, 2009), specifically, of attending to a single stream of speech amid other sound sources. However, this premise was largely unsupported by direct evidence. The crucial empirical cases that had formed the model had rarely included natural sources of sound – neither instruments of the orchestra (though see Iverson, 1995), which are well modeled physically (Rossing,  1990), nor ordinary mechanical sources (Gaver,  1993), nor the sounds of speech, with several provocative exceptions. It is instructive to consider some of the cases in which tests of perceptual organization using speech sounds appeared to confirm the applicability to speech of the general auditory account of perceptual organization. In one case establishing grouping by similarity, a repeating series of syllables of the form CV‐V‐CV‐V was observed to split into distinct streams of like syllables, one of CVs and another of Vs, much as gestalt principles propose (Lackner & Goldstein, 1974). Critically, this perceptual organization precluded the perceptual resolution of the relative order of the syllables across stream, analogous to the index of grouping used by Bregman & Campbell (1971). In another case calibrating grouping by continuity, a series of vowels formed a single perceptual stream only when formant frequency transitions leading into and out of the vowel nuclei were present (Dorman, Cutting, & Raphael, 1975). Without smooth transitions, the spectral discontinuity at the juncture between successive steady‐state vowels exceeded the tolerance for grouping by closure  –  that is, the interpolation of gaps – and the perceptual coherence of the vowel series was lost. In another case examining organization by the common fate, or similarity in change of a set of elements, a harmonic component of a steady‐state vowel close to the center frequency of a formant was advanced or delayed in onset relative to the rest of the harmonics composing the synthetic vowel (Darwin & Sutherland, 1984). At a lead or lag of 32 ms, consistent with findings deriving from arbitrary patterns, the desynchronized harmonic segregated into a different stream than the synchronous harmonics composing the vowel. In consequence, when the leading or lagging harmonic split, the phonemic height of the vowel was perceived to be different, as if the perceptual estimate of the center frequency of the first formant had depended on the grouping. In each of these instances, the findings with speech sounds were well explained by the precedents of prior tests using arbitrary patterns of sound created with oscillators and noise generators.

Perceptual Organization of Speech  7 These outcomes should have seemed too good to be true. It was as if an account defined largely through tests of ideal notions of similarity in simple auditory sequences proved to be adequate to accommodate the diverse acoustic constituents and spectral patterns of natural sound. With hindsight, we can see that accepting this conclusion requires two credulous assumptions. One assumption is that tests using arbitrary trains of identical reduplicated syllables, meticulously phased harmonic components, and sustained steady‐state vowels adequately express the ordinary complexity of speech and the perceiver’s ordinary sensitivity. A sufficient test of organization by the generic principles of auditory scene analysis is more properly obliged to incorporate the kind of spectra that defined the technical description of speech perception. A closer approximation to the conditions of ordinary listening must characterize the empirical tests. Tests composed without imposing these assumptions reveal a set of functions rather different from the generic auditory model at work in the perceptual organization of speech. A second assumption, obliged by the generic auditory account of organization is that the binding of sensory elements into a coherent contour, ready to analyze, occurs automatically, with neither attention nor effort. This premise had been asserted, though not secured by evidence. Direct attempts at an assay have been clear. These studies showed plainly that, whether a sound is speech or not, its acoustic products, sampled auditorily, are resolved into a contour distinct from the auditory background only by the application of attention (Carlyon et al., 2001, 2003; Cusack, Carlyon, & Robertson, 2001; Cusack et al., 2004). Without attention, contours fail to form and sounds remain within an undifferentiated background. Deliberate intention can also affect the listener’s integration or segregation of an element within an auditory sensory contour, by an application of attentional focus (for instance, Billig, Davis, & Carlyon, 2018)

The plausibility of the generic account of perceptual organization A brief review of the acoustic properties of speech One challenge of perceptual organization facing a listener is simple to state: to find and follow a speech stream. This would be an easy matter were the acoustic constituents of a speech signal or their auditory sensory correlates unique to speech, if the speech signal were more or less stationary in its spectrum, or if the acoustic elements and the auditory impressions they evoke were similar moment by moment. None of these is true, however, which inherently undermines the plausibility of any attempt to formalize the perceptual organization of speech as a task of determining successive or simultaneous similarities in auditory experience. First, the acoustic effects of speech are distributed across six octaves of audibility. The sensory contour of an utterance is widely distributed across frequency. Second, none of the multitude of naturally produced vocal sounds composing a speech signal is unique to speech. Arguably, the physical models of speech production

8  Sensing Speech succeed so well because they exploit an analogy between vocal sound and acoustic resonance (Fant,  1960; Stevens & House,  1961). Third, one signature aspect of speech is the presence of multiple acoustic maxima and minima in the spectrum, and the variation over time in the frequencies at which the acoustic energy is concentrated (Stevens & Blumstein,  1981). This frequency variation of the formant centers is interrupted at stop closures, creating an acoustic spectrum that is both nonstationary and discontinuous. Fourth, the complex pattern of articulation by which talkers produce consonant holds and approximations creates heterogeneous acoustic effects consisting of hisses, whistles, clicks, buzzes, and hums (Stevens, 1998). The resulting acoustic pattern of speech consists of a nonstationary, discontinuous series of periodic and aperiodic elements, none of which in detail is unique to a vocal source. The diversity of acoustic constituents of speech is readily resolved as a coherent stream perceptually, though the means by which this occurs challenges the potential of the generic auditory account. Although some computational implementations of gestalt grouping have disentangled spoken sources of simple nonstationary spectra (Parsons, 1976; Summerfield, 1992), these have occurred for a signal free of discontinuities, as occurs in the production of sustained, slowly changing vowels. Slow and sustained change in the spectrum, though, is hardly typical of ordinary speech, which is characterized by consonant closures that impose rapid spectral changes and episodes of silence of varying duration. To resolve a signal despite silent discontinuities requires grouping by closure to extrapolate across brief silent gaps. To invoke generic auditory properties in providing this function would oppose present evidence, though. For example, in an empirical attempt to discover the standard for grouping by closure (Neff, Jestead, & Brown, 1982), the temporal threshold for gap detection was found to diverge from the tolerance of discontinuity in grouping. On such evidence, it is unlikely that a generic mechanism of extrapolation across gaps is responsible for the establishment of perceptual continuity, whether in auditory form or in the perception of speech. Evidence from tests of auditory form suggests that harmonic relations and amplitude comodulation promote grouping, albeit weakly (Bregman, Levitan, & Liao, 1990). That is, sharing a fundamental frequency or pulsing at a common rate promote auditory integration. These two characteristics are manifest by oral and nasal resonances and by voiced frication. This may be the most promising principle to explain the coherence of voiced speech by generic auditory means, for an appeal to similarity in frequency variation between the formants is unlikely to explain their coherence. Indeed, the pattern of frequency variation of the first formant typically differs from that of the second, and neither first nor second resemble the third, due to the different articulatory origins of each (Fant, 1960). To greatly simplify a complex relation, the center frequency of the first formant often varies with the opening and closing of the jaw, while the frequency of the second formant varies with the advancement and retraction of the tongue, and the frequency of the third formant alternates in its articulatory correlate. 
Accordingly, different patterns of frequency variation are observed in each resonance due to the relative independence of the control of these articulators (see Figure 1.2). Even were generic

[Figure 1.2 displays frequency as a function of time for panels (A) and (B).]
Figure 1.2  A comparison of natural and sinewave versions of the sentence “The steady drip is worse than a drenching rain”: (A) natural speech; (B) sinewave replica.

auditory functions to bind the comodulated formants into a single stream, without additional principles of perceptual organization a generic gestalt‐derived parsing mechanism that aims to compose perceptual streams of similar auditory elements would fail; indeed, it would fracture the acoustically diverse components of a single speech signal into streams of similar elements, one of hisses, another of buzzes, a third of clicks, and so on, deriving an incoherent profusion of streams despite the common origin of the acoustic elements in phonologically governed sound production (Lackner & Goldstein, 1974; Darwin & Gardner, 1986; Remez et al., 1994). Apart from this consideration in principle, a small empirical literature exists on which to base an adequate account of the perceptual organization of speech.
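The consequence described in this paragraph can be caricatured in a few lines of code. The toy grouping rule below binds like to like by a crude quality class and by frequency proximity; applied to the heterogeneous constituents of a single utterance, it returns several streams rather than one. The element list, the feature classes, and the similarity threshold are invented for illustration and do not model auditory processing.

# Toy caricature of similarity-based grouping applied to the constituents of
# one utterance: like binds to like, so hisses, buzzes, and clicks end up in
# separate streams despite their common origin. All values are invented.
from dataclasses import dataclass

@dataclass
class Element:
    label: str        # a constituent of one utterance
    kind: str         # crude quality class: "hiss", "buzz", or "click"
    center_hz: float  # nominal spectral center

def group_by_similarity(elements, max_hz_gap=2000.0):
    """Assign each element to a stream whose last member is of the same kind
    and close in frequency; otherwise start a new stream."""
    streams = []
    for e in elements:
        for s in streams:
            last = s[-1]
            if e.kind == last.kind and abs(e.center_hz - last.center_hz) <= max_hz_gap:
                s.append(e)
                break
        else:
            streams.append([e])
    return streams

utterance = [
    Element("s-frication", "hiss", 4500.0),
    Element("vowel voicing", "buzz", 500.0),
    Element("stop burst", "click", 2000.0),
    Element("sh-frication", "hiss", 3000.0),
    Element("nasal murmur", "buzz", 300.0),
]

for i, stream in enumerate(group_by_similarity(utterance)):
    print(i, [e.label for e in stream])   # three streams from one utterance

The output is the incoherent profusion described above: one stream of hisses, another of buzzes, and a third holding the burst, in place of a single coherent speech stream.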

A few clues
In measures 13–26 of the first movement of Schubert’s Symphony no. 8 in B minor (D. 759, the “Unfinished”), the parts played by oboe and clarinet, a unison melody, fuse so thoroughly that no trace of oboe or clarinet quality remains. This instance in which two sources of sound are treated perceptually as one led Broadbent and

10  Sensing Speech Ladefoged (1957) to attempt a study that offered a clue to the nature of the perceptual organization of speech. Beginning with a synthetic sentence composed of two formants, they created two single formant patterns, one of the first formant and the other of the second, each excited at the same fundamental frequency. Concurrently, the two formants evoked an impression of an English sentence; singly, each evoked an impression of an unintelligible buzz. In one test condition, the formants were presented dichotically, in analogy to an oboe and a clarinet playing in unison. This resulted in perception of a single voice speaking the sentence, as if two spatially distinct sources had combined. Despite the dissimilarities in spatial locus of the components, this outcome is consistent with a generic auditory account of organization on grounds of harmonicity and amplitude comodulation. However, when each formant was rung on a different fundamental, subjects no longer reported a single voice, as if fusion failed to occur because neither harmonicity nor amplitude comodulation existed to oppose the spatial dissimilarity of the components. It is remarkable, nonetheless, that in view of these multiple breaches of similarity, subjects accurately reported the sentence “What did you say before that?” although in this condition it seemed to be spoken by two talkers, one at each ear, each speaking at a different pitch. In other words, listeners reported divergent perceptual states: (1) the splitting of the auditory streams due to dissimilar pitch; and (2) the combining of auditory streams to form speech. Although a generic gestalt‐derived account can explain a portion of the results, it cannot explain the combination of spatially and spectrally dissimilar formant patterns to compose a single speech stream. In fine detail, research on perception in a speech mode also raised this topic, though indirectly. This line of research sought to calibrate the difference in the resolution of auditory form and phonetic form of speech, thereby to identify psychoacoustic and psychophysical characteristics that are unique to speech perception. By opposing acoustic patterns evoking speech perception with nonspeech control patterns, the perceptual effect of variation in an acoustic correlate of a phonetic contrast was compared to the corresponding effect of the same acoustic property removed from the phonetically adequate context. For instance, Mattingly et  al. (1971) examined the discriminability of a second formant frequency transition as an isolated acoustic pattern and within a synthetic syllable in which its variation was correlated with the perception of the place of articulation of a stop consonant. A finding of different psychophysical effect, roughly, Weber’s law for auditory form and categorical perception for phonetic form, was taken as the signature of each perceptual mode. In a variant of the method specifically pertinent to the description of perceptual organization, Rand (1974) separated the second formant frequency transition, the correlate of the place contrast, from the remainder of a synthetic syllable and arrayed the acoustic components dichotically. In consequence, the critical second formant frequency transition presented to one ear was resolved as an auditory form while it also contributed to the phonetic contrast it evoked in apparent combination with the formant pattern presented to the other ear. 
In other words, with no change in the acoustic conditions, a listener could resolve the properties of the auditory form of the formant‐frequency transition or

Perceptual Organization of Speech  11 the phonetic contrast it evoked when combined with the rest of the synthetic acoustic pattern. The dichotic presentation permitted two perceptual organizations of the same element concurrently, due to the spatial and temporal disparity that blocked fusion on generic auditory principles, and due to the phonetic potential of the fused components. This phenomenon of concurrent auditory and phonetic effects of a single acoustic element was described as duplex perception (Liberman, Isenberg, & Rakerd, 1981; Nygaard, 1993; Whalen & Liberman, 1996), and it has been explained as an effect of a peremptory aspect of phonetic organization and analysis.1 No matter how the evidence ultimately adjudicates the psychophysical claims, it is instructive to note that the generic auditory functions of perceptual organization only succeed in rationalizing the split of the dichotic components into separate streams, and fail to provide a principle by which the combination of elements occurs.

Organization by coordinate variation A classic understanding of the perception of speech derives from study of the acoustic correlates of phonetic contrasts and the physical and articulatory means by which they are produced (reviewed by Raphael, Chapter 22; also, see Fant, 1960; Liberman et al., 1959; Stevens & House, 1961). In addition to calibrating the perceptual response to natural samples of speech, researchers also used acoustic signals produced synthetically in detailed psychoacoustic studies of phonetic identification and differentiation. In typical terminal analog speech synthesis, the short‐term spectra characteristic of the natural samples are preserved, lending the synthesis a combination of natural vocal timbre and intelligibility (Stevens, 1998). Acoustic analysis of speech, and synthesis that allows for parametric variation of speech acoustics, have been important for understanding the normative aspects of perception, that is, the relation between the typical or likely auditory form of speech sounds encountered by listeners and the perceptual analysis of phonetic properties (Diehl, Molis & Castleman, 2001; Lindblom, 1996; Massaro, 1994). However, a singular focus on statistical distributions of natural samples and on synthetic idealizations of natural speech discounts the adaptability and versatility of speech perception, and deflects scientific attention away from the properties of speech that are potentially relevant to understanding perceptual organization. Because grossly distorted speech remains intelligible (e.g. Miller,  1946; Licklider, 1946) when many of the typical acoustic correlates are absent, it is difficult to sustain the hypothesis that finding and following a speech stream crucially depends on meticulous registration of the brief and numerous acoustic correlates of phonetic contrasts described in classic studies. But, if the natural acoustic products of vocalization do not determine the perceptual organization and analysis of speech, what does? An alternative to this conceptualization was prompted by the empirical use of a technique that combines digital analysis of speech spectra and digital synthesis of time‐varying sinusoids (Remez et al., 1981). This research has revealed the perceptual effectiveness of acoustic patterns that exhibit the gross spectro‐temporal

12  Sensing Speech characteristics of speech without incorporating the fine acoustic structure of vocally produced sound. Perceptual research with these acoustic materials and their relatives (noise‐band vocoded speech: Shannon et  al.,  1995; acoustic chimeras: Smith, Delgutte, & Oxenham, 2002; Remez, 2008) has permitted an estimate of a listener’s sensitivity to the time‐varying patterns of speech spectra independent of the sensory elements of which they are composed. The premise of sinewave replication is simple, though in practice it is as laborious as other forms of copy synthesis (Remez et al., 2011). Three or four tones, each approximating the center frequency and amplitude of an oral, nasal, or fricative resonance, are created to imitate the coarse‐grain attributes of a speech sample. Lacking the momentary aperiodicities, harmonic spectra, broadband formants, and regular pulsing of natural and most synthetic speech, a sinewave replica of an utterance differs acoustically and qualitatively from speech while remaining intelligible. A spectrogram of a sinewave sentence is shown in the bottom panel of Figure 1.2; a comparison of short‐term spectra of natural speech and both synthetic and sinewave imitations is shown in Figure 1.3. It is significant that three or four tones reproducing a natural formant pattern evoke an experience in a naive listener of several concurrent whistles changing in pitch and loudness, and do not automatically elicit an impression of speech. The listener’s attention is free to follow the course of the auditory form of each component tone. Certainly, this aspect of a sinewave pattern is salient auditorily, and little of the raw quality prompts attention to the tones as a single compound contour. Studies show that listeners are well able to attend to individual tone components and to focus on the pattern of pitch changes each evokes over the run of a few seconds (Remez & Rubin, 1984, 1993). In other words, the immediate experience of the listener is accurately predicted by a generic auditory account, because acoustic elements that change frequency at different rates to different extents, onsetting and offsetting at different moments in different frequency ranges are dissimilar along many dimensions that specify separate perceptual streams according to gestalt principles. Once instructed that the tones compose synthetic speech, a listener readily reports linguistic properties as if hearing the original natural utterance on which the sinewave replica was modeled. If attention to a complex, broadband contour is characteristic of the perceptual organization of speech, its sufficient condition is met in the absence of natural acoustic vocal products. Performance levels reported with this kind of copy synthesis have varied with the proficiency of the synthesis, although it has often been possible to achieve very good intelligibility, rivalling natural speech (for instance, Remez et al., 2008). Within this range of performance levels, these acoustic conditions pose a crucial test of a gestalt‐derived account of perceptual organization, for a perceiver must integrate the tones in order to compose a single sensory contour segregated from the background, ready to analyze for the linguistic properties borne on the pattern of the signal. Several tests support this claim of true integration preliminary to analysis. 
In direct assessments, the intelligibility of sinewave replicas of speech exceeded intelligibility predicted from the presentation of individual tones (Remez et  al.,

[Figure 1.3 plots amplitude (in 5 dB steps) against frequency (0–5 kHz) for the three versions.]
Figure 1.3  A comparison of the short‐term spectrum of natural speech (top); terminal analog synthetic speech (middle); and sinewave replica (below). Note the broadband resonances and harmonic spectra in natural and synthetic speech, in contrast to the sparse, nonharmonic spectrum of the three tones.

14  Sensing Speech 1981, 1987, 1994). This superadditive performance is evidence of integration, and it persisted even when the tones came from separate spatial sources, violating similarity in location (Remez et al., 1994; see also Broadbent & Ladefoged, 1957). In combining the individual tones into a single time‐varying coherent stream, however, this complex organization, which is necessary for phonetic analysis, does not exclude an auditory organization as independently resolvable streams of tones (Remez & Rubin, 1984, 1993; Roberts, Summers, & Bailey, 2015). In fact, the perceiver’s resolution of the pitch contour associated with the frequency pattern of tonal constituents is acute whether or not the fusion of the tones supporting phonetic perception occurs (Remez et al., 2001). On this evidence rests the claim that sinewave replicas are bistable, exhibiting two simultaneous and exclusive organizations. Even if the sensory causes of these perceptual impressions were strictly parallel, the bistable occurrence of auditory and phonetic perceptual organization is not amenable to further simplification. A sinewave replica of speech allows two organizations, much as celebrated cases of visual bistability do: the duck–rabbit figure, Woodworth’s equivocal staircase, Rubin’s vase, and Necker’s cube. Unlike the visual cases of alternating stability, the bistability that occurs in the perception of sinewave speech is simultaneous. A conservative description of these findings is that an organization of the auditory properties of sinewave signals occurs according to gestalt‐derived principles that promote segregation of the tones into separate contours. Phonetic perceptual analysis fails to apply or to succeed under that organization. However, the concurrent variation of the tones also satisfies a non‐gestalt principle of coordinate auditory variation despite local dissimilarities, and this promotes integration of the components into a single broadband stream. This organization, binding diverse components into a single complex sensory contour, is susceptible to phonetic analysis.
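As a concrete supplement to this section, the sketch below schematizes the synthesis half of sinewave replication: three time‐varying sinusoids follow formant center frequencies and amplitudes and are summed. The formant tracks shown are hypothetical placeholders for a brief stretch of speech, and the frame rate, sample rate, and linear interpolation are assumptions; in the published procedure the tracks are estimated from a natural utterance by laborious analysis.

# Schematic of sinewave synthesis from formant tracks. Each tone follows the
# center frequency and amplitude of one resonance; the summed pattern has no
# harmonics, pulsing, or broadband formant structure. The tracks below are
# hypothetical placeholders, not measurements.
import numpy as np

SAMPLE_RATE = 22050   # Hz (assumed)
FRAME_RATE = 100      # formant estimates per second (assumed)

formant_tracks = {
    "F1": (np.array([450.0, 480.0, 520.0, 560.0, 540.0]),
           np.array([1.0, 1.0, 0.9, 0.8, 0.8])),
    "F2": (np.array([1500.0, 1550.0, 1700.0, 1900.0, 1850.0]),
           np.array([0.5, 0.5, 0.6, 0.6, 0.5])),
    "F3": (np.array([2500.0, 2520.0, 2580.0, 2650.0, 2600.0]),
           np.array([0.3, 0.3, 0.3, 0.3, 0.3])),
}

def synthesize_tone(freqs, amps, sr=SAMPLE_RATE, frame_rate=FRAME_RATE):
    """One time-varying sinusoid: interpolate the frame-rate tracks to the
    audio rate and integrate frequency to obtain a continuous phase."""
    n_samples = int(len(freqs) / frame_rate * sr)
    frame_times = np.arange(len(freqs)) / frame_rate
    sample_times = np.arange(n_samples) / sr
    f = np.interp(sample_times, frame_times, freqs)
    a = np.interp(sample_times, frame_times, amps)
    phase = 2 * np.pi * np.cumsum(f) / sr
    return a * np.sin(phase)

# The replica is simply the sum of the tones.
replica = sum(synthesize_tone(f, a) for f, a in formant_tracks.values())

Nothing in the summed signal is harmonic, pulsed, or vocal in quality, which is precisely what makes such patterns a demanding test of similarity-based accounts of grouping.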

The perceptual organization of speech

Characteristics of the perceptual coherence of speech
While much remains to be discovered about perceptual organization that depends on sensitivity to complex coordinate variation, research on the psychoacoustics and perception of speech from a variety of laboratories permits a rough sketch of the parameters. The portrait of perceptual organization offered here gathers evidence from different research programs that aimed to address a range of perceptual questions, for there is no unified attempt at present to understand the organization of perceptual streams that approach the acoustic variety and distributed frequency breadth of speech. Overall, these results expose the perceptual organization of speech as fast, unlearned, nonsymbolic, keyed to complex patterns of sensory variation, indifferent to sensory quality, and requiring attention whether elicited or exerted.
The evidence that perceptual organization of speech is fast rests on long‐established findings that an auditory trace fades rapidly. Although estimates vary

Perceptual Organization of Speech  15 with the task used to calibrate the durability of unelaborated auditory sensation, all of the measures reflect the urgency with which the fading trace is recoded into a more stable phonetic form (Howell & Darwin, 1977; Pisoni & Tash, 1974). It is unlikely that much of the auditory form of speech persists beyond a tenth of a second, and it has decayed beyond recurrent access by 400  ms. The sensory integration required for perceptual organization is tied to this pace. Contrary to this notion of perceptual organization as exceedingly rapid, an extended version of auditory scene analysis (Bregman, 1990) proposes a resort to a cognitive mechanism occurring well after primitive grouping takes place, to function as a supplement to the gestalt‐based mechanism. Such knowledge‐based mechanisms also feature as a method to resolve difficult grouping in artifactual approaches to perceptual organization (e.g. Cooke & Ellis, 2001). However, the formal or practical advantages that this method achieves come at a clear cost, namely, to reject boundary conditions that subscribe to the natural auditory limits of perceptual organization. The propensity to organize an auditory pattern by virtue of complex coordinate variation is apparently unlearned, or nearly so. In tests with infant listeners, 14‐week‐old subjects exhibited the pattern of adult sensitivity to dichotically arrayed components of synthetic syllables (Eimas & Miller,  1992; cf. Whalen & Liberman,  1987; Vouloumanos & Werker,  2007; Rosen & Iverson,  2007). In this case, the pattern of perceptual effects evident in infants was contingent on the integration of sensory elements despite detailed failures of auditory similarity on which gestalt grouping depends. Perhaps it is an exaggeration to claim that this organizational function is strictly unlearned, for even the youngest subject in the sample had been encountering airborne sound for three months, and undeniably had the opportunity to refine their sensitivity through this exposure. However, the development of sensitivity to complex auditory patterns cannot plausibly result from a history of meticulous trial and error in listeners of such a tender age, nor is it likely to reflect specific knowledge of the auditory effects that typify American English phonetic expression. It is far likelier that this sensitivity represents the emergence of an organizational component of listening that must be present for speech perception to develop (Houston & Bergeson,  2014), and 14‐week‐old infants still have several months ahead of them before the phonetic properties of speech become conspicuous (Jusczyk, 1997). Research on sinewave replicas of speech has shown that the perceptual organization of speech is nonsymbolic and keyed to patterns of sensory variation. The evidence is provided by tests (Remez et al., 1994; Remez, 2001; Roberts, Summers, & Bailey, 2010) that used tone analogs of sentences in which a sinewave replicating the second formant was presented to one ear while tone analogs of the first, third, and fricative formants were presented to the other ear. In such conditions, much as Broadbent and Ladefoged had found, perceptual fusion readily occurs despite the violation of spatial dissimilarity and the absence of other attributes to promote gestalt‐based grouping. To sharpen the test, an intrusive tone was presented in the same ear with the tone analogs of the first, third, and fricative tones. 
This single tone presented by itself does not evoke phonetic impressions, and is perceived as

16  Sensing Speech an auditory form without symbolic properties: it merely changes in pitch and loudness without phonetic properties. In order to resolve the speech stream under such conditions, a listener must reject the intrusive tone despite its spatial similarity to the first, third, and fricative tones of the sentence, and appropriate the tone analog of the second formant to form the speech stream despite its spatial displacement from the tones with which it combines. Control tests established that a tone analog of the second formant alone failed to evoke an impression of phonetic properties. Performance of listeners in a transcription task, a rough estimate of phonetic coherence, was good if the intrusive tone did not vary in a speechlike manner. That is, an intrusive tone of constant frequency or of alternating frequency had no effect on the perceptual organization of speech. When the intrusive tone exhibited the tempo and range of frequency variation appropriate for a second formant, without supplying the proper variation that would combine with other tones to form an intelligible stream, performance suffered. It was as if the criterion for integration of a tone was specific to its frequency variation under conditions in which it was nonetheless unintelligible. Since the advent of the telephone, it has been obvious that a listener’s ability to find and follow a speech stream is indifferent to distortion of natural auditory quality. The lack of spectral fidelity in early forms of speech technology made speech sound phony, literally, yet it was readily recognized that this lapse of natural quality did not compromise the usefulness of speech as a communication channel (Fletcher, 1929). This fact indicates clearly that the functions of perceptual organization hardly aim to collect aspects of sensory stimulation that have the precise auditory quality of natural speech. Indeed, Liberman and Cooper (1972) argued that early synthesis techniques evoked phonetic perception because the perceiver cheerfully forgave departures from natural quality that were often extreme. In techniques such as speech chimeras (Smith, Delgutte, & Oxenham, 2002) and sinewave replication, the acoustic properties of intelligible signals lie beyond the productive capability of a human vocal tract, and the impossibility of such spectra as vocal sound does not evidently block the perceptual organization of the sound as speech. The variation of a spectral envelope can be taken by listeners to be speechlike despite acoustic details that give rise to impressions of gross unnaturalness. Findings of this sort contribute a powerful argument against psychoacoustic explanations of speech perception generally (e.g. Holt,  2005; Lotto & Kluender,  1998; Lotto, Kluender, & Holt, 1997; Toscano & McMurray, 2010), and perceptual organization specifically. Ordinary subjective experience of speech suggests that perceptual organization is unbidden, for speech seems to pop right out of a nearby commotion. Yet studies reveal that sensory contours, whether simple or complex, form only with attention. In speech, as with simpler contours, the primitive segregation of figure and ground is at stake. Attention permits perceptual analysis to apply to a broadband contour of heterogeneous acoustic composition. 
Consistent with this axiom – that sensory contours require attention to form – findings with sinewave replicas of utterances show that the perceptual organization of speech requires attention and is not an automatic consequence of a class of sensory effects. This feature differs from the automatically engaged process proposed in strict modular terms by Liberman and Mattingly

Perceptual Organization of Speech  17 (1985). With sinewave signals, most subjects fail to notice that concurrent tones can cohere unless they are asked specifically to listen for speech (Remez et al., 1981; also see Liebenthal et al., 2003), indicating that the auditory forms alone do not evoke speech perception. Critically, a listener who is asked to attend to arbitrary tone patterns as if listening to speech fails to report phonetic impressions (Remez et al., 1981), indicating that signal structure as well as phonetic attention are required for the organization and analysis of speech. A neural population code representing the speech spectrum without attention cannot be responsible for both the stable albeit unintegrated auditory form of sinewave speech and the stable integrated coherent contour that is susceptible to phonetic analysis (cf. Engineer et  al.,  2008). In this regard, general auditory perceptual organization is similar to speech perception in requiring attention for auditory figures to form (e.g. Carlyon et al., 2001). Of course, a natural vocal signal exhibits the phenomenal quality of speech, and this is evidently sufficient to elicit a productive form of attention for perceptual organization to ensue. This premise cautions against the use of passive listening procedures to identify supposed automatic functions of linguistic analysis of speech (e.g. Zevin et al., 2010). Such studies merely fail to secure attention. A listener whose attention is free to wander cannot be considered inattentive to the sounds delivered without instruction. In such conditions, performance arguably reflects a mix of cognitive states evoked with attention and vegetative excitation evoked without attention.

Generic auditory organization and speech perception The intelligibility of sinewave replicas of utterances, of noise‐band vocoded speech, and of speech chimeras reveals that a perceiver can find and follow a speech signal composed of dissimilar acoustic and auditory constituents, in contrast to the principles on which gestalt‐based generic functions operate. These findings show that perceptual organization of speech can occur solely by virtue of attention to the complex coordinate variation of an acoustic pattern. The use of such exotic acoustic signals for the proof creates some uncertainty that ordinary speech perception is satisfactorily characterized by tests using these acoustic oddities. An argument of Remez et al. (1994) for considering these tests to be a useful index of the perception of commonplace speech signals begins by noting that phonetic perception of ­sinewave replicas of utterances depends on a simple instruction to listen to the tones as speech. Because the disposition to hear sinewave words and sentences appears readily, without arduous or lengthy training, this prompt adaptation to phonetic organization and analysis suggests that the ordinary cognitive resources of speech perception are operating for sinewave speech. Although some form of short‐term perceptual learning might be involved, the swiftness of the appearance of adequate perceptual function is evidence that any special induction to accommodate sinewave signals is a marginal component of perception. Despite all, natural speech consists of large stretches of glottal pulsing, which creates amplitude comodulation over time and harmonic relations between concurrent portions of the spectrum. This has led to a reasonable proposal (Barker & Cooke, 1999; Darwin, 2008) that generic auditory grouping functions, although

18  Sensing Speech not necessary for the perceptual organization of speech, contribute to perceptual organization when speech spectra satisfy the gestalt criteria. The consistent finding that speech spectra organize quickly  –  on the order of milliseconds  –  and generic auditory grouping takes time to build – on the order of seconds – may justify doubt in the asserted privilege of gestalt‐based grouping by similarity. A critical empirical test was provided by Carrell and Opie (1992), which offers an index of the plausibility of the claim. In the test, the intelligibility of sinewave sentences was compared in two acoustic conditions: (1) three‐tone time‐varying sinusoids; and (2) three‐tone time‐varying sinusoids on which a regular amplitude pulse was imposed. Although the tone patterns in the first condition were not susceptible to gestalt‐based grouping, because they failed to exhibit similarity in each of the relevant dimensions that we have discussed, the pulsed tone patterns in the second condition exhibited amplitude comodulation and harmonicity in its complex spectra (Bregman, Levitan, & Liao, 1990). All other things being equal, the perceptual organization attributable to complex coordinate variation should have been reinforced by perceptual organization attributable to similarity that triggers generic auditory grouping. Indeed, Carrell and Opie found that pulsed sentences were more intelligible than smoothly varying sinusoids, as if the spectral components once bound more securely were more successfully analyzed. The assertion offered by Barker and Cooke (1999) about this phenomenon is that generic auditory functions can reinforce the grouping of speech signals, although on close examination the evidence does not yet warrant an endorsement of a hybrid model of perceptual organization. Carrell and Opie had used a range of pulse rates and conditions in their study, and reported that the intelligibility gain attributable to pulsing a sinewave sentence was restricted to a pulse rate in the range of 50–100 Hz. No benefit of pulsing was observed for a pulse rate of 200 Hz. While this topic merits additional examination, the available evidence encourages a doubtful conclusion about this hypothetical hybrid character of perceptual organization, which would necessarily be limited in applicability to speech signals produced by low bass voices; its benefit would not extend to tenors, to say nothing of altos and sopranos. Most generously, we might conclude that the relation of primitive gestalt‐based generic auditory grouping and the more abstract organization by sensitivity to coordinate variation cannot be defined without stronger evidence, and that it is premature to conclude that the gestalt set plays a prominent or even a secondary role in the perceptual organization of speech.
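To picture the Carrell and Opie manipulation discussed above, the sketch below imposes a regular amplitude pulse on a summed tone pattern, adding comodulation without altering the pattern of frequency variation. The pulse shape, duty cycle, and the 75 Hz rate are assumptions, chosen only to fall within the 50–100 Hz range in which the intelligibility benefit was reported; the demonstration signal is an arbitrary two-tone pattern, not speech.

# Schematic of imposing a regular amplitude pulse on a tone pattern, as a
# rough analog of the pulsed-sinewave condition. Pulse rate, duty cycle, and
# the demonstration signal are assumptions for illustration only.
import numpy as np

def pulse_tones(signal, sr, pulse_rate=75.0, duty=0.5):
    """Multiply a signal by a periodic on/off envelope at the given rate."""
    t = np.arange(len(signal)) / sr
    phase = (t * pulse_rate) % 1.0            # position within each pulse period
    envelope = (phase < duty).astype(float)   # 1.0 during the "on" portion
    return signal * envelope

sr = 22050
t = np.arange(int(0.5 * sr)) / sr
tones = np.sin(2 * np.pi * 500.0 * t) + 0.5 * np.sin(2 * np.pi * 1500.0 * t)
pulsed = pulse_tones(tones, sr, pulse_rate=75.0)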

Implications of perceptual organization for theories of speech perception

The nature of speech cues
What causes the perception of speech? A classic answer takes a linguistically significant contrast – voicing, for instance – and provides an inventory of natural acoustic correlates of a careful articulation of the contrast (e.g. Lisker, 1978).

Perceptual Organization of Speech  19 A perceptual account that reverses the method depicts a meticulous listener collecting individual acoustic correlates as they land and assembling them in a stream, thereby to tally the strength with which a constellation of cues indicates the likely occurrence of a linguistic constituent. Klatt’s retrospective survey of perceptual accounts describes many normative approaches that treat the acoustic signal as a straightforward composite of acoustic correlates. The function of perceptual organization, usually omitted in such accounts, establishes the perceiver’s compliance with the acoustic products of a specific source of sound, and in the case of speech it is the probabilistic function that finds and tracks the likely acoustic products of vocalization. However, it is clear from evidence of several sorts – tolerance of distortion, effectiveness of impossible signals, forgiveness of departures from natural timbre  –  that the organizational component of perception that yields a speech stream fit to be analyzed cannot collect acoustic cues piecemeal, as this simple view describes. The functions of perceptual organization act, instead, as if attuned to a complex form of regular if unpredictable spectro‐temporal variation within which the specific acoustic and auditory elements matter far less than their overall configuration. The evolving portrait of speech perception that includes organization and analysis recasts the raw cue as the property of perception that gives speech its phenomenality, though not its phonetic effect. The transformation of natural speech to chimera, to noise‐band vocoded signal, and to sinewave replica is phonetically conservative, preserving the fine details of subphonemic variation while varying to the extremes of timbre or auditory quality. It is apparent that the competent listener derives phonetic impressions from the properties that these different kinds of signal share, and derives qualitative impressions from their unique attributes. The shared attribute, for want of a more precise description, is a complex modulation of spectrum envelopes, although the basis for the similar effect of the infinitely sharp peaks of sinewave speech and the far coarser spectra of chimerical and noise‐band vocoded speech has still to be explained. None of these manifests the cues present in natural speech despite the success of listeners in understanding the message. The conclusion supported by these findings is clear: phonetic perception does not require the sensory registration of natural speech cues. Instead, the organizational component of speech perception operates on a spectro‐temporal grain that is requisite both for finding and following a speech signal and for analyzing its linguistic properties. The speech cues that seemed formerly to bear the burden of stimulating phonetic analyzers into action appear in hindsight to provide little more than auditory quality subordinate to the phonetic stream. An additional source of evidence is encountered in the phenomenal experience of perceivers who listen to speech via electrocochlear prostheses (Goh et al., 2001; Liebenthal et al., 2003). Intelligibility of speech perceived via cochlear implant is often excellent, rivaling that of normal hearing, and recent studies with infant and juvenile subjects (Svirsky et al., 2000) suggest that this form of sensory substitution is effective even at the earliest stages of language development (see Hunter & Pisoni, Chapter  20). 
The mechanism of acoustic transduction at the auditory periphery is anomalous, it goes without saying, and the phenomenal experience of

20  Sensing Speech listeners receiving this appliance to initiate neural activity differs hugely from the ordinary auditory experience of natural speech. Despite the absence of veridical perceptual experience of the raw qualities of natural speech, electrocochlear prostheses are effective in the self‐regulation of speech production by its users, and are effective perceptually despite the abject deficit in faithfully presenting natural acoustic elements of speech. What brings about the perception of speech, then? Without the acoustic moments, there is no stream of speech, but the stream itself plays a causal role beyond that which has been attributed to momentary cues since the beginning of technical study of speech.

A constraint on normative descriptions of speech perception The application of powerful statistical techniques to problems in cognitive psychology has engendered a variety of normative, incidence‐based accounts of perception. Since the 1980s, a technology of parallel computation based loosely on an idealization of the neuron has driven the creation of a proliferation of devices that perform intelligent acts. The exact modeling of neurophysiology is rare in this enterprise, though probabilistic models attired as neural nets enjoy a hopeful if unearned appearance of naturalness that older, algorithmic explanations of cognitive processes unquestionably lack. As a theory of human cognitive function, it is more truthful to say that deep learning implementations characterize the human actor as an office full of clerks at an insurance company, endlessly tallying the incidence of different states in one domain (perhaps age and zip code, or the bitmap of the momentary auditory effect of a noise burst in the spectrum) and associating them (perhaps in a nonlinear projection) with those in another domain (perhaps the risk of major surgery, or the place of articulation of a consonant). In the perception of speech and language, the ability of perceivers to differentiate levels of linguistic structure has been attributed to a sensitivity to inhomogeneities in distributions of specific instances of sounds, words, and phrases. Although a dispute has taken shape about the exact dimensions of the domain within which sensitivity to distributions can be useful (e.g. Peña et al., 2002; contra Seidenberg, MacDonald, & Saffran, 2002), there is confident agreement that a distributional analysis of a stream of speech is performed in order to derive a linguistic phonetic segmental sequence. Indeed, this is claimed as one key component of language acquisition in early childhood (Saffran, Aslin, & Newport, 1996). The presumption of this assertion obliges a listener to establish and maintain in memory a distribution of auditory tokens projectable into phonetic types (Holt & Lotto, 2006). This is surely false. The rapid decay of an auditory trace of speech leaves it uniquely unfit for functions requiring memory lasting longer than 100 ms, and for this reason it is simply implausible that stable perceptual categories rest on durable representations of auditory exemplars of speech samples. Moreover, the notion of perceptual organization presented in this chapter argues that a speech stream is not usefully represented as a series of individual cues, whether for perceptual organization or for analysis. In fact, in order to determine that a particular acoustic moment is a cue, a perceptual function already sensitive to coordinate

Perceptual Organization of Speech  21 variation must apply. Whether or not a person other than a researcher compiling entries in the Dictionary of American Regional English can become sensitive to distributions of linguistic properties as such, it is exceedingly unlikely that the perceptual resolution of linguistic properties in utterances is much influenced by representations of the statistical properties of speech sounds. Indeed, the neural clerks would be free to tally what they will, but perception must act first to provide the instances.

Multisensory perceptual organization

More than sixty years ago, Sumby and Pollack (1954) conducted a pioneering study of the perception of speech presented in noise in which listeners could also see the talkers whose words they aimed to recognize. The point of the study was to calibrate the level at which the speech signal would become so faint in the noise that, to sustain adequate performance, attention would switch from an inaudible acoustic signal to the visible face of the talker. In fact, the visual channel contributed to intelligibility at all levels of performance, indicating that the perception of speech is ineluctably multisensory.

But how does the perceiver determine the audible and visible composition of a speech stream? This problem (reviewed by Rosenblum & Dorsi, Chapter 2) is a general form of the listener's specific problem of perceptual organization, understood as a function that follows the speechlike coordinate variation of a sensory sample of an utterance. To assign auditory effects to the proper source, the perceptual organization of speech must capture the complex sound pattern of a phonologically governed vocal source, sensing the spectro-temporal variation that transcends the simple similarities on which the gestalt-derived principles rest. It is obvious that gestalt principles couched in auditory dimensions would fail to merge auditory attributes with visual attributes. Because auditory and visual dimensions are simply incommensurate, it is not obvious that any notion of similarity would hold the key to audiovisual combination. The properties that the two senses share – localization in azimuth and range, and temporal pattern – are violated freely without harming audiovisual combination, and therefore cannot be requisite for multisensory perceptual organization.

The phenomena of multimodal perceptual organization confound straightforward explanation in yet another instructive way. Audiovisual speech perception can succeed under conditions in which the audible and visible components are useless separately for conveying the linguistic properties of the message (Rosen, Fourcin, & Moore, 1981; Remez et al., forthcoming). This phenomenon alone disqualifies current models that assert that phoneme features are derived separately in each modality as long as they are taken to stem from a single event (Magnotti & Beauchamp, 2017). In addition, neither spatial alignment nor temporal alignment of the audible and visible components must be veridical for multimodal perceptual organization to deliver a coherent stream fit to analyze (see Bertelson, Vroomen, & de Gelder, 1997; Conrey & Pisoni, 2003; Munhall et al., 1996). Under such discrepant conditions, audiovisual integration occurs despite the perceiver's evident awareness of the spatial and temporal misalignment, indicating a

divergence in the perceptual organization of events and the perception of speech. In consequence, it is difficult to conceive of an account of such phenomena by means of perceptual organization based on tests of similar sensory details applied separately in each modality. Instead, it is tempting to speculate that an account of perceptual organization of speech can ultimately be characterized in dimensions that are removed from any specific sensory modality, and yet be expressed in parameters that are appropriate to the sensory samples available at any moment.

Conclusion

Perceptual organization is the critical function by which a listener resolves the sensory samples into streams specific to worldly objects and events. In the perceptual organization of speech, the auditory correlates of speech are resolved into a coherent stream that is fit to be analyzed for its linguistic and indexical properties. Although many contemporary accounts of speech perception are silent about perceptual organization, it is unlikely that the generic auditory functions of perceptual grouping provide adequate means to find and follow the complex properties of speech. It is possible to propose a rough outline of an adequate account of the perceptual organization of speech by drawing on relevant findings from different research projects spanning a variety of aims. The evidence from these projects suggests that the critical organizational functions that operate for speech are fast, unlearned, nonsymbolic, keyed to complex patterns of coordinate sensory variation, indifferent to sensory quality, and require attention, whether elicited or exerted. Research on other sources of complex natural sound has the potential to reveal whether these functions are unique to speech or are drawn from a common stock of resources of unimodal and multimodal perceptual organization.

Acknowledgments

In conducting some of the research described here and in writing this chapter, the author is grateful for the sympathetic understanding of Samantha Caballero, Mariah Marrero, Lyndsey Reed, Hannah Seibold, Gabriella Swartz, Philip Rubin, and Michael Studdert-Kennedy. This work was supported by a grant from the National Science Foundation (SBE 1827361).

NOTE

1  It is notable that the literature on duplex perception contains meager direct evidence that the auditory and phonetic properties of the duplex acoustic test items are available simultaneously. The empirical evaluation of auditory and phonetic form employed

sequential measures, sometimes separated by a week, that assessed the perception of auditory form in one test and phonetic form in another. Evidence is provided that phonetic perception is distinct from a generic auditory process, but the literature is silent on the criteria of perceptual organization required for phonetic analysis.

REFERENCES Barker, J., & Cooke, M. (1999). Is the sine‐ wave cocktail party worth attending? Speech Communication, 27, 159–174. Bertelson, P., Vroomen, J., & de Gelder, B. (1997). Auditory–visual interaction in voice localization and in bimodal speech recognition: The effects of desynchronization. In C. Benoît & R. Campbell (Eds), Proceedings of the Workshop on Audio‐Visual Speech Processing: Cognitive and computational approaches (pp. 97–100). Rhodes, Greece: ESCA. Billig, A. J., Davis, M. H., & Carlyon, R. P. (2018). Neural decoding of bistable sounds reveals an effect of intention on perceptual organization. Journal of Neuroscience, 38, 2844–2853. Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press. Bregman, A. S., Abramson, J., Doehring, P., & Darwin, C. J. (1985). Spectral integration based on common amplitude modulation. Perception & Psychophysics, 37, 483–493. Bregman, A. S., Ahad, P. A., & Van Loon, C. (2001). Stream segregation of narrow‐ band noise bursts. Perception & Psychophysics, 63, 790–797. Bregman, A. S., & Campbell, J. (1971). Primary auditory stream segregation and perception of order in rapid sequence of tones. Journal of Experimental Psychology, 89, 244–249. Bregman, A. S., & Dannenbring, G. L. (1973). The effect of continuity on auditory stream segregation. Perception & Psychophysics, 13, 308–312. Bregman, A. S., & Dannenbring, G. L. (1977). Auditory continuity and

amplitude edges. Canadian Journal of Psychology, 31, 151–158. Bregman, A. S., & Doehring, P. (1984). Fusion of simultaneous tonal glides: The role of parallelness and simple frequency relations. Perception & Psychophysics, 36, 251–256. Bregman, A. S., Levitan, R., & Liao, C. (1990). Fusion of auditory components: effects of the frequency of amplitude modulation. Perception & Psychophysics, 47, 68–73. Bregman, A. S., & Pinker, S. (1978). Auditory streaming and the building of timbre. Canadian Journal of Psychology, 32, 19–31. Broadbent, D. E., & Ladefoged, P. (1957). On the fusion of sounds reaching different sense organs. Journal of the Acoustical Society of America, 29, 708–710. Carlyon, R. P., Cusack, R., Foxton, J. M., & Robertson, I. H. (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 27, 115–127. Carlyon, R. P., Plack, C. J., Fantini, D. A., & Cusack, R. (2003). Crossmodal and non‐sensory influences on auditory streaming. Perception, 32, 1393–1402. Carrell, T. D., & Opie, J. M. (1992). The effect of amplitude comodulation on auditory object formation in sentence perception. Perception & Psychophysics, 52, 437–445. Cherry, E. (1953). Some experiments on the recognition of speech, with one and two ears. Journal of the Acoustical Society of America, 25, 975–979.

24  Sensing Speech Conrey, B. L., & Pisoni, D. B. (2003). Audiovisual asynchrony detection for speech and nonspeech signals. In J.‐L. Schwartz, F. Berthommier, M.‐A. Cathiard, & D. Sodoyer (Eds), Proceedings of AVSP 2003: International Conference on Audio‐Visual Speech Processing. St. Jorioz, France September 4–7, 2003 (pp. 25–30). Retrieved September 24, 2020, from https://www.isca‐speech.org/archive_ open/avsp03/av03_025.html. Cooke, M., & Ellis, D. P. W. (2001). The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35, 141–177. Cusack, R., Carlyon, R. P., & Robertson, I. H. (2001). Auditory midline and spatial discrimination in patients with unilateral neglect. Cortex, 37, 706–709. Cusack, R., Deeks, J., Aikman, G., & Carlyon, R. P. (2004). Effects of location frequency region and time course of selective attention on auditory scene analysis. Journal of Experimental Psychology: Human Perception and Performance, 30, 643–656. Dannenbring, G. L., & Bregman, A. S. (1976). Stream segregation and the illusion of overlap. Journal of Experimental Psychology: Human Perception and Performance, 2, 544–555. Dannenbring, G. L., & Bregman, A. S. (1978). Streaming vs. fusion of sinusoidal components of complex tones. Perception & Psychophysics, 24, 369–376. Darwin, C. J. (2008). Listening to speech in the presence of other sounds. Philosophical Transactions B: Biological Sciences, 363, 1011–1021. Darwin, C. J., & Gardner, R. B. (1986). Mistuning a harmonic of a vowel: Grouping and phase effects on vowel quality. Journal of the Acoustical Society of America, 79, 838–844. Darwin, C. J., & Sutherland, N. S. (1984). Grouping frequency components of vowels: When is harmonic not a harmonic? Quarterly Journal of Experimental Psychology, 36A, 193–208.

Diehl, R. L., Molis, M. R., & Castleman, W. A. (2001). Adaptive design of sound systems. In E. Hume & K. Johnson (Eds), The role of speech perception in phonology (pp. 123–139). San Diego: Academic Press. Dorman, M. F., Cutting, J. E., & Raphael, L. J. (1975). Perception of temporal order in vowel sequences with and without formant transitions. Journal of Experimental Psychology: Human Perception and Performance, 104, 121–129. Eimas, P., Miller, J. (1992). Organization in the perception of speech by young infants. Psychological Science, 3, 340–345. Engineer, C. T., Perez, C. A., Chen, Y. H., et al. (2008). Cortical activity patterns predict speech discrimination ability. Nature Neuroscience, 11, 603–608. Fant, C. G. M. (1960). The acoustic theory of speech production. The Hague: Mouton. Fletcher, H. (1929). Speech and hearing. New York: Van Nostrand. Fodor, J. A. (1983). The modularity of mind. Cambridge, MA: MIT Press. Gaver, W. W. (1993). What in the world do we hear? An ecological approach to auditory event perception. Ecological Psychology, 5, 285–313. Goh, W. D., Pisoni, D. B., Kirk, K. I., & Remez, R. E. (2001). Audio‐visual perception of sinewave speech in an adult cochlear implant user: A case study. Ear and Hearing, 22, 412–419. Hochberg, J. (1974). Organization and the gestalt tradition. In E. C. Carterette and M. P. Friedman (Eds), Handbook of perception: Vol. 1. Historical and philosophical roots of perception(pp. 179–210). New York: Academic Press. Holt, L. L. (2005). Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychological Science, 16, 305–312. Holt, L. L., & Lotto, A. J. (2006). Cue weighting in auditory categorization: Implications for first and second language acquisition. Journal of the Acoustical Society of America, 119, 3059–3071.

Perceptual Organization of Speech  25 Houston, D. M., & Bergeson, T. R. (2014). Hearing versus listening: Attention to speech and its role in language acquisition in deaf infants with cochlear implants, Lingua, 139, 10–25. Howell, P., & Darwin, C. J. (1977). Some properties of auditory memory for rapid formant transitions. Memory & Cognition, 5, 700–708. Iverson, P. (1995). Auditory stream segregation by musical timbre: Effects of static and dynamic acoustic attributes. Journal of Experimental Psychology: Human Perception and Performance, 21, 751–763. Jones, M. R., & Boltz, M. (1989). Dynamic attending and responses to time. Psychological Review, 96, 459–491. Jusczyk, P. W. (1997). The discovery of spoken language. Cambridge, MA: MIT Press. Klatt, D. H. (1989). Review of selected models of speech perception. In W. Marslen‐Wilson (Ed.), Lexical representation and process (pp. 169–226). Cambridge, MA: MIT Press. Lackner, J. R., & Goldstein, L. M. (1974). Primary auditory stream segregation of repeated consonant–vowel sequences. Journal of the Acoustical Society of America, 56, 1651–1652. Liberman, A. M., & Cooper, F. S. (1972). In search of the acoustic cues. In A. Valdman (Ed.), Papers in linguistics and phonetics to the memory of Pierre Delattre (pp. 329–338). The Hague: Mouton. Liberman, A. M., Ingemann, F., Lisker, L., et al. (1959). Minimal rules for synthesizing speech. Journal of the Acoustical Society of America, 31, 1490–1499. Liberman, A. M., Isenberg, D., & Rakerd, B. (1981). Duplex perception of cues stop consonants: Evidence for a phonetic mode. Perception & Psychophysics, 30, 133–143. Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36. Licklider, J. C. R. (1946). Effects of amplitude distortion upon the

intelligibility of speech. Journal of the Acoustical Society of America, 18, 429–434. Liebenthal, E., Binder, J. R., Piorkowski, R. L., & Remez, R. E. (2003). Short‐term reorganization of auditory analysis induced by phonetic experience. Journal of Cognitive Neuroscience, 15, 549–558. Lindblom, B. (1996). Role of articulation in speech perception: Clues from production. Journal of the Acoustical Society of America, 99, 1683–1692. Lisker, L. (1978). Rapid vs. rabid: A catalog of acoustic features that may cue the distinction. Haskins Laboratories Status Report on Speech Perception, SR‐54, 127–132. Lotto, A. J., & Kluender, K. R. (1998). General contrast effects in speech perception: Effect of preceding liquid on stop consonant identification. Perception & Psychophysics, 60, 602–619. Lotto, A. J., Kluender, K. R., & Holt, L. L. (1997) Perceptual compensation for coarticulation by Japanese quail. Journal of the Acoustical Society of America, 102, 1135–1140. Magnotti, J. F., & Beauchamp, M. S. (2017). A causal inference model explains perception of the McGurk effect and other incongruent audiovisual speech. PLOS Computational Biology, 13, e1005229. Massaro, D. W. (1994). Psychological aspects of speech perception: Implications for research and theory. In M. A. Gernsbacher (Ed.), Handbook of psycholinguistics (pp. 219–263). San Diego: Academic Press. Mattingly, I. G., Liberman, A. M., Syrdal, A. K., & Halwes, T. G. (1971). Discrimination in speech and nonspeech modes. Cognitive Psychology, 2, 131–157. McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19, R1024–1027. Miller, G. A. (1946). Intelligibility of speech: effects of distortion. In Transmission and reception of sounds under combat conditions (pp. 86–108). Washington, DC: National Defense Research Committee. Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech.

26  Sensing Speech Journal of the Acoustical Society of America, 22, 167–173. Mountcastle, V. B. (1998). Perceptual neuroscience. Cambridge, MA: Harvard University Press. Munhall, K. G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk effect. Perception & Psychophysics, 58, 351–362. Neff, D. L., Jesteadt, W., & Brown, E. L. (1982). The relation between gap discrimination and auditory stream segregation. Perception & Psychophysics, 31, 493–501. Nygaard, L. C. (1993). Phonetic coherence in duplex perception: Effects of acoustic differences and lexical status. Journal of Experimental Psychology, 19, 268–286. Parsons, T. W. (1976). Separation of speech from interfering speech by means of harmonic selection. Journal of the Acoustical Society of America, 60, 911–918. Peña, M., Bonatti, L. L., Nespor, M., & Mehler, J. (2002). Signal‐driven computations in speech processing. Science, 298, 604–607. Pisoni, D. B., Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception & Psychophysics, 15, 285–290. Rand, T. C. (1974). Dichotic release from masking for speech. Journal of the Acoustical Society of America, 55, 678–680. Remez, R. E. (2001). The interplay of phonology and perception considered from the perspective of perceptual organization. In E. Hume & K. Johnson (Eds), The role of speech perception in phonology (pp. 27–52). San Diego: Academic Press. Remez, R. E. (2008). Sine‐wave speech. In E. M. Izhikovitch (Ed.), Encyclopedia of computational neuroscience. Scholarpedia, 3, 2394. Remez, R. E., Dubowski, K. R., Davids, M. L., et al. (2011). Estimating speech spectra by algorithm and by hand for synthesis from natural models. Journal of the Acoustical Society of America, 130, 2173–2178.

Remez, R. E., Dubowski, K. R., Ferro, D. F., & Thomas, E. F. (forthcoming) Primitive audiovisual integration in the perception of speech. Remez, R. E., Ferro, D. F., Wissig, S. C., & Landau, C. A. (2008). Asynchrony tolerance in the perceptual organization of speech. Psychonomic Bulletin & Review, 15, 861–865. Remez, R. E., Pardo, J. S., Piorkowski, R. L., & Rubin, P. E. (2001). On the bistability of sine wave analogues of speech. Psychological Science, 12, 24–29. Remez, R. E., & Rubin, P. E. (1984). On the perception of intonation from sinusoidal sentences. Perception & Psychophysics, 35, 429–440. Remez, R. E., & Rubin, P. E. (1993). On the intonation of sinusoidal sentences: Contour and pitch height. Journal of the Acoustical Society of America, 94, 1983–1988. Remez, R. E., Rubin, P. E., Berns, S. M., et al. (1994). On the perceptual organization of speech. Psychological Review, 101, 129–156. Remez, R. E., Rubin, P. E., Nygaard, L. C., & Howell, W. A. (1987). Perceptual normalization of vowels produced by sinu soidal voices. Journal of Experimental Psychology: Human Perception and Performance, 13, 41–60. Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947–950. Remez, R. E., & Thomas, E. F. (2013). Early recognition of speech. Wiley Interdisciplinary Reviews: Cognitive Science, 4, 213–223. Roberts, B., Summers, R. J., & Bailey, P. J. (2010). The perceptual organization of sine‐wave speech under competitive conditions. Journal of the Acoustical Society of America, 128, 804–817. Roberts, B., Summers, R. J., & Bailey, P. J. (2015). Acoustic source characteristics, across‐formant integration, and speech intelligibility under competitive conditions. Journal of Experimental

Perceptual Organization of Speech  27 Psychology: Human Perception and Performance Psychology, 41, 680–691. Rosen, S. M., Fourcin, A. J., & Moore, B. C. J. (1981). Voice pitch as an aid to lipreading. Nature, 291, 150–152. Rosen, S. M., & Iverson, P. (2007). Constructing adequate non‐speech analogues: What is special about speech anyway? Developmental Science, 10, 169–171. Rossing, T. D. (1990). The science of sound. Reading, MA: Addison‐Wesley. Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8‐ month‐old infants. Science, 274, 1926–1928. Seidenberg, M. S., MacDonald, M. C., & Saffran, J. R. (2002). Does grammar start where statistics stop? Science, 298, 553–554. Shannon, R. V., Zeng, F., Kamath, V., et al. (1995). Speech recognition with primarily temporal cues. Science, 270, 303–304. Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87–90. Steiger, H., & Bregman, A. S. (1982). Competition among auditory streaming, dichotic fusion, and diotic fusion. Perception & Psychophysics, 32, 153–162. Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press. Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas & J. L. Miller (Eds), Perspectives on the study of speech (pp. 1–38). Hillsdale, NJ: Lawrence Erlbaum. Stevens, K. N., & House, A. S. (1961). An acoustical theory of vowel production and some of its implications. Journal of Speech and Hearing Research, 4, 303–320. Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.

Summerfield, Q. (1992). Roles of harmonicity and coherent frequency modulation in auditory grouping. In M. E. H. Schouten (Ed.), The auditory processing of speech: From sounds to words (pp. 157–166). Berlin: Mouton de Gruyter. Svirsky, M. A., Robbins, A. M., Kirk, K. I., et al. (2000). Language development in profoundly deaf children with cochlear implants. Psychological Science, 11, 153–158. Toscano, J. C., & McMurray, B. (2010). Cue integration with categories: Weighting acoustic cues in speech using unsupervised learning and distributional statistics. Cognitive Science, 34, 434–464. Vouloumanos, A., & Werker, J. F. (2007). Listening to language at birth: Evidence for a bias for speech in neonates. Developmental Science, 10, 159–171. Warren, R. M., Obusek, C. J., Farmer, R. M., & Warren, R. P. (1969). Auditory sequence: confusion of patterns other than speech or music. Science, 164, 586–587. Wertheimer, M. (1923/1938). “Laws of organization in perceptual forms” (trans. of “Unsuchungen zur Lehre von der Gestalt”). In W. D. Ellis (Ed.), A sourcebook of gestalt psychology (pp. 71–88). London: Routledge & Kegan Paul. Whalen, D. H., & Liberman, A. M. (1987). Speech perception takes precedence over nonspeech perception. Science, 237, 169–171. Whalen, D. H., & Liberman, A. M. (1996). Limits on phonetic integration in duplex perception. Perception & Psychophysics, 58, 857–870. Zevin, J. D., Yang, J., Skipper, J. I., & McCandliss, B. D. (2010). Domain general change detection accounts for “dishabituation” effects in temporal‐ parietal regions in functional magnetic resonance imaging studies of speech perception. Journal of Neuroscience, 30, 1110–1117.

2  Primacy of Multimodal Speech Perception for the Brain and Science

LAWRENCE D. ROSENBLUM AND JOSH DORSI

University of California, Riverside, United States

It may be argued that multimodal speech perception has become one of the most studied topics in all of cognitive psychology. A keyword search for "multimodal speech" in Google Scholar shows that, since early 2005, over 192,000 papers citing the topic have been published. Since that time, the seminal published study on audiovisual speech, McGurk and MacDonald (1976), has been cited in publications over 4,700 times (Google Scholar search). There are likely many reasons for this explosion in multisensory speech research. Perhaps most importantly, this research has helped usher in a new understanding of the perceptual brain. In what has been termed the "multisensory revolution" (e.g. Rosenblum, 2013), research now shows that brain areas and perceptual behaviors long thought to be related to a single sense are modulated by multiple senses (e.g. Pascual-Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Rosenblum, Dias, & Dorsi, 2016; Striem-Amit et al., 2011). This research suggests a degree of neurophysiological and behavioral flexibility with perceptual modality not previously known. That research has been extensively reviewed elsewhere and will not be rehashed here (e.g. Pascual-Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Rosenblum, Dias, & Dorsi, 2016; Striem-Amit et al., 2011). It is relevant, however, that research on audiovisual speech perception has spearheaded this revolution. Certainly, the phenomenological power of the McGurk effect has motivated research into the apparent automaticity with which the senses integrate/merge. Speech also provided the first example of a stimulus that could modulate an area in

Primacy of Multimodal Speech Perception  29 the human brain that was thought to be solely responsible for another sense. In that original report, Calvert and her colleagues (1997) showed that lip‐reading of a silent face could induce activity in the auditory cortex. Since the publication of that seminal study, hundreds of other studies have shown that visible speech can induce cross‐sensory modulation of the human auditory cortex. More generally, thousands of studies have now demonstrated crossmodal modulation of primary and secondary sensory cortexes in humans (for a review, see Rosenblum, Dias, & Dorsi, 2016). These studies have led to a new conception of the brain as a multisensory processing organ, rather than as a collection of separate sensory processing units. This chapter will readdress important issues in multisensory speech perception in light of the enormous amount of relevant research conducted since publication of the first version of this chapter (Rosenblum,  2005). Many of the same topics addressed in that chapter will be addressed here including: (1) the ubiquity and automaticity of multisensory speech in human behavior; (2) the stage at which the speech streams integrate; and (3) the possibility that perception involves detection of a modality‐neutral  –  or supramodal – form of information that is available in multiple streams.

Ubiquity and automaticity of multisensory speech

Since 2005, evidence has continued to grow that supports speech as an inherently multisensory function. It has long been known that visual speech is used to enhance challenging auditory speech, whether that speech is degraded by noise or accent, or simply contains complicated material (e.g. Arnold & Hill, 2001; Bernstein, Auer, & Takayanagi, 2004; Reisberg, McLean, & Goldfield, 1987; Sumby & Pollack, 1954; Zheng & Samuel, 2019). Visual speech information helps us acquire our first language (e.g. Teinonen et al., 2008; for a review, see Danielson et al., 2017) and our second languages (Hardison, 2005; Hazan et al., 2005; Navarra & Soto-Faraco, 2007). The importance of visual speech in language acquisition is also evidenced in research on congenitally blind individuals. Blind children show small delays in learning to perceive and produce segments that are acoustically more ambiguous but visually distinct (e.g. the /m/–/n/ distinction). Recent research shows that these idiosyncratic differences carry through to congenitally blind adults, who show subtle distinctions in speech perception and production (e.g. Delvaux et al., 2018; Ménard, Leclerc, & Tiede, 2014; Ménard et al., 2009, 2013, 2015).

The inherently multimodal nature of speech is also demonstrated by perceivers using and integrating information from a modality that they rarely, if ever, use for speech: touch. It has long been known that deaf-blind individuals can learn to touch the lips, jaw, and neck of a speaker to perceive speech (the Tadoma technique). However, recent research shows just how automatic this process can be for even novice users (e.g. Treille et al., 2014). Novice perceivers (with normal sight and hearing) can readily use felt speech to (1) enhance comprehension of noisy auditory speech (Gick et al., 2008; Sato, Cavé, et al., 2010); (2) enhance lip-reading (Gick et al., 2008); and (3) influence perception of discrepant auditory speech

30  Sensing Speech (Fowler & Dekle, 1991, in a McGurk effect). Consistent with these findings, neurophysiological research shows that touching an articulating face can speed auditory cortex reactions to congruent auditory speech in the same way as is known to occur with visual speech (Treille et al., 2014; Treille, Vilain, & Sato, 2014; and see Auer et al., 2007). Other research shows that the speech function can effectively work with very sparse haptic information. Receiving light puffs of air on the skin in synchrony with hearing voiced consonants (e.g. b) can make those consonants sound voiceless (p; Derrick & Gick,  2013; Gick & Derrick,  2009). In a related example, if a listener’s cheeks are gently pulled down in synchrony with hearing a word that they had previously identified as “head,” they will be more likely to now hear that word as “had” (Ito, Tiede, & Ostry, 2009). The opposite effect occurs if a listener’s cheeks are instead pulled to the side. These haptic speech demonstrations are important for multiple reasons. First, they demonstrate how readily the speech system can make use of  –  and ­integrate  –  even the most novel type of articulatory information. Very few normally sighted and hearing individuals have intentionally used touch information for purposes of speech perception. Despite the odd and often limited nature of haptic speech information, it is readily usable, showing that the speech brain is sensitive to articulation, regardless through which modality it is conveyed. Second, the fact that this information can be used spontaneously despite its novelty may be problematic for integration accounts based on associative learning between the modalities. Both classic auditory accounts of speech perception (Diehl & Kluender,  1989; Hickok,  2009; Magnotti & Beauchamp,  2017) and Bayesian accounts of multisensory integration (Altieri, Pisoni, & Townsend, 2011; Ma et al., 2009; Shams et al., 2011; van Wassenhove, 2013) assume that the senses are effectively bound and integrated on the basis of the associations gained through a lifetime of experience simultaneously seeing and hearing speech utterances. However, if multisensory speech perception were based only on associative experience, it is unclear how haptic speech would be so readily used and integrated by the speech function. In this sense, the haptic speech findings pose an important challenge to associative accounts (see also Rosenblum, Dorsi, & Dias, 2016). Certainly, the most well‐known and studied demonstration of multisensory speech is the McGurk effect (McGurk & MacDonald, 1976; for recent reviews, see Alsius, Paré, & Munhall, 2017; Rosenblum, 2019; Tiippana, 2014). The effect typically involves a video of one type of syllable (e.g. ga) being synchronously dubbed onto an audio recording of a different syllable (ba) to induce a “heard” percept (da) that is strongly influenced by the visual component. The McGurk effect is considered to occur whenever the heard percept is different from that of the auditory component, whether a subject hears a compromise between the audio and visual components (auditory ba + visual ga = heard da) or hears a syllable dominated by the visual component (auditory ba + visual va = heard va). The effect has been demonstrated in multiple contexts, including with segments and speakers of different languages (e.g. Fuster‐Duran, 1996; Massaro et al., 1993; Sams et al., 1998; Sekiyama & Tohkura,  1991,  1993); across development (e.g. 
Burnham & Dodd,  2004; Desjardins & Werker, 2004; Jerger et al., 2014; Rosenblum, Schmuckler, & Johnson,

Primacy of Multimodal Speech Perception  31 1997); with degraded audio and visual signals (Andersen et al., 2009; Rosenblum & Saldaña,  1996; Thomas & Jordan,  2002); and regardless of awareness of the audiovisual discrepancy (Bertelson & De Gelder, 2004; Bertelson et al., 1994; Colin et  al.,  2002; Green et  al.  1991; Massaro,  1987; Soto‐Faraco & Alsius,  2007,  2009; Summerfield & McGrath, 1984). These characteristics have been interpreted as evidence that multisensory speech integration is automatic, and impenetrable to outside influences (Rosenblum, 2005). However, some recent research has challenged this interpretation of integration (for a review, see Rosenblum, 2019). For example, a number of studies have been construed as showing that attention can influence whether integration occurs in the McGurk effect (for reviews, see Mitterer & Reinisch, 2017; Rosenblum, 2019). Adding a distractor to the visual, auditory, or even tactile channels seems to significantly reduce the strength of the effect (e.g. Alsius et al., 2005; Alsius, Navarra, & Soto‐Faraco, 2007; Mitterer & Reinisch, 2017; Tiippana, Andersen, & Sams, 2004; see also Munhall et al., 2009). Unfortunately, relatively few of these studies have also tested unimodal conditions to determine whether these distractors might simply reduce detection of the requisite unimodal information. If, for example, less visual information can be extracted during distraction (of any type), then a reduced McGurk effect would likely be observed. In the few studies that have examined distraction of visual conditions, it seems unlikely that these tests are sufficiently sensitive (given the especially low baseline performance of straight lipreading; Alsius et  al.,  2005; Alsius, Navarra, & Soto‐Faraco,  2007; and for a review of this argument, see Rosenblum, 2019). Thus, to date, it is unclear whether outside attention can truly penetrate the speech integration function or instead simply distracts from the extraction of the visual information for a McGurk effect. Moreover, it could very well be that the McGurk effect itself may not constitute a thorough test of speech integration.

The double-edged sword of the McGurk effect

As stated, the McGurk effect is considered the quintessential demonstration of multisensory integration and has been a key factor motivating our new understanding of the multisensory brain. At the same time, it is often considered a quintessential instance of multisensory integration and has become a litmus test for investigating how integration is affected by myriad factors. Methodologically, the McGurk effect may provide some advantages over other tests of multisensory speech such as visual enhancement of speech in noise. By asking subjects to report on what they are hearing, the effect uses a more implicit measure of visual influence. This fact increases the likelihood that the method is measuring true perceptual rather than post-perceptual/decision-making processes (e.g. Rosenblum, Yakel, & Green, 2000). Additionally, the McGurk method provides advantages in being composed of very short and simple syllable stimuli. Such stimuli allow the effect to be tested in time-constrained imaging contexts, as well as in linguistic contexts for which it is important to limit lexical and semantic influences. Finally, while the

32  Sensing Speech effect has been shown to occur in myriad conditions, its strength and frequency can be variable, lending itself to a useful dependent measure. Consequently, the effect has become a method for establishing under which conditions integration occurs. Measurements of the effect’s strength have been used to determine how multisensory speech perception is affected by: individual differences (see Strand et al., 2014, for a review); attention; and generalized face processing (e.g. Eskelund, MacDonald, & Andersen, 2015; Rosenblum, Yakel, & Green, 2000). The effect has also been used to determine where in the perceptual and neurophysiological process integration occurs and whether integration is complete (for discussions of these topics, see Brancazio & Miller, 2005). However, a number of researchers have recently questioned whether the McGurk effect should be used as a primary test of multisensory integration (Alsius, Paré, & Munhall, 2017; Remez, Beltrone, & Willimetz, 2017; Rosenblum, 2019; Irwin & DiBlasi, 2017; Brown et  al.  2018). There are multiple reasons for these concerns. First, there is wide variability in most aspects of McGurk methodology (for a review, see Alsius, Paré, & Munhall, 2017). Most obviously, the specific talkers used to create the stimuli usually vary from project to project. The dubbing procedure  –  specifically, how the audio and visual components are aligned – also vary across laboratories. Studies will also vary as to which syllables are used, as well as the type of McGurk effect tested (fusion; visual dominance). Procedurally, the tasks (e.g. open response vs. forced choice), stimulus ordering (fully randomized vs. blocked by modality), and the control condition chosen (e.g. audio‐alone vs. audiovisually congruent syllables) vary across studies (Alsius, Paré, & Munhall, 2017). This extreme methodological variability may account for the wide range of McGurk effect strengths reported across the literature. Finding evidence of the effect under such different conditions does speak to its durability. However, the methodological variability makes it difficult to know whether influences on the effect’s strength are attributable to the variable in question (e.g. facial inversion), or to some superfluous characteristic of idiosyncratic stimuli and/or tasks. Another concern about the McGurk effect is whether it is truly representative of natural (nonillusory) multisensory perception (Alsius, Paré, & Munhall, 2017; Remez Beltrone, & Willimetz, 2017). It could very well be that different perceptual and neurophysiological resources are recruited when integrating discrepant rather than congruent audiovisual components. In fact, it has long been known that McGurk‐effect syllables (e.g. audio ba + visual a = va) are less compelling and take longer to identify (Brancazio,  2004; Brancazio, Best, & Fowler, 2006; Green & Kuhl,  1991; Jerger et  al.,  2017; Massaro & Ferguson,  1993; Rosenblum & Saldaña, 1992) than analogous audiovisual congruent syllables (audio va + visual va = va). This is true even when McGurk syllables are identified with comparable frequency (98 percent va; Rosenblum & Saldaña, 1992) to the congruent syllables. Relatedly, there is evidence that, when spatial and temporal offsets are applied to the audio and visual components, McGurk stimuli are more readily perceived as separate components than as audiovisually congruent syllables (e.g. Bishop & Miller, 2011; van Wassenhove, Grant, & Poeppel, 2007).

Primacy of Multimodal Speech Perception  33 There are also differences in neurophysiological responses to McGurk compared to congruent syllables (for a review, see Alsius, Paré, & Munhall, 2017), even when these are identified as the same segment, and with the same frequency. For example, there is more involvement of the superior temporal sulcus (STS) when perceiving McGurk compared to audiovisually congruent stimuli (e.g. Beauchamp, Nath, & Pasalar, 2010; Nath & Beauchamp, 2012; Münte et al., 2012; but see Baum et al., 2012; Baum & Beauchamp,  2014). Relative to congruent stimuli, McGurk stimuli also induce different cortical temporal reactions and neural synchrony patterns relative to analogous audiovisually congruent syllables (Fingelkurts et  al.,  2003; Hessler et al., 2013). Additional evidence that the McGurk effect may not be representative of normal integration comes from intersubject differences. It turns out that there is little evidence for a correlation between a subject’s likelihood to display a McGurk effect and their benefit in using visual speech to enhance noisy auditory speech (at least in normal hearing subjects; e.g. Van Engen, Xie, & Chandrasekaran, 2016; but see Grant & Seitz, 1998). Relatedly, the relationship between straight lip‐reading skill and susceptibility to the McGurk effect is weak at best (Cienkowski & Carney, 2002; Strand et al., 2014; Wilson et al., 2016; Massaro et al., 1986). A particularly troubling concern regarding the McGurk effect is evidence that its failure does not mean integration has not occurred (Alsius, Paré, & Munhall, 2017; Rosenblum,  2019). Multiple studies have shown that when the McGurk effect seems to fail and a subject reports hearing just the auditory segment (e.g. auditory /b/ + visual /g/ = perceived /b/), the influences of the visual, and perhaps integrated, segment are present in the gestural nuances of the subject’s spoken response (Gentilucci & Cattaneo, 2005; Sato et al., 2010; see Rosenblum, 2019 for further discussion). In another example, Brancazio and Miller (2005) showed that in instances when a visual /ti/ failed to change identification of an audible /pi/, a simultaneous manipulation of spoken rate of the visible /ti/ did influence the voice‐onset time perceived in the /pi/ (see also Green & Miller, 1985). Thus, information for voice‐onset time was integrated across the visual and audible syllables even when the McGurk effect failed to change the identification of the /pi/. It is unclear why featural integration can still occur in the face of a failed McGurk effect (Rosenblum, 2019; Alsius, Paré, & Munhall, 2017). It could be that standard audiovisual segment integration does occur in these instances, but the resultant segment does not change enough to be categorized differently. As stated, perceived segments based on McGurk stimuli are less robust than audiovisually congruent (or audio‐alone) perceived segments. It could be that some integration almost always occurs for McGurk segments, but the less canonical integrated segment sometimes induces a phonetic categorization that is the same as the auditory‐alone segment. Regardless, the fact that audiovisual integration of some type can occur when the McGurk effect appears to fail forces a reconsideration of the effect as a primary test of integration. For all of these reasons, a number of authors, including ourselves, have suggested that less weight be placed on the McGurk effect in evaluating multisensory integration. 
Evaluation of integration may be better served with measures of the

perceptual super-additivity of visual and audio (e.g. in noise) streams (e.g. Alsius, Paré, & Munhall, 2017; Irwin & DiBlasi, 2017; Remez, Beltrone, & Willimetz, 2017); influences on speech-production responses (Gentilucci & Cattaneo, 2005; and see Sato et al., 2010); and neurophysiological responses (e.g. Skipper et al., 2007). Such methods may very well be more stable, valid, and representative indexes of integration than the McGurk effect.
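As one indication of what a super-additivity index might look like in practice, the sketch below computes two simple measures from identification accuracies. The proportion-correct scores are invented, and the formulas are generic illustrations (a gain normalized by the room left for improvement, and a comparison against a capped additive baseline) rather than the specific analyses used in the studies cited above.

# Illustrative only: two simple indexes of audiovisual benefit computed from
# hypothetical proportion-correct scores; values and formulas are for exposition.
def visual_gain(a_only: float, av: float) -> float:
    """Improvement over audio-alone accuracy, scaled by the room left to improve."""
    return (av - a_only) / (1.0 - a_only)

def superadditivity(a_only: float, v_only: float, av: float) -> float:
    """How far audiovisual accuracy exceeds a capped additive prediction."""
    return av - min(a_only + v_only, 1.0)

a, v, av = 0.30, 0.15, 0.70   # invented accuracies for speech in noise
print(f"relative visual gain: {visual_gain(a, av):.2f}")
print(f"super-additivity:     {superadditivity(a, v, av):+.2f}")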

Multimodal speech is integrated at the earliest observable stage

The question of where in the speech function the modal streams integrate (merge) continues to be one of the most studied questions in the multisensory literature. Since 2005, much of this research has used neurophysiological methods. After the aforementioned fMRI report by Calvert and her colleagues (1997; see also Pekkola et al., 2005), numerous studies have also shown visual speech activation of the auditory cortex, often using other technologies, for example, functional near-infrared spectroscopy (fNIRS; van de Rijt et al., 2016); electroencephalography (EEG; Callan et al., 2001; Besle et al., 2004); intracranial EEG (ECoG; e.g. Besle et al., 2008); and magnetoencephalography (MEG; Arnal et al., 2009; for a review, see Rosenblum, Dorsi, & Dias, 2016). More recent evidence shows that visual speech can modulate neurophysiological areas considered to be further upstream, including the auditory brainstem (Musacchia et al., 2006), which is one of the earliest locations at which direct visual modulation could occur. There is even evidence of visual speech modulation of cochlear functioning (otoacoustic emissions; Namasivayam et al., 2015). While it is likely that visual influences on such peripheral auditory mechanisms are based on feedback from downstream areas, that they occur at all indicates the importance of visual input to the speech function.

Other neurophysiological findings suggest that the integration of the streams also happens early. A very recent EEG study revealed that N1 auditory-evoked potentials (known to reflect primary auditory cortex activity) for visually induced (McGurk) fa and ba syllables (auditory ba + visual fa; auditory fa + visual ba, respectively) resemble the N1 responses for the corresponding auditory-alone syllables (Shahin et al., 2018; and see van Wassenhove, Grant, & Poeppel, 2005). The degree of resemblance was larger for individuals whose identification responses showed greater visual influences, suggesting that this modulated auditory cortex activity (reflected in N1) corresponds to an integrated perceived segment. This finding is less consistent with the alternative model that separate unimodal analyses are first conducted at primary cortexes, with their outcomes then combined at a multisensory integrator, such as the posterior STS (pSTS; e.g. Beauchamp et al., 2004). Other findings suggest that visual modulation of the auditory cortex (as it responds to sound) happens too quickly for an additional integrative step to be part of the process (for a review, see Besle et al., 2004). In fact, there is evidence that adding congruent visual speech to auditory speech input speeds up ERP and MEG reactions in the auditory cortex (van Wassenhove, Grant, & Poeppel, 2005; Hertrich

Primacy of Multimodal Speech Perception  35 et al., 2009). This facilitation could be a result of visible articulatory information for a segment often being available before the auditory information (see Venezia, Thurman, et  al., 2016 for a review). This could allow visual speech to partially serve a sort of priming function – or a cortical preparedness – to speed the auditory function for speech (e.g. Campbell,  2011; Hertrich et  al.,  2009). Regardless, it is clear that, as the neuroscientific technology improves, it continues to show crossmodal influences as early as can be observed. This pattern of results is analogous to recent nonspeech findings which similarly demonstrate early audiovisual integration (e.g. Shams et al., 2005; Watkins et al., 2006; for a review, see Rosenblum et al., 2016). The behavioral research also continues to show evidence of early crossmodal influences (for a review, see Rosenblum, Dorsi, & Dias, 2016). Evidence suggests that visual influences likely occur before auditory feature extraction (e.g. Brancazio, Miller, & Paré, 2003; Fowler, Brown, & Mann, 2000; Green & Gerdeman, 1995; Green & Kuhl,  1989; Green & Miller,  1985; Green & Norrix,  2001; Schwartz, Berthommier, & Savariaux, 2004). Other research shows that information in one modality is able to facilitate perception in the other even before the information is usable – and sometimes even detectable – on its own (e.g. Plass et al., 2014). For example, Plass and his colleagues (2014) used flash suppression to render visually presented articulating faces (consciously) undetectable. Still, if these undetected faces were presented with auditory speech that was consistent and synchronized with the visible articulation, then subjects were faster at recognizing that auditory speech. This suggests that useful crossmodal influences can occur even without awareness of information in one of the modalities. Other examples of the extreme super‐additive nature of speech integration have been shown in the context of auditory speech detection (Grant & Seitz,  2000; Grant,  2001; Kim & Davis,  2004; Palmer & Ramsey,  2012) and identification (Schwartz, Berthommier, & Savariaux, 2004), as well audiovisual speech identification (Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981). Much of this research has been interpreted to suggest that, even without its own (conscious) clear phonetic determination, each modality can help the perceiver attend to critical information in the other modality through analogous patterns of temporal change in the two signals. These crossmodal correspondences are thought to be influential at an especially early stage (before feature extraction) to serve as a “bimodal coherence‐masking protection” against everyday signal degradation (e.g. Grant & Seitz, 2000; Kim & Davis, 2004; Schwartz, Berthommier, & Savariaux, 2004; see also Gordon,  1997). The impressive utility of these crossmodal correspondences will also help motivate the theoretical position proposed later in this chapter. However, other recent results have been interpreted as suggesting that additional linguistic analyses are conducted on the individual streams before, or concurrent with, integration. For example, a literature has emerged showing that the McGurk effect can be influenced by lexicality and semantic (sentence) context (e.g. Brancazio, 2004; Barutchu et al., 2008; but see Sams et al., 1998; Windmann, 2004,  2007). In one example, audio /ba/ paired with visual /va/, is perceived

36  Sensing Speech more often as va when presented in the context of the word valve than in the nonword vatch (Brancazio, 2004). This could mean that the analysis of each individual stream proceeds for some time before influencing the likelihood of audiovisual integration. However, other interpretations of these results have been offered which are consistent with early integration (Brancazio,  2004; Rosenblum,  2008). It may be that lexicality and sentence context does not bear on the likelihood of integration, but instead on how the post‐integrated segment is categorized. As stated, it is likely that syllables perceived from conflicting audiovisual information are less canonical than those based on congruent (or audio‐alone) information. This fact likely makes those syllables less robust, even when they are being identified as visually influenced segments. This could mean that, despite incongruent segments being fully integrated, the resultant perceived segment is more susceptible to contextual (e.g. lexical) influences than audiovisually congruent (and auditory‐alone) segments. This is certainly known to be the case for less canonical, more ambiguous audio‐alone segments as demonstrated in the Ganong effect, that is, when an ambiguous segment equally heard as k or g in isolation will be heard as the former when placed in front of the syllable iss, but as the latter if heard in front of ift (Connine & Clifton, 1987; Ganong, 1980). If the same is true of incongruent audiovisual segments, then lexical context may not bear on audiovisual integration as such, but on the categorization of the post‐integrated (and less canonical) segment (e.g. Brancazio, 2004). Still, other recent evidence has been interpreted as showing that a semantic analysis is conducted on the individual streams before integration is fully complete (see also Bernstein, Auer, & Moore, 2004). Ostrand and her colleagues (2016) present data showing that, despite a McGurk word being perceived as visually influenced (e.g. audio bait + visual date = heard date), the auditory component of the stimulus provides stronger priming of semantically related auditory words (audio bait + visual date primes worm more strongly than it primes calendar). This finding could suggest that the auditory component goes through a semantic analysis before it is merged with the visual component and provides stronger priming than the visible word. If this contention were true, then it would mean that the channels are not fully integrated until at least a good amount of processing has occurred on the individual channels. A more recent test of this question has provided very different results, however (Dorsi, Rosenblum, & Ostrand, 2017). For this purpose, our laboratory used word combinations that begin with consonants known to produce a very strong McGurk effect (e.g. audio boat + visual vote = heard vote). Using these stimuli, we found that it was the visually influenced word that provided stronger priming than the word comprising the auditory component (audio boat + visual vote primes election more strongly than it primes dock). Follow‐up analyses on both our own and Ostrand et al.’s (2016) original stimuli suggest that the degree to which an audiovisual word is actually identified as visually influenced (showing the McGurk effect), the more likely it is to show greater priming from the visually influenced word. This could mean that Ostrand et  al.’s original findings might have been

Primacy of Multimodal Speech Perception  37 based on a stimulus set that did not induce many McGurk effects. The findings also suggest that the streams are functionally integrated by the time semantic analysis occurs. In sum, much of the new results from the behavioral, and especially neurophysiological, research suggest that the audio and visual streams are merged as early as can be currently observed (but see Bernstein, Auer, & Moore, 2004). In the previous version of this chapter we argued that this fact, along with the ubiquity and automaticity of multisensory speech, suggests that the speech function is designed around multisensory input (Rosenblum, 2005). We further argued that the function may make use of the fact that there is a common informational form across the modalities. This contention will be addressed in the final section of this chapter.

Supramodal speech information

The notion that the speech mechanism may be sensitive to a form of information that is not tied to a specific sensory modality has been discussed for over three decades (e.g. Summerfield, 1987). This construal of multisensory speech information has been alternatively referred to as amodal, modality-neutral (e.g. Rosenblum, 2005), and supramodal (Fowler, 2004; Rosenblum et al., 2016, 2017). The theory suggests a speech mechanism that is sensitive to a form of information that can be instantiated in multiple modalities. Such a mechanism would not need to contend with translating information across modality-specific codes, or to involve a formal process of sensory integration (or merging), as such. From this perspective, the integration is a characteristic of the relevant information itself. Of course, the energetic details of the (light, sound, tactile-mechanical) input and their superficial receptor reactions are necessarily distinct. But the deeper speech function may act to register the phonetically relevant higher-order patterns of energy that can be functionally the same across modalities.

The supramodal theory has been motivated by the characteristics of multisensory speech discussed earlier, including: (1) neurophysiological and behavioral evidence for the automaticity and ubiquity of multisensory speech; (2) neurophysiological evidence for a speech mechanism sensitive to multiple sensory forms; (3) neurophysiological and behavioral evidence for integration occurring at the earliest observable stage; and (4) informational analyses showing a surprisingly close correlation between optic and acoustic informational variables for a given articulatory event. The theory is consistent with Carol Fowler's direct approach to speech perception (e.g. Fowler, 1986, 2010) and James Gibson's theory of multisensory perception (Gibson, 1966, 1979; and see Stoffregen & Bardy, 2001). The theory is also consistent with the task-machine and metamodal theories of general multisensory perception, which argue that function and task, rather than sensory system, is the guiding principle of the perceptual brain (e.g. Pascual-Leone & Hamilton, 2001; Reich, Maidenbaum, & Amedi, 2012; Ricciardi et al., 2014; Striem-Amit et al., 2011; see also Fowler, 2004; Rosenblum, 2013; Rosenblum, Dias, & Dorsi, 2017).

It will be argued that this thesis, which was presented in the first version of this chapter (Rosenblum, 2005), continues to gain supportive evidence (Rosenblum, Dorsi, & Dias, 2016; Rosenblum, Dias, & Dorsi, 2017). Throughout, the term supramodal information will be used instead of modality-neutral information, which was used in the previous version of the chapter. This change has been made largely to be consistent with other theories outside of adult speech perception (neurophysiological; behavioral) which now make very similar claims (e.g. Papale et al., 2016; Ricciardi et al., 2014; Zilber et al., 2014).

Specific examples of supramodal information

Summerfield (1987) was the first to suggest that the informational form for certain articulatory actions can be construed as the same across vision and audition. As an intuitive example, he suggested that the higher-order information for a repetitive syllable would be the same in sound and light. Consider a speaker repetitively articulating the syllable /ma/. For hearing, a repetitive oscillation of the amplitude and spectral structure of the acoustic signal would be lawfully linked to the repetitive movements of the lips, jaw, and tongue. For sight, a repetitive restructuring of the light reflecting from the face would also be lawfully linked to the same movements. While the energetic details of the information differ across modalities, the more abstract repetitive informational restructuring occurs in both modalities in the same oscillatory manner, with the same time course, so as to be specific to the articulatory movements. Thus, repetitive informational restructuring could be considered supramodal information – available in both the light and the sound – that acts to specify a speech event of repetitive articulation. A speech mechanism sensitive to this form of supramodal information would function without regard to the sensory details specific to each modality: the relevant form of information exists in the same way (abstractly defined) in both modalities. In this sense, a speech function that could pick up on this abstract form of information in multiple modalities would not require integration or translation of the information across modalities.

Summerfield (1987) offered other examples of supramodal information, such as how quantal changes in articulation (e.g. bilabial contact to no contact) and reversals in articulation (e.g. during articulation of a consonant–vowel–consonant such as /wew/) would be accompanied by corresponding quantal and reversal changes in the acoustic and optic structure.

More formal examinations of supramodal information have been provided by Vatikiotis-Bateson and his colleagues (Munhall & Vatikiotis-Bateson, 2004; Yehia, Kuratate, & Vatikiotis-Bateson, 2002; Yehia, Rubin, & Vatikiotis-Bateson, 1998). These researchers have shown high correlations between amplitude/spectral changes in the acoustical signal, kinematic changes in optical structure (measured mouth movements extracted from video), and changing vocal tract configurations (measured with a magnetometer). The researchers report that information visible on the face captures between 70 and 85 percent of the variance contained in the

Vatikiotis-Bateson and his colleagues also found a close relationship between subtle nodding motions of the head and fundamental frequency (F0), which is potentially informative about prosodic dimensions (Yehia, Kuratate, & Vatikiotis-Bateson, 2002). Other researchers have shown similar close relationships between articulatory motions, spectral changes, and visible movements using a wide variety of talkers and speech material (e.g. Barker & Berthommier, 1999; Jiang et al., 2002). These strikingly strong moment-to-moment correspondences between the acoustic and visual signals suggest that the streams can take a common form. Other recent research has determined that some of the strongest correlations across audible and visible signals lie in the acoustic range of 2–3 kHz (Chandrasekaran et al., 2009). This may seem unintuitive because it is within this range that the presumably less visible articulatory movements of the tongue and pharynx play their largest role in sculpting the sound. However, the configurations of these articulators were shown to systematically influence subtle visible mouth movements. This fact suggests that there is a class of visible information that strongly correlates with the acoustic information formed by internal articulators. In fact, visual speech research has shown that the presumably "hidden" articulatory dimensions (e.g. lexical tone, intraoral pressure) are actually visible in corresponding face surface changes, and can be used as speech information (Burnham et al., 2000; Han et al., 2018; Munhall & Vatikiotis-Bateson, 2004). That visible mouth movements can inform about internal articulation may explain a striking recent finding. It turns out that, when observers are shown cross-sectional ultrasound displays of internal tongue movements, they can readily integrate these novel displays with synchronized auditory speech information (D'Ausilio et al., 2014; see also Katz & Mehta, 2015). The strong correspondences between auditory and visual speech information have allowed auditory speech to be synthesized based on tracking kinematic dimensions available on the face (e.g. Barker & Berthommier, 1999; Yehia, Kuratate, & Vatikiotis-Bateson, 2002). Conversely, the correspondences have allowed facial animation to be effectively created based on direct acoustic signal parameters (e.g. Yamamoto, Nakamura, & Shikano, 1998). There is also evidence for surprisingly close correspondences between audible and visible macaque calls, which macaques can easily perceive as corresponding (Ghazanfar et al., 2005). This finding may suggest a traceable phylogeny of the supramodal basis for multisensory communication. Importantly, there is evidence that perceivers make use of these crossmodal informational correspondences. While the supramodal thesis proposes that the relevant speech information takes a supramodal higher-order form, the degree to which this information is simultaneously available in both modalities depends on a number of factors (e.g. visibility, audibility). The evidence shows that, in contexts for which the information is simultaneously available, perceivers take advantage of this correspondence (e.g. Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004; Palmer & Ramsey, 2012; Schwartz, Berthommier, & Savariaux, 2004; Eskelund, Tuomainen, & Andersen, 2011; Rosen, Fourcin, & Moore, 1981).

Research shows that the availability of segment-to-segment correspondence across the modalities' information strongly predicts how well one modality will enhance the other (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004). Functionally, this finding supports the aforementioned "bimodal coherence-masking protection," in that the informational correspondence across modalities allows one modality to boost the usability of the other (e.g. in the face of everyday masking degradation). In this sense, the supramodal thesis is consistent with the evidence supporting the bimodal coherence-masking protection concept discussed earlier (Grant & Seitz, 2000; Grant, 2001; Kim & Davis, 2004). However, the supramodal thesis does go further by suggesting that: (1) crossmodal correspondences are much more common and complex; and (2) the abstract form of information that supports these correspondences is the primary type of information the speech mechanism uses (regardless of the degree of moment-to-moment correspondence or the specific availability of information in a modality).

General examples of supramodal information

While some progress has been made in identifying the detailed ways in which information takes the same specific form across modalities, more progress has been made in establishing the general ways in which the informational forms are similar. In the previous version of this chapter, it was argued that both auditory and visual speech show an important primacy of time-varying information (Rosenblum, 2005; see also Rosenblum, 2008). At the time that chapter was written, many descriptions of visual speech information were based on static facial features, and still images were often used as stimuli. Since then, nearly all methodological and conceptual interpretations of visual speech information have incorporated a critical dynamic component (e.g. Jesse & Bartoli, 2018; Jiang et al., 2007). This contemporary emphasis on time-varying information exists in both the behavioral and the neurophysiological research. A number of studies have examined how dynamic facial dimensions are extracted and stored for purposes of both phonetic and indexical perception (for a review, see Jesse & Bartoli, 2018). Other studies have shown that moment-to-moment visibility of articulator movements (as conveyed through discrete facial points) is highly predictive of lip-reading performance, suggesting that kinematic dimensions provide highly salient information for lip-reading (Jiang et al., 2007). Other research has examined the neural mechanisms activated when perceiving dynamic speech information. For example, there is evidence that the mechanisms involved during perception of speech from isolated kinematic (point-light) displays differ from those involved in recognizing speech from static faces (e.g. Santi et al., 2003). At the same time, brain reactivity to the isolated motion of point-light speech does not qualitatively differ from reactivity to normal (fully illuminated) speaking faces (Bernstein et al., 2011). These neurophysiological findings are consistent with the primacy of time-varying visible speech dimensions, which, in turn, is analogous to the same primacy in audible speech (Rosenblum, 2005).

A second general way in which auditory and visual speech information takes a similar form is in how it interacts with, and informs about, indexical properties. As discussed in the previous version of this chapter, there is substantial research showing that both auditory and visual speech functions make use of talker information to facilitate phonetic perception (for reviews, see Nygaard, 2005; Rosenblum, 2005). It is easier to understand speech from familiar speakers (e.g. Borrie et al., 2013; Nygaard, 2005), and easier to lip-read from familiar faces, even for observers who have no formal lip-reading experience (e.g. Lander & Davies, 2008; Schweinberger & Soukup, 1998; Yakel, Rosenblum, & Fortier, 2000). In these talker-facilitation effects, it could be that an observer's phonetic perception is facilitated by their familiarity with the separate vocal and facial characteristics provided by each modality. However, research conducted in our lab suggests that perceivers may also gain experience with the deeper, supramodal talker dimensions available across modalities (Rosenblum, Miller, & Sanchez, 2007; Sanchez, Dias, & Rosenblum, 2013). Our research shows that the talker experience gained through one modality can be shared across modalities to facilitate phonetic perception in the other. For example, becoming familiar with a talker by lip-reading them (without sound) for one hour allows a perceiver to then better understand that talker's auditory speech (Rosenblum, Miller, & Sanchez, 2007). Conversely, listening to the speech of a talker for one hour allows a perceiver to better lip-read from that talker (Sanchez, Dias, & Rosenblum, 2013). Interestingly, this crossmodal talker facilitation works for both old words (perceived during familiarization) and new words, suggesting that the familiarity is not contained in specific lexical representations (Sanchez, Dias, & Rosenblum, 2013). Instead, the learned supramodal dimensions may be based on talker-specific phonetic information contained in the idiolect of the perceived talker (e.g. Remez, Fellowes, & Rubin, 1997; Rosenblum et al., 2002). This interpretation can also explain our finding that learning to identify talkers can be shared across modalities (Simmons et al., 2015). In this demonstration, idiolectic information was isolated visually through a point-light technique and audibly through sinewave resynthesis (e.g. Remez et al., 1997; Rosenblum, Yakel, & Baseer, 2002). With these methods, we observed that learning to recognize talkers through point-light displays transfers to allow better recognition of the same speakers heard in sinewave sentences. No doubt, our findings are related to past observations that perceivers can match a talker's voice and speaking face, even when both signals are rendered as isolated phonetic information (sinewave speech and point-light speech; Lachs & Pisoni, 2004). In all of these examples, perceivers may be learning the particular idiolectic properties of talkers' articulation, which can be informed by both auditory and visual speech information. We have termed this interpretation of these findings the supramodal learning hypothesis. The hypothesis simply argues that part of what the speech function learns through experience is the supramodal properties related to a talker's articulation.
Because these articulatory properties are distal in nature, experience gained in one modality can be shared across modalities to support crossmodal talker facilitation, learning, and matching.
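To make the sinewave resynthesis technique mentioned above more concrete, the sketch below shows the basic idea in schematic form: speech is replaced by a few sinusoids whose frequencies follow the center frequencies of the formants, discarding natural voice quality while preserving the time-varying phonetic (and idiolectic) structure. The formant and amplitude tracks here are invented for illustration; in actual studies they are measured from recorded speech.

```python
import numpy as np

def sinewave_speech(freq_tracks, amp_tracks, sr=16000):
    """Sum of sinusoids whose frequencies follow formant-like tracks.

    freq_tracks, amp_tracks: arrays of shape (n_tones, n_samples);
    frequencies in Hz, amplitudes on a linear scale.
    """
    phase = 2 * np.pi * np.cumsum(freq_tracks, axis=1) / sr  # phase accumulation
    return (amp_tracks * np.sin(phase)).sum(axis=0)

# Invented tracks: three "formants" gliding over one second.
sr, n = 16000, 16000
f1 = np.linspace(300, 700, n)      # F1 rising, as with an opening vocal tract
f2 = np.linspace(2200, 1100, n)    # F2 falling
f3 = np.full(n, 2600.0)            # F3 roughly steady
freqs = np.stack([f1, f2, f3])
amps = np.stack([np.full(n, 0.5), np.full(n, 0.3), np.full(n, 0.2)])
tones = sinewave_speech(freqs, amps, sr)   # a sinewave-speech-style signal
```

Because only the time-varying formant pattern survives, such signals strip away the characteristic sound of a voice while preserving the talker's articulatory dynamics, which is what makes them useful for isolating phonetic and idiolectic information in the studies above.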

We further argue that the supramodal learning hypothesis helps explain bimodal training benefits recently reported in the literature. Bimodal training benefits occur when an observer is able to better understand degraded auditory speech after first being trained with congruent visual speech added to the degraded signal (Bernstein et al., 2013; Bernstein, Eberhardt, & Auer, 2014; Eberhardt, Auer, & Bernstein, 2014; Kawase et al., 2009; Lidestam et al., 2014; Moradi et al., 2019; Pilling & Thomas, 2011; but see Wayne & Johnsrude, 2012). For example, vocoded auditory speech is easier to understand on its own if a perceiver is first trained to listen to vocoded speech while seeing congruent visual speech. Bimodal training effects are also known to facilitate: (1) talker recognition from auditory speech (Schall & von Kriegstein, 2014; Schelinski, Riedel, & von Kriegstein, 2014; Sheffert et al., 2002; von Kriegstein et al., 2008; von Kriegstein & Giraud, 2006); (2) talker-familiarity effects; (3) language development (Teinonen et al., 2008); and (4) second-language learning (Hardison, 2005; Hazan et al., 2005). (There are also many examples of bimodal training benefits outside of speech perception, suggesting that this may be a general learning strategy of the brain; e.g. Shams et al., 2011.) Importantly, neurophysiological correlates of many of these effects have revealed mechanisms that can be modulated by bimodal learning. It has long been known that the posterior superior temporal sulcus (pSTS) responds when observers are asked to report the speech they either see or hear. More recent research suggests that this activation is enhanced when a perceiver sees or hears a talker with whom they have some audiovisual experience (von Kriegstein & Giraud, 2006; von Kriegstein et al., 2005). If observers are tasked, instead, with identifying the voice of a talker, they show activation in an area associated with face recognition (the fusiform face area; von Kriegstein et al., 2008; von Kriegstein & Giraud, 2006). This activation is also enhanced by prior audiovisual exposure to the talker (von Kriegstein & Giraud, 2006). These findings are consistent with the possibility that observers are learning talker-specific articulatory properties, as the supramodal learning hypothesis suggests. Other theories have arisen to explain these bimodal training benefits, including: (1) tacit recruitment of the associated face dimensions when later listening to the auditory-alone speech (Riedel et al., 2015; Schelinski, Riedel, & von Kriegstein, 2014); and (2) greater access to auditory primitives based on previously experienced associations with the visual speech component (Bernstein et al., 2013; Bernstein, Eberhardt, & Auer, 2014). However, both of these theories are based on a mechanism that requires experiencing associations between concurrent audio and visual streams in order to improve subsequent audio-alone speech perception. Recall, however, that our crossmodal talker-facilitation findings (Rosenblum et al., 2007) show that such bimodal experience is not necessary for later auditory speech facilitation. Accordingly, we argue that at least some component of bimodal training benefits is based on both modalities providing common talker-specific information. To examine this possibility, our laboratory is currently testing whether a bimodal training benefit can, in fact, occur without the associations afforded by concurrent audio and visual speech information, as the supramodal learning hypothesis would predict.
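The noise-vocoded speech used as the degraded signal in several of these training studies can be sketched briefly. The following is a minimal illustration of noise-band vocoding, in which each frequency band's fine structure is replaced by noise modulated by that band's amplitude envelope; the channel count and band edges are arbitrary choices for the example, not those of any particular study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(speech, sr, n_channels=4, lo=100.0, hi=6000.0):
    """Noise-band vocoder: keep band envelopes, discard spectral fine structure."""
    edges = np.geomspace(lo, hi, n_channels + 1)          # log-spaced band edges
    noise = np.random.randn(len(speech))
    out = np.zeros(len(speech))
    for low, high in zip(edges[:-1], edges[1:]):
        sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, speech)
        envelope = np.abs(hilbert(band))                  # unsmoothed amplitude envelope
        out += envelope * sosfiltfilt(sos, noise)         # envelope-modulated band noise
    return out / (np.max(np.abs(out)) + 1e-9)             # normalize to avoid clipping
```

A fuller implementation would low-pass filter each envelope and match band levels, but even this stripped-down version conveys why vocoded speech is degraded yet trainable: the slow, articulation-driven amplitude patterns survive while the fine spectral detail of the voice does not.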

Interestingly, because the supramodal learning hypothesis suggests that perceptual experience is of articulatory properties regardless of modality, another surprising prediction can be made. Observers should be able to show a bimodal training benefit using a modality they have rarely, if ever, used before: haptic speech. We have recently shown that, by listening to distorted auditory speech while touching the face of a speaker, observers are later able to understand the distorted speech on its own better than control subjects who touched a still face while listening (Dorsi et al., 2016). These results, together with our crossmodal talker-facilitation findings (Rosenblum et al., 2007; Sanchez et al., 2013), suggest that the experiential basis of bimodal training benefits requires neither long-term experience with the involved modalities nor concurrent presentation of the streams. What is required for a bimodal training benefit is access to some lawful auditory/visual/haptic information for articulatory actions and their indexical properties. In sum, as we argued in our 2005 chapter, both auditory and visual speech share the general informational commonalities of being composed of time-varying information that is intimately tied to indexical information. However, since 2005, another category of informational commonality can be added to this list: information in both streams can act to guide the indexical details of a production response. It is well known that during live conversation each participant's productions are influenced by the indexical details of the speech they have just heard (e.g. Pardo, 2006; Pardo et al., 2013; for a review, see Pardo et al., 2017). This phonetic convergence shows that interlocutors' utterances often subtly mimic aspects of the utterances of the person with whom they are speaking. The phenomenon occurs not only during live interaction, but also when subjects are asked to listen to recorded words and to say each word out loud. There have been many explanations for this phenomenon, including that it helps facilitate the interaction socially (e.g. Pardo et al., 2012). Phonetic convergence may also reveal the tacit connection between speech perception and production, as if the two functions share a "common currency" (e.g. Fowler, 2004). Importantly, recent research from our lab and others suggests that phonetic convergence is not an alignment toward an interlocutor's speech sounds as much as toward their articulatory style, conveyed supramodally. We have shown that, despite having no formal lip-reading experience, perceivers will produce words containing the indexical properties of words they have just lip-read (Miller, Sanchez, & Rosenblum, 2010). Further, the degree to which talkers converge toward lip-read words is comparable to that observed for convergence toward heard words. Other research from our lab shows that, during live interactions, seeing an interlocutor increases the degree of convergence over simply hearing them (Dias & Rosenblum, 2011), and that this increase is based on the availability of visible speech articulation (Dias & Rosenblum, 2016). Finally, it seems that visual information for articulatory features (voice-onset time) can integrate with auditory information to shape convergence (Sanchez, Miller, & Rosenblum, 2010). This finding also suggests that the streams are merged by the time they influence a spontaneous production response.
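Phonetic convergence in such studies is often quantified acoustically with a difference-in-distance style measure: how much closer a shadowed (or post-exposure) production is to the model talker than the same speaker's baseline production was. The sketch below illustrates the arithmetic with invented voice-onset-time values; it is a schematic of this kind of measure, not the specific analysis of any study cited here.

```python
def convergence_did(baseline, shadowed, model):
    """Difference-in-distance convergence score for a single acoustic feature.

    Positive values indicate that the shadowed production moved toward the
    model talker relative to the speaker's own baseline.
    """
    return abs(baseline - model) - abs(shadowed - model)

# Invented voice-onset-time values (ms) for one word:
model_vot, baseline_vot, shadowed_vot = 85.0, 60.0, 72.0
print(convergence_did(baseline_vot, shadowed_vot, model_vot))  # 12.0 ms toward the model
```

Perceptual AXB similarity judgments are commonly used alongside such acoustic measures, since no single acoustic feature captures everything listeners hear as imitation.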

This evidence for multimodal influences on phonetic convergence is consistent with neurophysiological research showing visual speech modulation of speech motor areas. As has been shown for auditory speech, visual speech can induce speech motor system (cortical) activity during lip-reading of syllables, words, and sentences (e.g. Callan et al., 2003, 2004; Hall, Fussell, & Summerfield, 2005; Nishitani & Hari, 2002; Olson, Gatenby, & Gore, 2002; Paulesu et al., 2003). This motor system activity also occurs when a subject is attending to another task and passively perceives visual speech (Turner et al., 2009). Other research shows an increase in motor system activity when visual information is added to auditory speech (e.g. Callan, Jones, & Callan, 2014; Irwin et al., 2011; Miller & D'Esposito, 2005; Swaminathan et al., 2013; Skipper, Nusbaum, & Small, 2005; Skipper et al., 2007; Uno et al., 2015; Venezia, Fillmore, et al., 2016; but see Matchin, Groulx, & Hickok, 2014). This increase is proportionate to the relative visibility of the particular segments present in the stimuli (Skipper, Nusbaum, & Small, 2005). Relatedly, with McGurk-effect types of stimuli (audio /pa/ + video /ka/), segment-specific reactivity in the motor cortex follows the integrated perceived syllable (/ta/; Skipper et al., 2007). This finding is consistent with other research showing that, with transcranial magnetic stimulation (TMS) priming of the motor cortex, electromyographic (EMG) activity in the articulatory muscles follows the integrated segment (Sundara, Namasivayam, & Chen, 2001; but see Sato et al., 2010). These findings are also consistent with our own evidence that phonetic convergence in production responses is based on the integration of audio and visual channels (Sanchez, Miller, & Rosenblum, 2010). There is currently a debate on whether the involvement of motor areas is necessary for audiovisual integration, and for speech perception in general (for a review, see Rosenblum, Dorsi, & Dias, 2016). But it is clear that the speech system treats auditory and visual speech information similarly for priming phonetic convergence in production responses. Thus, phonetic convergence joins critical time-varying and indexical dimensions as an example of general informational commonality across the audio and video streams. In this sense, the recent phonetic convergence research supports a supramodal perspective.

Conclusions

Research on multisensory speech has flourished since 2005. This research has spearheaded a revolution in our understanding of the perceptual brain. The brain is now thought to be largely designed around multisensory input, with most major sensory areas showing crossmodal modulation. Behaviorally, research has shown that even our seemingly unimodal experiences are continuously influenced by crossmodal input, and that the senses have a surprising degree of parity and flexibility across multiple perceptual tasks. As we have argued, research on multisensory speech has provided seminal neurophysiological, behavioral, and phenomenological demonstrations of these principles.

Arguably, as this research has grown, it has continued to support claims made in the first version of this chapter. There is now more evidence that multisensory speech perception is ubiquitous and (largely) automatic. This ubiquity is demonstrated in the new research showing that tactile and kinesthetic speech information can be used, and can readily integrate, with heard speech. Next, the majority of the new research continues to reveal that the streams are integrated at the earliest stages of the speech function. Much of this evidence comes from neurophysiological research showing that auditory brainstem and even cochlear functioning is modulated by visual speech information. Finally, evidence continues to accumulate for the salience of a supramodal form of information. This evidence now includes findings that, like auditory speech, visual speech can act to influence an alignment response, and can modulate motor-cortex activity for that purpose. Other support shows that the speech and talker experience gained through one modality can be shared with another modality, suggesting a mechanism sensitive to the supramodal articulatory dimensions of the stimulus: the supramodal learning hypothesis. There is also recent evidence that can be interpreted as unsupportive of a supramodal approach. Because the supramodal approach claims that "integration" is a consequence of the informational form across modalities, evidence should show that the function is early, impenetrable, and complete. As stated, however, there are findings that have been interpreted as showing that integration can be delayed until after some lexical analysis is conducted on unimodal input (e.g. Ostrand et al., 2016). There is also evidence interpreted as showing that integration is not impenetrable but is susceptible to outside influences, including lexical status and attention (e.g. Brancazio, 2004; Alsius et al., 2005). Finally, there is evidence that has been interpreted to demonstrate that integration is not complete. For example, when subjects are asked to shadow a McGurk-effect stimulus (e.g. responding ada when presented with audio /aba/ and visual /aga/), their shadowed ada response will show articulatory remnants of the individual audio (/aba/) and video (/aga/) components (Gentilucci & Cattaneo, 2005). In principle, all of these findings are inconsistent with a supramodal account. While we have provided alternative interpretations of these findings both in the current chapter and elsewhere (e.g. Rosenblum, 2019), it is clear that more research is needed to test the viability of the supramodal account.

REFERENCES

Alsius, A., Navarra, J., Campbell, R., & Soto-Faraco, S. (2005). Audiovisual integration of speech falters under high attention demands. Current Biology, 15(9), 839–843. Alsius, A., Navarra, J., & Soto-Faraco, S. (2007). Attention to touch weakens

audiovisual speech integration. Experimental Brain Research, 183(3), 399–404. Alsius, A., Paré, M., & Munhall, K. G. (2017). Forty years after Hearing lips and seeing voices: The McGurk effect revisited. Multisensory Research, 31(1–2), 111–144.

46  Sensing Speech Altieri, N., Pisoni, D. B., & Townsend, J. T. (2011). Some behavioral and neurobiological constraints on theories of audiovisual speech integration: A review and suggestions for new directions. Seeing and Perceiving, 24(6), 513–539. Andersen, T. S., Tiippana, K., Laarni, J., et al. (2009). The role of visual spatial attention in audiovisual speech perception. Speech Communication, 29, 184–193. Arnal, L. H., Morillon, B., Kell, C. A., & Giraud, A. L. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience, 29(43), 13445–13453. Arnold, P., & Hill, F. (2001). Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. British Journal of Psychology, 92(2), 339–355. Auer, E. T., Bernstein, L. E., Sungkarat, W., & Singh, M. (2007). Vibrotactile activation of the auditory cortices in deaf versus hearing adults. Neuroreport, 18(7), 645–648. Barker, J. P., & Berthommier, F. (1999). Evidence of correlation between acoustic and visual features of speech. In J. J. Ohala, Y. Hasegawa, M. Ohala, et al. (Eds), Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 5–9). Berkeley: University of California. Barutchu, A., Crewther, S. G., Kiely, P., et al. (2008). When/b/ill with/g/ill becomes/d/ill: Evidence for a lexical effect in audiovisual speech perception. European Journal of Cognitive Psychology, 20(1), 1–11. Baum, S., Martin, R. C., Hamilton, A. C., & Beauchamp, M. S. (2012). Multisensory speech perception without the left superior temporal sulcus. Neuroimage, 62(3), 1825–1832. Baum, S. H., & Beauchamp, M. S. (2014). Greater BOLD variability in older compared with younger adults during

audiovisual speech perception. PLOS One, 9(10), 1–10. Beauchamp, M. S., Nath, A. R., & Pasalar, S. (2010). fMRI‐guided transcranial magnetic stimulation reveals that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30(7), 2414–2417. Bernstein, L. E., Auer, E. T., Jr., Eberhardt, S. P., & Jiang, J. (2013). Auditory perceptual learning for speech perception can be enhanced by audiovisual training. Frontiers in Neuroscience, 7, 1–16. Bernstein, L. E., Auer, E. T., Jr., & Moore, J. K. (2004). Convergence or association? In G. A. Calvert, C. Spence, & B. E. Stein (Eds), Handbook of multisensory processes (pp. 203–220). Cambridge, MA: MIT Press. Bernstein, L. E., Auer, E. T., Jr., & Takayanagi, S. (2004). Auditory speech detection in noise enhanced by lipreading. Speech Communication, 44(1–4), 5–18. Bernstein, L. E., Eberhardt, S. P., & Auer, E. T. (2014). Audiovisual spoken word training can promote or impede auditory‐only perceptual learning: Prelingually deafened adults with late‐ acquired cochlear implants versus normal hearing adults. Frontiers in Psychology, 5, 1–20. Bernstein, L. E., Jiang, J., Pantazis, D., et al. (2011). Visual phonetic processing localized using speech and nonspeech face gestures in video and point‐light displays. Human Brain Mapping, 32(10), 1660–1676. Bertelson, P., & de Gelder, B. (2004). The psychology of multi‐sensory perception. In C. Spence & J. Driver (Eds), Crossmodal space and crossmodal attention (pp. 141–177). Oxford: Oxford University Press. Bertelson, P., Vroomen, J., Wiegeraad, G., & de Gelder, B. (1994). Exploring the relation between McGurk interference and ventriloquism. In Proceedings of the

Primacy of Multimodal Speech Perception  47 Third International Congress on Spoken Language Processing (pp. 559–562). Yokohama: Acoustical Society of Japan. Besle, J., Fort, A., Delpuech, C., & Giard, M. H. (2004). Bimodal speech: Early suppressive visual effects in human auditory cortex. European Journal of Neuroscience, 20(8), 2225–2234. Besle, J., Fischer, C., Bidet‐Caulet, A., et al. (2008). Visual activation and audiovisual interactions in the auditory cortex during speech perception: Intracranial recordings in humans. Journal of Neuroscience, 24, 14301–14310. Bishop, C. W., & Miller, L. M. (2011). Speech cues contribute to audiovisual spatial integration. PLOS One, 6(8), e24016. Borrie, S. A., McAuliffe, M. J., Liss, J. M., et al. (2013). The role of linguistic and indexical information in improved recognition of dysarthric speech. Journal of the Acoustical Society of America, 133(1), 474–482. Brancazio, L. (2004). Lexical influences in audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 30(3), 445–463. Brancazio, L., Best, C. T., & Fowler, C. A. (2006). Visual influences on perception of speech and nonspeech vocal‐tract events. Language and Speech, 49(1), 21–53. Brancazio, L., & Miller, J. L. (2005). Use of visual information in speech perception: Evidence for a visual rate effect both with and without a McGurk effect. Attention, Perception, & Psychophysics, 67(5), 759–769. Brancazio, L., Miller, J. L., & Paré, M. A. (2003). Visual influences on the internal structure of phonetic categories. Perception & Psychophysics, 65(4), 591–601. Brown, V., Hedayati, M., Zanger, A., et al. (2018). What accounts for individual differences in susceptibility to the McGurk effect? PLOS ONE, 13(11), e0207160. Burnham, D., Ciocca, V., Lauw, C., et al. (2000). Perception of visual information

for Cantonese tones. In M. Barlow & P. Rose (Eds), Proceedings of the Eighth Australian International Conference on Speech Science and Technology (pp. 86–91). Canberra: Australian Speech Science and Technology Association. Burnham, D. K., & Dodd, B. (2004). Auditory–visual speech integration by prelinguistic infants: Perception of an emergent consonant in the McGurk effect. Developmental Psychobiology, 45(4), 204–220. Callan, D. E., Callan, A. M., Kroos, C., & Eric Vatikiotis‐Bateson. (2001). Multimodal contribution to speech perception reveled by independant component analysis: A single sweep EEG case study. Cognitive Brain Research, 10(3), 349–353. Callan, D. E., Jones, J. A., & Callan, A. (2014). Multisensory and modality specific processing of visual speech in different regions of the premotor cortex. Frontiers in Psychology, 5, 389. Callan, D. E., Jones, J. A., Callan, A. M., & Akahane‐Yamada, R. (2004). Phonetic perceptual identification by native‐and second‐language speakers differentially activates brain regions involved with acoustic phonetic processing and those involved with articulatory–auditory/ orosensory internal models. NeuroImage, 22(3), 1182–1194. Callan, D. E., Jones, J. A., Munhall, K., et al. (2003). Neural processes underlying perceptual enhancement by visual speech gestures. Neuroreport, 14(17), 2213–2218. Calvert, G. A, Bullmore, E. T., Brammer, M. J., et al. (1997). Activation of auditory cortex during silent lipreading. Science, 276(5312), 593–596. Campbell, R. (2011). Speechreading: What’s missing. In A. Calder (Ed.), Oxford handbook of face perception (pp. 605–630). Oxford: Oxford University Press. Chandrasekaran, C., Trubanova, A., Stillittano, S., et al. (2009). The natural

48  Sensing Speech statistics of audiovisual speech. PLOS Computational Biology, 5(7), 1–18. Cienkowski, K. M., & Carney, A. E. (2002). Auditory–visual speech perception and aging, Ear and Hearing, 23, 439–449. Colin, C., Radeau, M., Deltenre, P., et al. (2002). The role of sound intensity and stop‐consonant voicing on McGurk fusions and combinations. European Journal of Cognitive Psychology, 14, 475–491. Connine, C. M., & Clifton, C., Jr. (1987). Interactive use of lexical information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13(2), 291–299. Danielson, D. K., Bruderer, A. G., Kandhadai, P., et al. (2017). The organization and reorganization of audiovisual speech perception in the first year of life. Cognitive Development, 42, 37–48. D’Ausilio, A., Bartoli, E., Maffongelli, L., et al. (2014). Vision of tongue movements bias auditory speech perception. Neuropsychologia, 63, 85–91. Delvaux, V., Huet, K., Piccaluga, M., & Harmegnies, B. (2018). The perception of anticipatory labial coarticulation by blind listeners in noise: A comparison with sighted listeners in audio‐only, visual‐ only and audiovisual conditions. Journal of Phonetics, 67, 65–77. Derrick, D., & Gick, B. (2013). Aerotactile integration from distal skin stimuli. Multisensory Research, 26(5), 405–416. Desjardins, R. N., & Werker, J. F. (2004). Is the integration of heard and seen speech mandatory for infants? Developmental Psychobiology, 45(4), 187–203. Dias, J. W., & Rosenblum, L. D. (2011). Visual influences on interactive speech alignment. Perception, 40, 1457–1466. Dias, J. W., & Rosenblum, L. D. (2016). Visibility of speech articulation enhances auditory phonetic convergence. Attention, Perception, & Psychophysics, 78, 317–333.

Diehl, R. L., & Kluender, K. R. (1989). On the objects of speech perception. Ecological Psychology, 1, 121–144. Dorsi, J., Rosenblum, L. D., Dias, J. W., & Ashkar, D. (2016). Can audio‐haptic speech be used to train better auditory speech perception? Journal of the Acoustical Society of America, 139(4), 2016–2017. Dorsi, J., Rosenblum, L. D., & Ostrand, R. (2017). What you see isn’t always what you get, or is it? Reexamining semantic priming from McGurk stimuli. Poster presented at the 58th Meeting of the Psychonomics Society, Vancouver, Canada, November 10. Eberhardt, S. P., Auer, E. T., & Bernstein, L. E. (2014). Multisensory training can promote or impede visual perceptual learning of speech stimuli: Visual‐tactile vs. visual‐auditory training. Frontiers in Human Neuroscience, 8, 1–23. Eskelund, K., MacDonald, E. N., & Andersen, T. S. (2015). Face configuration affects speech perception: Evidence from a McGurk mismatch negativity study. Neuropsychologia, 66, 48–54. Eskelund, K., Tuomainen, J., & Andersen, T. S. (2011). Multistage audiovisual integration of speech: Dissociating identification and detection. Experimental Brain Research, 208(3), 447–457. Fingelkurts, A. A., Fingelkurts, A. A., Krause, C. M., et al. (2003). Cortical operational synchrony during audio– visual speech integration. Brain and Language, 85(2), 297–312. Fowler, C. A. (1986). An event approach to the study of speech perception from a direct‐realist perspective. Journal of Phonetics, 14, 3–28. Fowler, C. A. (2004). Speech as a supramodal or amodal phenomenon. In G. Calvert, C. Spence, & B. E. Stein (Eds), Handbook of multisensory processes (pp. 189–201). Cambridge, MA: MIT Press.

Primacy of Multimodal Speech Perception  49 Fowler, C. A. (2010). Embodied, embedded language use. Ecological Psychology, 22(4), 286–303. Fowler, C. A., Brown, J. M., & Mann, V. A. (2000). Contrast effects do not underlie effects of preceding liquids on stop‐ consonant identification by humans. Journal of Experimental Psychology: Human Perception and Performance, 26(3), 877–888. Fowler, C. A., & Dekle, D. J. (1991). Listening with eye and hand: Cross‐ modal contributions to speech perception. Journal of Experimental Psychology: Human Perception and Performance, 17(3), 816–828. Fuster‐Duran, A. (1996). Perception of conflicting audio‐visual speech: An examination across Spanish and German. In D. G. Stork & M. E. Hennecke (Eds), Speechreading by humans and machines (pp. 135–143). Berlin: Springer. Ganong, W. F. (1980). Phonetic categorization in auditory word perception. Journal of Experimental Psychology: Human Perception and Performance, 6(1), 110–125. Gentilucci, M., & Cattaneo, L. (2005). Automatic audiovisual integration in speech perception. Experimental Brain Research, 167(1), 66–75. Ghazanfar, A. A., Maier, J. X., Hoffman, K. L., & Logothetis, N. K. (2005). Multisensory integration of dynamic faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience, 25(20), 5004–5012. Gibson, J. J. (1966). The senses considered as perceptual systems. Boston: Houghton Mifflin. Gibson, J. J. (1979). The ecological approach to visual perception. Boston: Houghton Mifflin. Gick, B., & Derrick, D. (2009). Aero‐tactile integration in speech perception. Nature, 462(7272), 502–504. Gick, B., Jóhannsdóttir, K. M., Gibraiel, D., & Mühlbauer, J. (2008). Tactile

enhancement of auditory and visual speech perception in untrained perceivers. Journal of the Acoustical Society of America, 123(4), 72–76. Gordon, P. C. (1997). Coherence masking protection in speech sounds: The role of formant synchrony. Perception & Psychophysics, 59, 232–242. Grant, K. W. (2001). The effect of speechreading on masked detection thresholds for filtered speech. Journal of the Acoustical Society of America, 109(5), 2272–2275. Grant, K. W., & Seitz, P. F. (1998). Measures of auditory‐visual integration in nonsense syllables and sentences. Journal of the Acoustical Society of America, 104, 2438–2450. Grant, K. W., & Seitz, P. F. P. (2000). The use of visible speech cues for improving auditory detection of spoken sentences. Journal of the Acoustical Society of America, 108(3), 1197–1208. Green, K. P., & Gerdeman, A. (1995). Cross‐modal discrepancies in coarticulation and the integration of speech information: The McGurk effect with mismatched vowel. Journal of Experimental Psychology: Human Perception and Performance, 21, 1409–1426. Green, K. P., & Kuhl, P. K. (1989). The role of visual information in the processing of. Perception & Psychophysics, 45(1), 34–42. Green, K. P., & Kuhl, P. K. (1991). Integral processing of visual place and auditory voicing information during phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 17, 278–288. Green, K. P., Kuhl, P. K., Meltzoff, A. N., & Stevens, E. B. (1991). Integrating speech information across talkers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception & Psychophysics, 50(6), 524–536. Green, K. P., & Miller, J. L. (1985). On the role of visual rate information in

50  Sensing Speech phonetic perception. Perception & Psychophysics, 38(3), 269–276. Green, K. P., & Norrix, L. W. (2001). Perception of /r/ and /l/ in a stop cluster: Evidence of cross‐modal context effects. Journal of Experimental Psychology: Human Perception and Performance, 27(1), 166–177. Hall, D. A., Fussell, C., & Summerfield, A. Q. (2005). Reading fluent speech from talking faces: Typical brain networks and individual differences. Journal of Cognitive Neuroscience, 17(6), 939–953. Han, Y., Goudbeek, M., Mos, M., & Swerts, M. (2018). Effects of modality and speaking style on Mandarin tone identification by non‐native listeners. Phonetica, 76(4), 263–286. Hardison, D. M (2005). Variability in bimodal spoken language processing by native and nonnative speakers of English: A closer look at effects of speech style. Speech Communication, 46, 73–93. Hazan, V., Sennema, A., Iba, M., & Faulkner, A. (2005). Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English. Speech Communication, 47(3), 360–378. Hertrich, I., Mathiak, K., Lutzenberger, W., & Ackermann, H. (2009). Time course of early audiovisual interactions during speech and nonspeech central auditory processing: A magnetoencephalography study. Journal of Cognitive Neuroscience, 21(2), 259–274. Hessler, D., Jonkers, R., Stowe, L., & Bastiaanse, R. (2013). The whole is more than the sum of its parts: Audiovisual processing of phonemes investigated with ERPs. Brain Language, 124, 213–224. Hickok, G. (2009). Eight problems for the mirror neuron theory of action understanding in monkeys and humans. Journal of Cognitive Neuroscience, 21(7), 1229–1243. Irwin, J., & DiBlasi, L. (2017). Audiovisual speech perception: A new approach and implications for clinical populations.

Language and Linguistics Compass, 11(3), 77–91. Irwin, J. R., Frost, S. J., Mencl, W. E., et al. (2011). Functional activation for imitation of seen and heard speech. Journal of Neurolinguistics, 24(6), 611–618. Ito, T., Tiede, M., & Ostry, D. J. (2009). Somatosensory function in speech perception. Proceedings of the National Academy of Sciences of the United States of America, 106(4), 1245–1248. Jerger, S., Damian, M. F., Tye‐Murray, N., & Abdi, H. (2014). Children use visual speech to compensate for non‐intact auditory speech. Journal of Experimental Child Psychology, 126, 295–312. Jerger, S., Damian, M. F., Tye‐Murray, N., & Abdi, H. (2017). Children perceive speech onsets by ear and eye. Journal of Child Language, 44(1), 185–215. Jesse, A., & Bartoli, M. (2018). Learning to recognize unfamiliar talkers: Listeners rapidly form representations of facial dynamic signatures. Cognition, 176, 195–208. Jiang, J., Alwan, A., Keating, P., et al. (2002). On the relationship between facial movements, tongue movements, and speech acoustics. EURASIP Journal on Applied Signal Processing, 11, 1174–1178. Jiang, J., Auer, E. T., Alwan, A., et al. (2007). Similarity structure in visual speech perception and optical phonetic signals. Perception & Psychophysics, 69(7), 1070–1083. Katz, W. F., & Mehta, S. (2015). Visual feedback of tongue movement for novel speech sound learning. Frontiers in Human Neuroscience, 9, 612. Kawase, T., Sakamoto, S., Hori, Y., et al. (2009). Bimodal audio–visual training enhances auditory adaptation process. NeuroReport, 20, 1231–1234. Kim, J., & Davis, C. (2004). Investigating the audio–visual speech detection advantage. Speech Communication, 44(1), 19–30.

Primacy of Multimodal Speech Perception  51 Lachs, L., & Pisoni, D. B. (2004). Specification of cross‐modal source information in isolated kinematic displays of speech. Journal of the Acoustical Society of America, 116(1), 507–518. Lander, K. & Davies, R. (2008). Does face familiarity influence speechreadability? Quarterly Journal of Experimental Psychology, 61, 961–967. Lidestam, B., Moradi, S., Pettersson, R., & Ricklefs, T. (2014). Audiovisual training is better than auditory‐only training for auditory‐only speech‐in‐noise identification. Journal of the Acoustical Society of America, 136(2), EL142–147. Ma, W. J., Zhou, X., Ross, L. A., et al. (2009). Lip‐reading aids word recognition most in moderate noise: A Bayesian explanation using high‐dimensional feature space. PLOS ONE, 4(3), 1–14. Magnotti, J. F., & Beauchamp, M. S. (2017). A causal inference model explains perception of the McGurk effect and other incongruent audiovisual speech. PLOS Computational Biology, 2017(13), e1005229. Massaro, D. W. (1987). Speech perception by ear and eye: A paradigm for psychological inquiry. Hillsdale, NJ: Lawrence Erlbaum. Massaro, D. W., Cohen, M. M., Gesi, A., et al. (1993). Bimodal speech perception: An examination across languages. Journal of Phonetics, 21, 445–478. Massaro, D. W., & Ferguson, E. L. (1993). Cognitive style and perception: The relationship between category width and speech perception, categorization, and discrimination. American Journal of Psychology, 106(1), 25–49. Massaro, D. W., Thompson, L. A., Barron, B., & Laron, E. (1986). Developmental changes in visual and auditory contributions to speech perception, Journal of Experimental Child Psychology, 41, 93–113. Matchin, W., Groulx, K., & Hickok, G. (2014). Audiovisual speech integration does not rely on the motor system:

Evidence from articulatory suppression, the McGurk effect, and fMRI. Journal of Cognitive Neuroscience, 26(3), 606–620. McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. Ménard, L., Cathiard, M. A., Troille, E., & Giroux, M. (2015). Effects of congenital visual deprivation on the auditory perception of anticipatory labial coarticulation. Folia Phoniatrica et Logopaedica, 67(2), 83–89. Ménard, L., Dupont, S., Baum, S. R., & Aubin, J. (2009). Production and perception of French vowels by congenitally blind adults and sighted adults. Journal of the Acoustical Society of America, 126(3), 1406–1414. Ménard, L., Leclerc, A., & Tiede, M. (2014). Articulatory and acoustic correlates of contrastive focus in congenitally blind adults and sighted adults. Journal of Speech, Language, and Hearing Research, 57(3), 793–804. Ménard, L., Toupin, C., Baum, S. R., et al. (2013). Acoustic and articulatory analysis of French vowels produced by congenitally blind adults and sighted adults. Journal of the Acoustical Society of America, 134(4), 2975–2987. Miller, B. T., & D’Esposito, M. (2005). Searching for “the top” in top‐down control. Neuron, 48(4), 535–538. Miller, R., Sanchez, K., & Rosenblum, L. (2010). Alignment to visual speech information. Attention, Perception, & Psychophysics, 72(6), 1614–1625. Mitterer, H., & Reinisch, E. (2017). Visual speech influences speech perception immediately but not automatically. Attention, Perception, & Psychophysics, 79(2), 660–678. Moradi, S., Lidestam, B., Ng, E. H. N., et al. (2019). Perceptual doping: An audiovisual facilitation effect on auditory speech processing, from phonetic feature extraction to sentence identification in noise. Ear and Hearing, 40(2), 312–327.

52  Sensing Speech Munhall, K. G., Ten Hove, M. W., Brammer, M., & Paré, M. (2009). Audiovisual integration of speech in a bistable illusion. Current Biology, 19(9), 735–739. Munhall, K. G., & Vatikiotis‐Bateson, E. (2004). Spatial and temporal constraints on audiovisual speech perception. In G. A. Calvert, C. Spence, & B. E. Stein (Eds), Handbook of multisensory processes (pp. 177–188) Cambridge, MA: MIT Press. Münte, T. F., Stadler, J., Tempelmann, C., & Szycik, G. R. (2012). Examining the McGurk illusion using high‐field 7 Tesla functional MRI. Frontiers in Human Neuroscience, 6, 95. Musacchia, G., Sams, M., Nicol, T., & Kraus, N. (2006). Seeing speech affects acoustic information processing in the human brainstem. Experimental Brain Research, 168(1–2), 1–10. Namasivayam, A. K., Wong, W. Y. S., Sharma, D., & van Lieshout, P. (2015). Visual speech gestures modulate efferent auditory system. Journal of Integrative Neuroscience, 14(1), 73– 83. Nath, A. R., & Beauchamp M. S. (2012). A neural basis for interindividual differences in the McGurk effect: A multisensory speech illusion. NeuroImage, 59(1), 781–787. PubMed. Navarra, J., & Soto‐Faraco, S. (2007). Hearing lips in a second language: Visual articulatory information enables the perception of second language sounds. Psychological Research, 71, 4–12. Nishitani, N., & Hari, R. (2002). Viewing lip forms: Cortical dynamics. Neuron, 36(6), 1211–1220. Nygaard, L. C. (2005). The integration of linguistic and non‐linguistic properties of speech. In D. Pisoni & R. Remez (Eds), Handbook of speech perception (pp. 390– 414). Oxford: Blackwell. Olson, I. R., Gatenby, J., & Gore, J. C. (2002). A comparison of bound and unbound audio–visual information processing in the human cerebral cortex. Cognitive Brain Research, 14, 129–138.

Ostrand, R., Blumstein, S. E., Ferreira, V. S., & Morgan, J. L. (2016). What you see isn’t always what you get: Auditory word signals trump consciously perceived words in lexical access. Cognition, 151, 96–107. Palmer, T. D., & Ramsey, A. K. (2012). The function of consciousness in multisensory integration. Cognition, 125(3), 353–364. Papale, P., Chiesi, L., Rampinini, A. C., et al. (2016). When neuroscience “touches” architecture: From hapticity to a supramodal functioning of the human brain. Frontiers in Psychology, 7(631), 866. Pardo, J. S. (2006). On phonetic convergence during conversational interaction. Journal of the Acoustical Society of America, 119(4), 2382–2393. Pardo, J. S., Gibbons, R., Suppes, A., & Krauss, R. M. (2012). Phonetic convergence in college roommates. Journal of Phonetics, 40(1), 190–197. Pardo, J. S., Jordan, K., Mallari, R., et al. (2013). Phonetic convergence in shadowed speech: The relation between acoustic and perceptual measures. Journal of Memory and Language, 69(3), 183–195. Pardo, J. S., Urmanche, A., Wilman, S., & Wiener, J. (2017). Phonetic convergence across multiple measures and model talkers. Attention, Perception, & Psychophysics, 79(2), 637–659. Pascual‐Leone, A., & Hamilton, R. (2001). The metamodal organization of the brain. Progress in Brain Research, 134, 427–445. Paulesu, E., Perani, D., Blasi, V., et al. (2003). A functional‐anatomical model for lipreading. Journal of Neurophysiology, 90(3), 2005–2013. Pekkola, J., Ojanen, V., Autti, T., et al. (2005). Primary auditory cortex activation by visual speech: An fMRI study at 3 T. Neuroreport, 16(2), 125–128. Pilling, M., & Thomas, S. (2011). Audiovisual cues and perceptual

Primacy of Multimodal Speech Perception  53 learning of spectrally distorted speech. Language and Speech, 54(4), 487–497. Plass, J., Guzman‐Martinez, E., Ortega, L., et al. (2014). Lip reading without awareness. Psychological Science, 25(9), 1835–1837. Reich, L., Maidenbaum, S., & Amedi, A. (2012). The brain as a flexible task machine: Implications for visual rehabilitation using noninvasive vs. invasive approaches. Current Opinion in Neurobiology, 25(1), 86–95. Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to understand: A lip‐reading advantage with intact auditory stimuli. In B. Dodd & R. Campbell (Eds), Hearing by eye: The psychology of lip‐reading (pp. 97–113). Hillsdale, NJ: Lawrence Erlbaum. Remez, R. E., Beltrone, L. H., & Willimetz, A. A. (2017). Effects of intrinsic temporal distortion on the multimodal perceptual organization of speech. Paper presented at the 58th Annual Meeting of the Psychonomic Society, Vancouver, British Columbia, November. Remez, R. E., Fellowes, J. M., & Rubin, P. E. (1997). Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, 23(3), 651–666. Ricciardi, E., Bonino, D., Pellegrini, S., & Pietrini, P. (2014). Mind the blind brain to understand the sighted one! Is there a supramodal cortical functional architecture? Neuroscience & Biobehavioral Reviews, 41, 64–77. Riedel, P., Ragert, P., Schelinski, S., et al. (2015). Visual face‐movement sensitive cortex is relevant for auditory‐only speech recognition. Cortex, 68, 86–99. Rosen, S. M., Fourcin, A. J., & Moore, B. C. (1981). Voice pitch as an aid to lipreading. Nature, 291(5811), 150. Rosenblum, L. D. (2005). Primacy of multimodal speech perception. In D. Pisoni & R. Remez (Eds), Handbook of

speech perception (pp. 51–78). Oxford: Blackwell. Rosenblum, L. D. (2008). Speech perception as a multimodal phenomenon. Current Directions in Psychological Science, 17(6), 405–409. Rosenblum, L. D. (2013). A confederacy of senses. Scientific American, 308, 72–75. Rosenblum, L. D. (2019). Audiovisual speech perception and the McGurk effect. In Oxford research encyclopedia of linguistics. https://oxfordre.com/ linguistics/view/10.1093/ acrefore/9780199384655.001.0001/acrefor e‐9780199384655‐e‐420?rskey=L7JvON&r esult=1 Rosenblum, L. D., Dias, J. W., & Dorsi, J. (2017). The supramodal brain: Implications for auditory perception. Journal of Cognitive Psychology, 5911, 1–23. Rosenblum, L. D., Dorsi, J., & Dias, J. W. (2016). The impact and status of Carol Fowler’s supramodal theory of multisensory speech perception. Ecological Psychology, 28, 262–294. Rosenblum, L. D., Miller, R. M., & Sanchez, K. (2007). Lip‐read me now, hear me better later: Cross‐modal transfer of talker‐familiarity effects. Psychological Science, 18(5), 392–396. Rosenblum, L. D., & Saldaña, H. M. (1992). Discrimination tests of visually‐ influenced syllables. Perception & Psychophysics, 52(4), 461–473. Rosenblum, L. D., & Saldaña, H. M. (1996). An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 22(2), 318–331. Rosenblum, L. D., Schmuckler, M. A., & Johnson, J. A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59(3), 347–357. Rosenblum, L. D., Yakel, D. A., Baseer, N., et al. (2002). Visual speech information for face recognition. Perception & Psychophysics, 64(2), 220–229.

54  Sensing Speech Rosenblum, L. D., Yakel, D. A., & Green, K. G. (2000). Face and mouth inversion affects on visual and audiovisual speech perception. Journal of Experimental Psychology: Human Perception and Performance, 26(3), 806–819. Sams, M., Manninen, P., Surakka, V., et al. (1998). McGurk effect in Finnish syllables, isolated words, and words in sentences: Effects of word meaning and sentence context. Speech Communication, 26(1–2), 75–87. Sanchez, K., Dias, J. W., & Rosenblum, L. D. (2013). Experience with a talker can transfer across modalities to facilitate lipreading. Attention, Perception & Psychophysics, 75, 1359–1365. Sanchez, K., Miller, R. M., & Rosenblum, L. D. (2010). Visual influences on alignment to voice onset time. Journal of Speech, Language, and Hearing Research, 53, 262–272. Santi, A., Servos, P., Vatikiotis‐Bateson, E., et al. (2003). Perceiving biological motion: Dissociating visible speech from walking. Journal of Cognitive Neuroscience, 15(6), 800–809. Sato, M., Buccino, G., Gentilucci, M., & Cattaneo, L. (2010). On the tip of the tongue: Modulation of the primary motor cortex during audiovisual speech perception. Speech Communication, 52(6), 533–541. Sato, M., Cavé, C., Ménard, L., & Brasseur, A. (2010). Auditory‐tactile speech perception in congenitally blind and sighted adults. Neuropsychologia, 48(12), 3683–3686. Schall, S., & von Kriegstein, K. (2014). Functional connectivity between face‐ movement and speech‐intelligibility areas during auditory‐only speech perception. PLOS ONE, 9(1), 1–11. Schelinski, S., Riedel, P., & von Kriegstein, K. (2014). Visual abilities are important for auditory‐only speech recognition: Evidence from autism

3 How Does the Brain Represent Speech?
OIWI PARKER JONES, University of Oxford, United Kingdom
JAN W. H. SCHNUPP, City University of Hong Kong, Hong Kong

Introduction

In this chapter, we provide a brief overview of how the brain's auditory system represents speech. The topic is vast: many decades of research on the subject have generated several books' worth of insight into this fascinating question, and getting up close and personal with this subject matter necessitates a fair bit of background knowledge about neuroanatomy and physiology, as well as acoustics and linguistic sciences. Providing a reasonably comprehensive overview of the topic that is accessible to a wide readership, within a short chapter, is a near‐impossible task, and we apologize in advance for the shortcomings that this chapter will inevitably have. With these caveats and without further ado, let us jump right in and begin by examining the question: What is there to 'represent' in a speech signal?

The word representation is quite widely used in sensory neuroscience, but it is rarely clearly defined. A neural representation tends to refer to the manner in which neural activity patterns encode or process some key aspects of the sensory world. Of course, if we want to understand how the brain listens to speech, then grasping how neural activity in early stages of the nervous system encodes speech sounds is really only a very small part of what we would ideally like to understand. It is a necessary first step that leaves many interesting questions unanswered, as you can easily appreciate if you consider that fairly simple technological devices such as telephone lines are able to represent speech with patterns of electrical activity, but these devices tell us relatively little about what it means for a brain to hear speech.


How Does the Brain Represent Speech?  59 Phone lines merely have to capture enough of the physical parameters of an acoustic waveform to allow the resynthesis of a sufficiently similar acoustic waveform to facilitate comprehension by another person at the other end of the line. Brains, in contrast, don’t just deliver signals to a mind at the other end of the line, but they have to make the mind at the other end of the line, and to do that they have to try to learn something from the speech signal about who speaks, where they might be, what mood they are in, and, most importantly, the ideas the speaker is trying to communicate. Consequently it would be nice to know how the brain represents not just the acoustics, but also the phonetic, prosodic, and semantic features of the speech it hears. Readers of this volume are likely to be well aware that extracting such higher‐ order features from speech signals is difficult and intricate. Once the physical aspects of the acoustic waveform are encoded, phonetic properties such as formant frequencies, voicing, and voice pitch must be inferred, interpreted, and classified in a context‐dependent manner, which in turn facilitates the creation of a semantic representation of speech. In the auditory brain, this occurs along a processing hierarchy, where the lowest levels of the auditory nervous system – the inner ear, auditory nerve fibers and brainstem – encode the physical attributes of the sound and compute what may be described as low‐level features, which are then passed on via the midbrain and the thalamus toward an extensive network of auditory and multisensory cortical areas, whose task it is to form phonetic and semantic representations. As this chapter progresses, we will look in some detail at this progressive transformation of an initially largely acoustic representation of speech sounds in the auditory nerve, brainstem, midbrain, and primary cortex to an increasingly linguistic feature representation in a part of the brain called the superior temporal gyrus, and finally to semantic representations in brain areas stretching well beyond those classically thought of as auditory structures. While it is apt to think of this neural speech‐processing stream as a hierarchical process, it would nevertheless be wrong to think of it as entirely a feed‐forward process. It is well known that, for each set of ascending nerve fibers carrying auditory signals from the inner ear to the brainstem, from brainstem to midbrain, from midbrain to thalamus, and from thalamus to cortex, there is a parallel descending pathway going from cortex back to thalamus, midbrain, brainstem and all the way back to the ear. This is thought to allow feedback signals to be sent in order to focus attention and to make use of the fact that the rules of language make the temporal evolution of speech sounds partly predictable, and such predictions can facilitate hearing speech in noise, or to tune the ear to the voice or dialect of a particular speaker. To orient the readers who are unfamiliar with the neuroanatomy of the auditory pathway we include a sketch in Figure 3.1, which shows the approximate location of some of the key stages of the early parts of this pathway, from the ear to the primary auditory cortex, projected onto a drawing of a frontal section through the brain running vertically roughly through the middle of the ear canals. The arrows in Figure  3.1 show the principal connections along the main, ­‘lemniscal’, ascending auditory pathway. 
Note, however, that it is impossible to

[Figure 3.1 labels: outer ear, ear canal, eardrum, malleus, incus, stapes, cochlea, Eustachian tube, temporal bone, auditory nerve, cochlear nuclei, superior olive, inferior colliculus, medial geniculate, thalamus, primary auditory cortex, cerebral cortex.]

Figure 3.1  Illustration of the ear showing the early stages of the ascending auditory pathway.

overstate the extent to which Figure 3.1 oversimplifies the richness and complexity of the brain’s auditory pathways. For example, the cochlear nuclei, the first auditory relay station receiving input from the ear, has no fewer than three anatomical subdivisions, each comprising many tens to a few hundred thousand neurons of different cell types and with different onward connections. Here we show the output neurons of the cochlear nuclei as projecting to the superior olive bilaterally, which is essentially correct, but for simplicity we omit the fact that the superior olive itself is composed of around half a dozen intricately interconnected subnuclei, and that there are also connections from the cochlear nuclei which bypass the superior olive and connect straight to the inferior colliculus, the major auditory‐processing center of the midbrain. The inferior colliculus too has several subdivisions, as does the next station on the ascending pathway, the medial geniculate body of the thalamus. And even primary auditory cortex is thought to have

How Does the Brain Represent Speech?  61 two or three distinct subfields, depending on which mammalian species one looks at and which anatomist one asks. In order not to clutter the figure we do not show any of the descending connections, but it would not be a bad start to think of each of the arrows here as going in both directions. The complexity of the anatomy is quite bewildering, and much remains unknown about the detailed structure and function of its many subdivisions. But we have nevertheless learned a great deal about these structures and the physiological mechanisms that are at work within them and that underpin our ability to hear speech. Animal experiments have been invaluable in elucidating basic physiological mechanisms of sound encoding, auditory learning, and pattern classification in the mammalian brain. Clinical studies on patients with various forms of hearing impairment or aphasia have also helped to identify key cortical structures. More recently, functional brain imaging on normal volunteers, as well as invasive electrophysiological recordings from the brains of patients who are undergoing brain surgery for epilepsy have further refined our knowledge of speech representations, particularly in higher‐order cortical structures. In the sections that follow we shall highlight some of the insights that have been gained from these types of studies. The chapter will be structured as a journey: we shall accompany speech sounds as they leave the vocal tract of a speaker, enter the listener’s ear, become encoded as trains of nerve impulses in the cochlea and auditory nerve, and then travel along the pathways just described and spread out across a phenomenally intricate network of hundreds of millions of neurons whose concerted action underpins our ability to perform the everyday magic of communicating abstract thoughts across space and time through the medium of the spoken word.

Encoding of speech in the inner ear and auditory nerve

Let us begin our journey by reminding ourselves about how speech sounds are generated, and what acoustic features are therefore elementary aspects of a speech sound that need to be encoded. When we speak, we produce both voiced and unvoiced speech sounds. Voiced speech sounds arise when the vocal folds in our larynx open and close periodically, producing a rapid and periodic glottal pulse train which may vary from around 80 Hz for a low bass voice to 900 Hz or above for a high soprano voice, although glottal pulse rates of somewhere between 125 Hz and 300 Hz are most common for adult speech. Voiced speech sounds include vowels and voiced consonants. Unvoiced sounds are simply those that are not produced with any vibrating of the vocal folds. The manner in which they are created causes unvoiced speech sounds to have spectra typical of noise, while the spectra of voiced speech sounds exhibit a harmonic structure, with regular sharp peaks at frequencies corresponding to the overtones of the glottal pulse train. Related to these differences in the waveforms and spectra is the fact that, perceptually, unvoiced speech sounds do not have an identifiable pitch, while voiced speech sounds do have a clear pitch of a height that corresponds to their

62  Sensing Speech fundamental frequency, which corresponds to the glottal pulse rate. Thus, we can sing melodies with voiced speech sounds, but we cannot whisper a melody. When we speak, the different types of sound sources, whether unvoiced noises or voiced harmonic series, are shaped by resonances in the vocal tract, which we must deftly manipulate by dynamically changing the volume and the size of the openings of a number of cavities in our throat, mouth, and nose, which we do by articulatory movements of the jaw, soft palate, tongue, and lips. The resonances in our vocal tracts impose broad spectral peaks on the spectra of the speech sounds, and these broad spectral peaks are known as formants. The dynamic pattern of changing formant frequencies encodes the lion’s share of the semantic information in speech. Consequently, to interpret a speech stream that arrives at our ears, one might think that our ears and brains will chiefly need to examine the incoming sounds for broad peaks in the spectrum to identify formants. But, to detect voicing and to determine voice pitch, the brain must also look either for sharp peaks at regular intervals in the spectrum to identify harmonics or, alternatively, for periodicities in the temporal waveform. Pitch information provided by harmonicity or, equivalently, periodicity is a vital cue to help identify speakers, gain prosodic information, or determine the tone of a vowel in tonal languages like Chinese or Thai, which use pitch contours to distinguish between otherwise identical homophonic syllables. Encoding information about these fundamental features, formants, and harmonicity or periodicity, is thus an essential job of the inner ear and auditory nerve. They do this as they translate incoming sound waveforms into a tonotopically organized pattern of neural activity, which represents differences in acoustic energy across frequency bands by means of a so‐called rate–place code. Nerve fibers that are tuned to systematically different preferred, or characteristic, frequencies are arranged in an orderly array. Differences in firing rates across the array encode peaks and valleys in the frequency spectrum, which conveys information about formants and, to a lesser extent, harmonics. This concept of tonotopy is quite central to the way all sounds, not just speech sounds, are usually thought to be represented along the lemniscal auditory pathway. All the stations of the lemniscal auditory pathway shown in Figure 3.1, from the cochlea to the primary auditory cortex, contain at least one, and sometimes several tonotopic maps, that is, arrays of frequency‐tuned neurons arranged in a systematic array from low to high preferred frequency. It is therefore worth examining this notion of tonotopy in some detail to understand its origin, and to ask what tonotopy can and cannot do to represent fundamental features of speech. In the mammalian brain, tonotopy arises quite naturally from the way in which sounds are transduced into neural responses by the basilar membrane and organ of corti in the cochlea. When sounds are transmitted from the ear drum to the inner ear via the ossicles, the mechanical vibrations are transmitted to the basilar membrane via fluid‐filled chambers of the inner ear. The basilar membrane itself has a stiffness gradient, being stiff at the basal end, near the ossicles, and floppy at the far end, the apex. 
Sounds transmitted through the far end encounter little mechanical resistance from the stiffness of the basilar membrane, but have to displace a longer and more inert column of fluid in the inner ear. Sounds traveling through the near end face less

How Does the Brain Represent Speech?  63 inertia, but more stiffness. The upshot of this is that one can think of the basilar membrane as a bank of mechanical spring‐mass filters, with filters tuned to high frequencies at the base, and to increasingly lower frequencies toward the apex. Tiny, highly sensitive hair cells that sit on the basilar membrane then pick up these frequency‐filtered vibrations and translate them into electrical signals, which are then encoded as trains of nerve impulses (also called action potentials or spikes) in the bundles of auditory nerve fibers that connect the inner ear to the brain. Thus, each nerve fiber in the auditory nerve is frequency tuned, and the sound frequency it is most sensitive to is known as its characteristic frequency (CF). The cochlea, and the basilar membrane inside it, is curled up in a spiral, and the organization of the auditory nerve mirrors that of the basilar membrane: inside it we have something that could be described as a rate–place code for sounds, where the amount of sound energy at the lowest audible frequencies (around 50 Hz) is represented by the firing rates of nerve fibers right at the center, and increasingly higher frequencies are encoded by nerve fibers that are arranged in a spiral around that center. Once the auditory nerve reaches the cochlear nuclei, this orderly spiral arrangement unwraps to project systematically across the extent of the nuclei, creating tonotopic maps, which are then passed on up the auditory pathway by orderly anatomical connections from one station to the next. What this means for the encoding of speech in the early auditory system is that formant peaks of speech sounds, and maybe also the peaks of harmonics, should be represented by systematic differences in firing rates across the tonotopic array. The human auditory nerve contains about 30,000 such nerve fibers, each capable of firing anywhere between zero and several hundred spikes a second. So there are many hundreds of thousands of nerve impulses per second available to represent the shape of the sound spectrum across the tonotopic array. And, indeed, there is quite a lot of experimental evidence that systematic firing‐rate differences across this array of nerve fibers is not a bad first‐order approximation of what goes on in the auditory system (Delgutte, 1997), but, as so often in neurobiology, the full story is a lot more complicated. Thanks to decades of physiological and anatomical studies on experimental animals by dozens of teams, the mechanisms of sound encoding in the auditory nerve are now known in sufficient detail that it has become possible to develop computer models that can predict the activity of auditory nerve fibers to arbitrary sound inputs (Zhang et al., 2001; Heinz, Colburn, & Carney, 2002; Sumner et al., 2002; Meddis & O’Mard, 2005; Zhang & Carney, 2005; Ferry & Meddis, 2007), and here we shall use the model of Zilany, Bruce, and Carney (2014) to look at the encoding of speech sounds in the auditory nerve in a little more detail. The left panel of Figure  3.2 shows the power spectrum of a recording of the spoken vowel [ɛ], as in head (IPA [hɛːd]). The spectrum shows many sharp peaks at multiples of about 145 Hz – the harmonics of the vowel. These sharp peaks ride on top of broad peaks centered around 500, 1850, and 2700 Hz – the formants of the vowel. 
The right panel of the figure shows the distribution of firing rates of low spontaneous rate (LSR) auditory nerve fibers in response to the same vowel, according to the auditory nerve fiber model by Zilany, Bruce, and Carney (2014).
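To make these ingredients concrete, readers who like to experiment may find the following minimal Python sketch useful. It is not the Zilany, Bruce, and Carney (2014) model: it simply builds a crude [ɛ]-like vowel by passing a 145 Hz glottal pulse train through three resonators placed at the formant frequencies quoted above (the sampling rate, formant bandwidths, and gains are arbitrary choices made purely for illustration) and then checks that the resulting spectrum contains sharp harmonic peaks riding on broad formant peaks.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                              # sampling rate (Hz); arbitrary choice
    f0 = 145.0                              # glottal pulse rate from the example above
    formants = [500.0, 1850.0, 2700.0]      # formant frequencies quoted in the text
    bandwidths = [80.0, 120.0, 160.0]       # assumed formant bandwidths (illustrative)

    # Voiced source: one unit impulse per glottal cycle, for half a second.
    n = int(0.5 * fs)
    source = np.zeros(n)
    source[::int(round(fs / f0))] = 1.0

    # Shape the source with three second-order resonators in cascade,
    # a crude stand-in for the vocal-tract resonances.
    vowel = source
    for fc, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)                 # pole radius sets the bandwidth
        theta = 2 * np.pi * fc / fs                  # pole angle sets the center frequency
        vowel = lfilter([1.0 - r], [1.0, -2 * r * np.cos(theta), r ** 2], vowel)

    # The magnitude spectrum shows sharp peaks at multiples of f0 (the harmonics)
    # riding on broad peaks around the formant frequencies.
    spectrum = np.abs(np.fft.rfft(vowel * np.hanning(n)))
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    for fc in formants:
        band = (freqs > fc - 300) & (freqs < fc + 300)
        peak = freqs[band][np.argmax(spectrum[band])]
        print(f"strongest component near the {fc:.0f} Hz formant: "
              f"{peak:.1f} Hz (harmonic number {peak / f0:.1f})")

Replacing the pulse train with white noise would produce a whispered version of the same vowel: the broad formant peaks would remain, but the harmonic fine structure, and with it the pitch, would disappear.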

[Figure 3.2 panels: left, 'Vowel [ɛ] – spectrum' (dB versus frequency, Hz); right, 'LSR fiber responses' (spikes/s versus CF, Hz).]

Figure 3.2  A power spectrum representing an instantaneous spectrogram (left) and a simulated distribution of firing rates for an auditory nerve fiber (right) for the vowel [ɛ] in head [hɛːd].

Along the x‐axis we plot the CF of each nerve fiber, and along the y‐axis the average number of spikes the fiber would be expected to fire per second when presented with the vowel [ɛ] at a sound level of 65 dB SPL (sound pressure level), the sort of sound level that would be typical during a calm conversation with a quiet background. Comparing the spectrogram on the left with the distribution of firing rates on the right, it is apparent that the broad peaks of the formants are well reflected in the firing rate distribution, if anything perhaps more visibly than in the spectrum, but that most of the harmonics are not. Indeed, only the lowest three harmonics are visible; the others have been ironed out by the fact that the frequency tuning of cochlear filters is often broad compared to the frequency interval between individual harmonics, and becomes broader for higher frequencies. Only the very lowest harmonics are therefore resolved by the rate–place code of the tonotopic nerve fiber array, and we should think of tonotopy as well adapted to representing formants but poorly adapted to representing pitch or voicing information. If you bear in mind that many telephones will high‐pass filter speech at 300 Hz, thereby effectively cutting off the lowest harmonic peak, there really is not much information about the harmonicity of the sound left reflected in the tonotopic firing‐rate distribution. But there are important additional cues to voicing and pitch, as we shall see shortly.

The firing rates of auditory nerve fibers increase monotonically with increasing sound level, but these fibers do need a minimum‐threshold sound level, and they cannot increase their firing rates indefinitely when sounds keep getting louder. This gives auditory nerve fibers a limited dynamic range, which usually covers 50 dB or less. At the edges of the dynamic range, the formants of speech sounds

How Does the Brain Represent Speech?  65 cannot be effectively represented across the tonotopic array because the neurons in the array either fire not at all (or not above their spontaneous firing rates), or because they all fire as fast as they can. However, people can usually understand speech well over a very broad range of sound levels. To be able to code sounds effectively over a wide range of sound levels, the ear appears to have evolved different types of auditory nerve fibers, some of which specialize in hearing quiet sounds, with low thresholds but also relatively low‐saturation sound levels, and others of which specialize in hearing louder sounds, with higher thresholds and higher saturation levels. Auditory physiologists call the more sensitive of these fiber types high spontaneous rate (HSR) fibers, given that these auditory nerve fibers may fire nerve impulses at fairly elevated rates (some 30 spikes per second or so), even in the absence of any external sound, and the less sensitive fibers the LSR fibers, which we have already encountered, and which fire only a handful of spikes per second in the absence of sound. There are also medium spontaneous rate fibers, which, as you might expect, lie in the middle between HSR and LSR fibers in sensitivity and spontaneous activity. You may, of course, wonder why these auditory nerve fibers would fire any impulses if there is no sound to encode, but it is worth bearing in mind that the amount of physical energy in relatively quiet sounds is minuscule, and that the sensory cells that need to pick up those sounds cannot necessarily distinguish a very quiet external noise from internal physiological noise that comes simply from blood flow or random thermal motion inside the ear at body temperature. Auditory nerve fibers operate right at the edge of this physiological noise floor, and the most sensitive cells are also most sensitive to the physiological background noise, which gives rise to their high spontaneous firing rate.
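The complementary operating ranges of the two fiber types are easy to caricature in code. The toy rate–level functions below are not fitted to physiological data – the thresholds, dynamic ranges, spontaneous rates, and maximum rates are all invented for illustration – but they make the qualitative point that a sensitive HSR fiber has largely saturated at levels where an LSR fiber is only beginning to modulate its firing rate.

    import numpy as np

    def rate_level(level_db, threshold_db, dynamic_range_db, spont_rate, max_rate):
        """Toy sigmoid rate-level function: spontaneous firing below threshold,
        saturation once the level exceeds threshold + dynamic range."""
        x = (level_db - threshold_db - dynamic_range_db / 2) / (dynamic_range_db / 8)
        driven = 1.0 / (1.0 + np.exp(-x))       # logistic ramp across the dynamic range
        return spont_rate + (max_rate - spont_rate) * driven

    levels = np.array([25, 45, 65, 85])         # dB SPL, including the levels of Figure 3.3

    # Invented parameters: HSR fibers are sensitive but saturate early, while
    # LSR fibers need louder sounds but keep encoding level where HSR fibers cannot.
    hsr = rate_level(levels, threshold_db=5.0, dynamic_range_db=40.0,
                     spont_rate=30.0, max_rate=250.0)
    lsr = rate_level(levels, threshold_db=30.0, dynamic_range_db=50.0,
                     spont_rate=1.0, max_rate=150.0)

    for level, h, l in zip(levels, hsr, lsr):
        print(f"{level:2d} dB SPL:  HSR ≈ {h:5.1f} spikes/s   LSR ≈ {l:5.1f} spikes/s")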

[Figure 3.3 panels: left, 'LSR fibers'; right, 'HSR fibers'; spikes/s versus CF (Hz), with separate curves for 25, 65, and 85 dB SPL.]

Figure 3.3  Firing‐rate distributions in response to the vowel [ɛ] in head [hɛːd] for low spontaneous rate fibers (left) and high spontaneous rate fibers (right) at three different sound intensities.

To give you a sense of what these different auditory nerve fiber types contribute to speech representations at different sound levels, Figure 3.3 shows the firing‐rate distributions for the vowel [ɛ], much as in the right panel of Figure 3.2, but at three different sound levels (from a very quiet 25 dB SPL to a pretty loud 85 dB SPL), and for both LSR and HSR populations. As you can see, the LSR fibers (left panel) hardly respond at all at 25 dB, but the HSR fibers show clear peaks at the formant frequencies already at those very low sound levels. However, at the loud sound levels, most of the HSR fibers saturate, meaning that most of them are firing as fast as they can, so that the valleys between the formant peaks begin to disappear.

One interesting consequence of this division of labor between HSR and LSR fibers for representing speech at low and high sound levels respectively is that it may provide an explanation of why some people, particularly among the elderly, may complain of an increasing inability to understand speech in situations with high background noise. Recent work by Kujawa and Liberman (2015) has shown that, perhaps paradoxically, the less sound‐sensitive LSR fibers are actually more likely to be damaged during prolonged noise exposure. Patients with such selective fiber loss would still be able to hear quiet sounds quite well because their HSR fibers are intact, but they would find it very difficult to resolve sounds at high sound levels when HSR fibers are saturating and the LSR fibers that should encode spectral contrast at these high levels are missing. It has long been recognized that our ability to hear speech in noise tends to decline with age, even in those elderly who are lucky enough to retain normal auditory sensitivity (Stuart & Phillips, 1996), and it has been suggested that cumulative noise‐induced damage to LSR fibers such as that described by Kujawa and Liberman in their mouse model may pinpoint a possible culprit. Such hidden hearing loss, which is not detectable with standard audiometric hearing tests that measure sensitivity to probe tones in quiet, can be a significant problem, for example by taking all the fun out of important social occasions, such as lively parties and get‐togethers, which leads to significant social isolation. However, some recent studies have looked for, but failed to find, a clear link between greater noise exposure and poorer reception of speech in noise (Grinn et al., 2017; Grose, Buss, & Hall, 2017), which would suggest that perhaps the decline in our ability to understand speech in noise as we age may be more to do with impaired representations of speech in higher cortical centers than with impaired auditory nerve representations.

Of course, when you listen to speech, you don't really want to have to ask yourself whether, given the current ambient sound levels, you should be listening to your HSR or your LSR auditory nerve fibers in order to get the best representation of speech formants, and one of the jobs of the auditory brainstem and midbrain circuitry is to combine information across these nerve fiber populations so that representations at midbrain and cortical stations will automatically adapt to changes both in mean sound level and in sound‐level contrast or variability, so that features like formants are efficiently encoded whatever the current acoustic environment happens to be (Dean, Harper, & McAlpine, 2005; Rabinowitz et al., 2013; Willmore et al., 2016).
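The general idea of that adaptive recoding – though emphatically not the specific models tested in the studies just cited – can be sketched in a few lines. In the toy illustration below, each sample of an input level trace is re-expressed relative to a running estimate of the recent mean level and level variability, so the same output range is used whether the acoustic scene is quiet and flat or loud and strongly modulated.

    import numpy as np

    def adapt(levels_db, window=200):
        """Toy adaptation: express each sample relative to running estimates of
        the local mean level and the local level variability (contrast)."""
        out = np.zeros(len(levels_db))
        for i in range(len(levels_db)):
            recent = levels_db[max(0, i - window):i + 1]
            out[i] = (levels_db[i] - recent.mean()) / (recent.std() + 1e-6)
        return out

    rng = np.random.default_rng(0)
    quiet_scene = 40 + 5 * rng.standard_normal(1000)     # low mean level, low contrast
    loud_scene = 80 + 15 * rng.standard_normal(1000)     # high mean level, high contrast
    trace = np.concatenate([quiet_scene, loud_scene])

    adapted = adapt(trace)
    # After a brief transient, the adapted trace has roughly the same mean and spread
    # in both scenes, even though the raw levels differ by some 40 dB.
    print("raw means:    ", round(trace[:1000].mean(), 1), round(trace[1000:].mean(), 1))
    print("adapted means:", round(adapted[500:1000].mean(), 2), round(adapted[1500:].mean(), 2))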

As we saw earlier, the tonotopic representation of speech‐sound spectra in the auditory nerve provides much information about speech formants, but not a great deal about harmonics, which would reveal voicing or voice pitch. We probably owe much of our ability to nevertheless hear voicing and pitch easily and with high accuracy to the fact that, in addition to the small number of resolved harmonics, the auditory nerve delivers a great deal of so‐called temporal fine structure information to the brain. To appreciate what is meant by that, consider Figure 3.4, which shows the waveform (top), a spectrogram (middle), and an auditory nerve neurogram display (bottom) for a recording of the spoken word 'head.' The neurogram was produced by computing firing rates of a bank of LSR auditory nerve fibers in response to the sound as a function of time using the model by Zilany, Bruce, and Carney (2014). The waveform reveals the characteristic

[Figure 3.4 panels: '[hɛːd] – sound waveform' (amplitude, Pa), 'Spectrogram' (frequency, kHz), and 'LSR nerve fibre responses' (CF, kHz), each plotted against time (ms).]

Figure 3.4  Waveform (top), spectrogram (middle), and simulated LSR auditory nerve‐fiber neurogram (bottom) of the spoken word head [hɛːd].

68  Sensing Speech energy arc remarked upon by Greenberg (2006) for spoken syllables, with a relatively loud vowel flanked by much quieter consonants. The voicing in the vowel is manifest in the large sound‐pressure amplitude peaks, which arise from the glottal pulse train at regular intervals of approximately 7 ms, that is at a rate of approximately 140 Hz. This voice pitch is also reflected in the harmonic stack in the spectrogram, with harmonics at multiples of ~140 Hz, but this harmonic stack is not apparent in the neurogram. Instead we see that the nerve fiber responses rapidly modulate their firing rates to produce a temporal pattern of bands at time intervals which either directly reflect the 7 ms interval of the glottal pulse train (for nerve fibers with CFs below 0.2 kHz or above 1 kHz) or at intervals that are integer fractions (harmonics) of the glottal pulse interval. In this manner auditory nerve fibers convey important cues for acoustic features such as periodicity pitch by phase locking their discharges to salient features of the temporal fine structure of speech sounds with sub‐millisecond accuracy. As an aside, note that it is quite common for severe hearing impairment to be caused by an extensive loss of auditory hair cells in the cochlea, which can leave the auditory nerve fibers largely intact. In such patients it is now often possible to restore some hearing through cochlear implants, which use electrode arrays implanted along the tonotopic array to deliver direct electrical stimulation to the auditory nerve fibers. The electrical stimulation patterns delivered by the 20‐odd electrode contacts provided by these devices are quite crude compared to the activity patterns created when the delicate dance of the basilar membrane is captured by some 3,000 phenomenally sensitive auditory hair cells, but because coarsely resolving only a modest number of formant peaks is normally sufficient to allow speech sounds to be discriminated, the large majority of deaf cochlear implant patients do gain the ability to have pretty normal spoken conversations – as long as there is little background noise. Current cochlear implant processors are essentially incapable of delivering any of the temporal fine structure information which we have just described via the auditory nerve, and consequently cochlear implant users miss out on things like periodicity pitch cues, which may help them separate out voices in a cluttered auditory scene. A lack of temporal fine structure can also affect the perception of dialect and affect in speech, as well as melody, harmony, and timbre in music.
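How readily periodicity can be read out of such temporal fine structure is easy to demonstrate with an autocorrelation, the textbook signal-processing route to periodicity pitch. We offer the sketch below purely as an illustration of the cue itself, not as a claim about how neurons compute it; the 'voiced segment' is just a synthetic 140 Hz pulse train with added noise, standing in for the waveform of Figure 3.4.

    import numpy as np

    fs = 16000
    f0 = 140.0                                   # glottal pulse rate from the example above

    # Synthetic stand-in for a voiced segment: pulse train plus a little noise.
    n = int(0.3 * fs)
    x = np.zeros(n)
    x[::int(round(fs / f0))] = 1.0
    x += 0.05 * np.random.default_rng(1).standard_normal(n)

    # A periodic signal resembles itself shifted by one period, so the strongest
    # autocorrelation peak at a non-zero lag marks the fundamental period.
    ac = np.correlate(x, x, mode="full")[n - 1:]
    min_lag, max_lag = int(fs / 400), int(fs / 80)    # limit the search to a plausible pitch range
    best_lag = min_lag + int(np.argmax(ac[min_lag:max_lag]))

    print(f"estimated period: {1000 * best_lag / fs:.2f} ms "
          f"(about {fs / best_lag:.1f} Hz); true pulse rate: {f0:.0f} Hz")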

Subcortical pathways

As we saw in Figure 3.1, the neural activity patterns just described are passed on first to the cochlear nuclei, and from there through the superior olivary nuclei to the midbrain, thalamus, and primary auditory cortex. As mentioned, each of these stations of the lemniscal auditory pathway has a tonotopic structure, so all we learned in the previous section about tonotopic arrays of neurons representing speech formant patterns neurogram style still applies at each of these stations. But that is not to say that the neural representation of speech sounds does not undergo some transformations along these pathways. For example, the cochlear nuclei

contain a variety of different neural cell types that receive different types of converging inputs from auditory nerve fibers, which may make them more or less sensitive to certain acoustic cues. So‐called octopus cells, for example, collect inputs across a number of fibers across an extent of the tonotopic array, which makes them less sharply frequency tuned but more sensitive to the temporal fine structure of sounds such as glottal pulse trains (Golding & Oertel, 2012). So‐called bushy cells in the cochlear nucleus are also very keen on maintaining temporal fine structure encoded in the timing of auditory nerve fiber inputs with very high precision, and passing this information on undiminished to the superior olivary nuclei (Joris, Smith, & Yin, 1998). The nuclei of the superior olive receive converging (and, of course, tonotopically organized) inputs from both ears, which allows them to compute binaural cues to the direction that sounds may have come from (Schnupp & Carr, 2009). Thus, firing‐rate distributions between neurons in the superior olive, and in subsequent processing stations, may provide information not just about formants or voicing of a speech sound, but also about whether the speech came from the left or the right or from straight ahead. This adds further dimensions to the neural representation of speech sounds in the brainstem, but much of what we have seen still applies: formants are represented by peaks of activity across the tonotopic array, and the temporal fine structure of the sound is represented by the temporal fine structure of neural firing patterns.

However, while the tonotopic representation of speech formants remains preserved throughout the subcortical pathways up to and including those in the primary auditory cortex, temporal fine structure at fast rates of up to several hundred hertz is not preserved much beyond the superior olive. Maintaining the sub‐millisecond precision of firing patterns across a chain of chemical synapses and neural cell membranes that typically have temporal jitter and time constants in the millisecond range is not easy. To be up to the job, neurons in the cochlear nucleus and olivary nuclei have specialized synapses and ion channels that more ordinary neurons in the rest of the nervous system lack. It is therefore generally thought that temporal fine structure cues to aspects such as the periodicity pitch of voiced speech sounds become recoded as they ascend the auditory pathway beyond the brainstem. Thus, from about the inferior colliculus onward, temporal fine structure at fast rates is increasingly less represented through fast and highly precise temporal firing patterns, but instead through neurons becoming periodicity tuned (Frisina, 2001); this means that their firing rates may vary as a function of the fundamental frequency of a voiced speech sound, in addition to depending on the amount of sound energy in a particular frequency band. Some early work on periodicity tuning in the inferior colliculus has led to the suggestion that this structure may even contain a periodotopic map (Schreiner & Langner, 1988), with neurons tuned to different periodicities arranged along an orderly periodotopic axis running through the whole length of the inferior colliculus, with the periodotopic gradient more or less orthogonal to the tonotopic axis.
Such an arrangement would be rather neat: periodicity being a major cue for sound features such as voice pitch, a periodotopic axis might, for example, physically separate out the representations of voices that differ substantially in pitch.

70  Sensing Speech But, while some later neuroimaging studies seemed to support the idea of a periodotopic map in the inferior colliculus (Baumann et al., 2011), more recent, very detailed, and comprehensive recordings with microelectrode arrays have shown conclusively that there are no consistent periotopic gradients running the width, breadth, or depth of the inferior colliculus (Schnupp, Garcia‐Lazaro, & Lesica, 2015), nor are such periodotopic maps a consistent feature of primary auditory cortex (Nelken et al., 2008). Thus, tuning to periodicity (and, by implication, voicing and voice pitch), as well as to cues for sound‐source direction, is widespread among neurons in the lemniscal auditory pathway from at least the midbrain upward, but neurons with different tuning properties appear to be arranged in clusters without much overarching systematic order, and their precise arrangement can differ greatly from one individual to the next. Thus, neural populations in these structures are best thought of as a patchwork of neurons that are sensitive to multiple features of speech sounds, including pitch, sound‐source direction, and formant structure (Bizley et al., 2009; Walker et al., 2011), without much discernible overall anatomical organization other than tonotopic order.
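What it means for a neuron to be periodicity tuned can be illustrated with a toy detector that simply measures how strongly its input repeats at one preferred delay. Real midbrain neurons do nothing so literal, and every parameter below is invented for illustration; the point is only that units with different preferred delays respond best to pulse trains with different fundamental frequencies, so that firing rate, rather than precise spike timing, comes to carry the periodicity information.

    import numpy as np

    fs = 16000

    def pulse_train(f0, duration=0.2):
        """Synthetic voiced-like input: one impulse per period of f0."""
        n = int(duration * fs)
        x = np.zeros(n)
        x[::int(round(fs / f0))] = 1.0
        return x

    def periodicity_unit(x, preferred_f0):
        """Toy periodicity detector: normalized autocorrelation at one preferred lag."""
        lag = int(round(fs / preferred_f0))
        return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

    preferred_f0s = [100, 150, 200, 300]          # a small bank of periodicity-tuned units
    for stimulus_f0 in [100, 150, 200, 300]:
        x = pulse_train(stimulus_f0)
        responses = [periodicity_unit(x, pf) for pf in preferred_f0s]
        best = preferred_f0s[int(np.argmax(responses))]
        print(f"pulse train at {stimulus_f0:3d} Hz -> best-responding unit prefers {best:3d} Hz")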

Primary auditory cortex

So far, in the first half of this chapter we have talked about how speech is represented in the inner ear and auditory nerve, and along the subcortical pathways. However, for speech to be perceived, the percolation of auditory information must reach the cortex. Etymologically, the word cortex is Latin for "rind," which is fitting as the cerebral cortex covers the outer surface of the brain – much like a rind covers citrus fruit. Small mammals like mice and tree shrews are endowed with relatively smooth cortices, while the cerebral cortices of larger mammals, including humans (Homo sapiens) and, even more impressively, African bush elephants (Loxodonta africana), exhibit a high degree of cortical folding (Prothero & Sundsten, 1984). The more folded, wrinkled, or crumpled your cortex, the more surface area can fit into your skull. This is important because a larger cortex (relative to body size) means more neurons, and more neurons generally mean more computational power (Jerison, 1973). For example, in difficult, noisy listening conditions, the human brain appears to recruit additional cortical regions (Davis & Johnsrude, 2003), which we shall come back to in the next few sections. In this section, we begin our journey through the auditory cortex by touching on the first cortical areas to receive auditory inputs: the primary auditory cortex.

Anatomy and tonotopicity of the human primary auditory cortex

In humans the primary auditory cortex (PAC) is located around a special wrinkle in the cortical sheet, known as Heschl's gyrus (HG). A gyrus is a ridge, where the cortical sheet is folded outward, while a sulcus describes an inward fold or valley.

How Does the Brain Represent Speech?  71 There are multiple HG in each brain. First, all people have at least two HG, one in each cerebral hemisphere (the left and right halves of the visible brain). These are positioned along the superior aspect of each temporal lobe. In addition, some brains have a duplication in the HG, that is, one or both hemispheres have two ridges instead of one (Da Costa et al., 2011). This anatomical factoid can be useful for identifying PAC (also known as A1) in real brains (as we shall see in Figure 3.5). However, the gyri are used only as landmarks: what matters is the sheet of neurons in and around HG, not whether that area is folded once or twice. This sheet of neurons receives connections from the subcortical auditory pathways, most prominently via the medial geniculate body of the thalamus (see Figure  3.1 and the previous section). When the cortex is smoothed, in silico, using computational image processing, the primary auditory cortex can be shown to display the same kind of tonotopic maps that we observed in the cochlea and in subcortical regions. This has been known from invasive microelectrode recordings in laboratory animals for decades and can be confirmed to be the case in humans using noninvasive MRI (magnetic resonance imaging) by playing subjects stimuli at different tones and then modeling the optimal cortical responses to each tone. This use of functional MRI (fMRI) results in the kind of tonotopic maps shown in Figure 3.5. Left hemisphere

[Figure 3.5 shows a flattened left‐hemisphere view, with anterior–posterior and medial–lateral axes marked and a color scale running from 200 to 6400 Hz.]

Figure 3.5  Tonotopic map. HG = Heschl’s gyrus; STG = superior temporal gyrus; SG = supramarginal gyrus. (Source: Adapted from Humphries, Liebenthal, & Binder, 2010.)

72  Sensing Speech Figure 3.5 depicts a flattened view of the left‐hemisphere cortex colored in dark gray. Superimposed onto the flattened cortex is a tonotopic map (grayscale corresponding to the color bar on the bottom right). Each point on the surface of the tonotopic map has a preferred stimulus frequency, in hertz, and along the dotted arrow across HG there is a gradient pattern of responses corresponding to low frequencies, high frequencies, and then low frequencies again. Given this tonotopic organization of the primary auditory cortex, which is in some respects not that different from the tonotopy seen in lower parts of the auditory system, we may expect the nature of the representation of sounds (including speech sounds) in this structure to be to a large extent spectrogram‐like. That is, if we were to read out the firing‐rate distributions along the frequency axes of these areas while speech sounds are represented, the resulting neurogram of activity would exhibit dynamically shifting peaks and troughs that reflect the changing formant structure of the presented speech. That this is indeed the case has been shown in animal experiments by Engineer et al. (2008), who, in one set of experiments, trained rats to discriminate a large set of consonant–vowel syllables and, in another, recorded neurograms for the same set of syllables from the primary cortices of anesthetized rats using microelectrodes. They found, first, that rats can learn to discriminate most American English syllables easily, but are more likely to confuse syllables that humans too find more similar and easier to confuse (e.g. ‘sha’ vs. ‘da’ is easy, but ‘sha’ vs. ‘cha’ is harder). Second, Engineer et al. found that the ease with which rats can discriminate between two speech syllables can be predicted by how different the primary auditory cortex neurograms for these syllables are. These data would suggest that the representation of speech in primary auditory cortex is still a relatively unsophisticated time–frequency representation of sound features, with very little in the way of recognition, categorization, or interpretation. Calling primary auditory cortex unsophisticated is, however, probably doing it an injustice. Other animal experiments indicate that neurons in the primary auditory cortex can, for example, change their frequency tuning quickly and substantially if a particular task requires attention to be directed to a particular frequency band (Edeline, Pham, & Weinberger,  1993; Fritz et  al., 2003). Primary auditory cortex neurons can even become responsive to stimuli or events that aren’t auditory at all if these events are firmly associated with sound‐related tasks that an animal has learned to master (Brosch, Selezneva, & Scheich,  2005). Nevertheless, it is currently thought that the neural representations of sounds and events in the primary auditory cortex are probably based on detecting relatively simple acoustic features and are not specific to speech or vocalizations, given that the primary cortex does not seem to have any obvious preference for speech over nonspeech stimuli. In the human brain, to find the first indication of areas that appear to prefer speech to other, nonspeech sounds, we must move beyond the tonotopic maps of the primary auditory cortex (Belin et al., 2000; Scott et al., 2000). 
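The core of such a tonotopic mapping analysis is simple enough to spell out: present tones at a handful of frequencies, obtain one response value per tone for every voxel or recording site, and label each site with the tone that drives it best. The sketch below runs that winner-take-all version on simulated data – the Gaussian tuning curves, the number of sites, and the noise level are all invented – and is far simpler than the model-based procedures used in actual fMRI studies, but it conveys how a best-frequency gradient like the one in Figure 3.5 is read out.

    import numpy as np

    rng = np.random.default_rng(42)
    tone_freqs = np.array([200, 400, 800, 1600, 3200, 6400])   # Hz, as on the Figure 3.5 scale

    # Simulate 20 recording sites whose true preferred frequencies form a smooth
    # gradient, i.e. a toy tonotopic axis running from low to high CF.
    true_bf = np.geomspace(200.0, 6400.0, 20)

    # Each site responds most strongly to tones near its preferred frequency
    # (Gaussian tuning on a log-frequency axis), plus measurement noise.
    octave_distance = np.log2(tone_freqs[None, :] / true_bf[:, None])
    responses = np.exp(-0.5 * (octave_distance / 0.5) ** 2)
    responses += 0.1 * rng.standard_normal(responses.shape)

    # Winner-take-all readout: label every site with its best frequency.
    best_freq = tone_freqs[np.argmax(responses, axis=1)]
    for site, (bf_true, bf_est) in enumerate(zip(true_bf, best_freq)):
        print(f"site {site:2d}: true BF {bf_true:7.1f} Hz -> estimated best frequency {int(bf_est):5d} Hz")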
In the following sections we will continue our journey through the auditory system into cortical regions that appear to make specialized contributions to speech processing, and which are situated in the temporal, parietal, and frontal lobes. We will also discuss how these regions communicate with each other in

[Figure 3.6 also labels anatomical landmarks, including the central sulcus, Sylvian fissure, anterior cingulate, medial and middle frontal gyri, fusiform gyrus, the frontal, parietal, temporal, and occipital lobes, and the lip and foot areas of SMC.]

Figure 3.6  A map of cortical areas involved in the auditory representation of speech. PAC = primary auditory cortex; STG = superior temporal gyrus; aSTG = anterior STG; pSTG = posterior STG; IFG = inferior frontal gyrus; PMC = pre‐motor cortex; SMC = sensorimotor cortex; IPL = inferior parietal lobule. Dashed lines indicate medial areas. (Source: Adapted from Rauschecker & Scott, 2009.)

noisy contexts and during self‐generated speech, when information from the (pre) motor cortex influences speech perception, and look at representations of speech in time. Figure 3.6 introduces the regions and major connections to be discussed. In brief, we will consider the superior temporal gyrus (STG) and the premotor cortex (PMC), and then loop back to the STG to discuss how brain regions in the auditory system work together as part of a dynamic network.

What does the higher‐order cortex add?

All the systems that we have reviewed so far on our journey along the auditory pathway have been general auditory‐processing systems. So, although they are important for speech processing, their function is not speech specific. For example, the cochlea converts pressure waves into electrical impulses, whether the pressure waves are encoding a friendly 'hello' or the sound of falling rain; the subcortical pathways process and propagate these neural signals to the primary auditory

74  Sensing Speech cortex, regardless of whether they are encoding a phone conversation, barking dogs, or noisy traffic; and the primary auditory cortex exhibits a tonotopic representation of an auditory stimulus, whether that stimulus is part of a Shakespearean soliloquy or of Ravel’s Boléro. In this section, we encounter a set of cortical areas that preferentially process speech over other kinds of auditory stimuli. We will also describe deeply revealing new work into the linguistic‐phonetic representation of speech, obtained using surgical recordings in human brains.

Speech‐preferential areas That areas of the brain exist that are necessary for the understanding of speech but not for general sound perception has been known since the nineteenth century, when the German neurologist Carl Wernicke associated the aphasia that bears his name with damage to the STG (Wernicke, 1874). Wernicke’s eponymous area was, incidentally, reinterpreted by later neurologists to refer only to the posterior third of the STG and adjacent parietal areas (Bogen & Bogen, 1976), although some disagreement about its precise boundaries continues until this day (Tremblay & Dick, 2016). With the advent of fMRI at the end of the twentieth century, the posterior STG (pSTG) was confirmed to respond more strongly to vocal sounds than to nonvocal sounds (e.g. speech, laughter, or crying compared to the sounds of wind, galloping, or cars; Belin et al., 2000). Neuroimaging also revealed a second, anterior, area in the STG, which responds more to vocal than to nonvocal sounds (Belin et al., 2000). These voice‐preferential areas can be found in both hemispheres of the brain. Additional studies have shown that it is not just the voice but also intelligible speech that excites these regions, with speech processing being more specialized in the left hemisphere (Scott et  al., 2000). Anatomically, the anterior and posterior STG receive white‐matter connections from the primary auditory cortex, and in turn feed two auditory‐processing streams, one antero‐ventral, which extends into the inferior frontal cortex, and the other postero‐dorsal, which curves into the inferior parietal lobule. The special function of these streams remains a matter of debate. For example, Rauschecker and Scott (2009) propose that the paths differ in processing what and where information in the auditory signal, where what refers to recognizing the cause of the sound (e.g. it’s a thunderclap) and where to locating the sound’s spatial location (e.g. to the west). Another, more linguistic, suggestion is that the ventral stream is broadly semantic, whereas the dorsal stream may be described as more phonetic in nature (Hickok & Poeppel, 2004). Whatever the functions, however, there appear to be two streams diverging around the anterior and posterior STG. Over the years, these early STG results have been replicated many times using neuroimaging (Price,  2012). Each technique for observing activity of the human brain, whether it is noninvasive magnetoencephalography (MEG) or fMRI, or invasive surgical techniques such electrocorticography (ECoG; described in the next section), all have their limitations and shortcomings. It is therefore reassuring that the insights into the neuroanatomy of speech comprehension established by

methods like MEG or fMRI, which can image the whole brain, are both confirmed and extended by studies using targeted surgical techniques like ECoG.

Auditory phonetic representations in the superior temporal gyrus ECoG, which involves the placement of electrodes directly onto the surface of the brain, cannot easily record from the primary auditory cortex. This is because the PAC is tucked away inside the Sylvian fissure, along the dorsal aspect of the temporal lobe. At the same time, because ECoG measures the summed postsynaptic electrical current of neurons with millisecond resolution, it is sensitive to rapid neural responses at the timescale of individual syllables, or even individual phones. By contrast, fMRI measures hemodynamic responses; these are changes in blood flow that are related to neural activity but occur on the order of seconds. In recent years, the use of ECoG has revolutionized the study of speech in auditory neuroscience. An exemplar of this can be found in a recent paper (Mesgarani et al., 2014). Mesgarani et al. (2014) used ECoG to learn about the linguistic‐phonetic representation of auditory speech processing in the STG of six epileptic patients. These patients listened passively to spoken sentences taken from the TIMIT corpus (Garofolo et al., 1993), while ECoG was recorded from their brains. These ECoG recordings were then analyzed to discover patterns in the neural responses to individual speech sounds (for a summary of the experimental setup, see Figure 3.7, panels A–C). The authors used a phonemic analysis of the TIMIT dataset to group the neural responses at each electrode, according to the phoneme that caused it. For examples, see panel D of Figure 3.7, which allows the comparison of responses to different speech sounds for a number of different sample electrodes labeled e1 to e5. The key observation here is that an electrode such as e1 gives similar responses for /d/ and /b/ but not for /d/ and /s/, and that the responses at each of the electrodes shown will respond strongly for some groups of speech sounds but not others. Given these data, we can ask the question: Do STG neurons group, or classify, speech segments through the similarity of their response patterns? And, if so, which classification scheme do they use? Linguists and phoneticians often analyze individual speech sounds into feature classes, based, for example, on either the manner or the place of articulation that is characteristic for that speech act. Thus, /d/, /b/, and /t/ are all members of the plosive manner‐of‐articulation class because they are produced by an obstruction followed by a sudden release of air through the vocal tract, and /s/ and /f/ belong to the fricative class because both are generated by turbulent air hissing through a tight constriction in the vocal tract. At the same time, /d/ and /s/ also belong to the alveolar place‐of‐articulation class because, for both phonemes, the tip of the tongue is brought up toward the alveolar ridge just behind the top row of the teeth. In contrast, /b/ has a labial place of articulation because to articulate /b/ the airflow is constricted at the lips. Manner features are often associated with particular acoustic characteristics. Plosives involve characteristically brief intervals of silence followed by a short noise burst, while fricatives exhibit sustained


Figure 3.7  Feature‐based representations in the human STG. (a) shows left‐hemisphere cortex with black dots indicating ECoG electrodes. (b) shows an example acoustic stimulus (and what eyes they were), including orthography, waveform, spectrogram, and IPA transcription. (c) shows time‐aligned neural responses to the acoustic stimulus. The electrodes (y‐axis) are sorted spatially (anterior to posterior), with time (in seconds) along the x‐axis. (d) shows sample phoneme responses by electrode. For five electrodes (e1 to e5), the plots show cortical selectivity for English phonemes (y‐axis) as a function of time (x‐axis), with phoneme onsets indicated by vertical dashed lines. The phoneme selectivity index (PSI) is a summary over time of how selective the cortical response is for each phoneme. (e) shows phoneme responses (PSIs) for all electrodes, arranged for hierarchical clustering analyses. (f) and (g) show clustering analyses by phoneme and by electrode. These show how phonemes and electrodes are grouped, respectively, with reference to phonetic features. For example, (f) shows that electrodes can be grouped by selectivity to obstruents and sonorants. Source: Mesgarani et al., © 2014, The American Association for the Advancement of Science.

Plosives involve characteristically brief intervals of silence followed by a short noise burst, while fricatives exhibit sustained aperiodic noise spread over a wide part of the spectrum. Classifying speech sounds by place and manner of articulation is certainly very popular among speech scientists, and it is also implied in the structure of the International Phonetic Alphabet (IPA), but it is by no means the only possible scheme. Speech sounds can also be described and classified according to alternative acoustic properties or perceptual features, such as loudness and pitch. An example of a feature that is harder to characterize in articulatory or acoustic terms is sonority. Sonority defines a scale of perceived loudness (Clements, 1990) such that vowels are the most sonorous, glides are the next most sonorous, followed by liquids, nasals, and finally obstruents (i.e. fricatives and plosives). Despite the idea of sonority as a multitiered scale, phonemes are sometimes lumped into just two groups, sonorant and nonsonorant, with everything but the obstruents counting as sonorant. As these examples illustrate, there could in principle be many different ways in which speech sounds are grouped.

To ask which grouping is "natural" or "native" for the STG, Mesgarani et al. (2014) used hierarchical clustering of the neural responses to speech, examples of which can be seen in the ECoG recordings depicted in Figure 3.7, panel D. The results of the clustering analysis follow in Figure 3.7, panels E–G. Perhaps surprisingly, Mesgarani et al. (2014) discovered that the STG was organized primarily by manner‐of‐articulation features and secondarily by place‐of‐articulation features. The prominence of manner‐of‐articulation features can be seen by clustering the phonemes directly (Figure 3.7, panel F). For example, on the right‐side dendrogram we find neat clusters of plosives /d b g p k t/, fricatives /ʃ z s f θ/, and nasals /m n ŋ/. Manner‐of‐articulation features also stand out when the electrodes are clustered (Figure 3.7, panel G). By going up a column from the bottom dendrogram, we can find the darkest cells (those with the greatest selectivity for phonemes) and then follow those rows to the left to identify the phonemes for which the electrode signal was strongest. The electrode indexed by the leftmost column, for example, recorded neural activity that appeared selective for the plosives /d b g p k t/. In this way, we may also find electrodes that respond to both manner‐ and place‐of‐articulation features: the fifth column from the left, for example, responds to the bilabial plosives /b p/. Thus, the types of features that phoneticians have long employed to classify speech sounds turn out to be reflected in the neural patterns across the STG. Mesgarani et al. (2014) argue that this pattern of organization, prioritizing manner over place‐of‐articulation features, is most consistent with auditory‐perceptual theories of feature hierarchies (Stevens, 2002; Clements, 1985). Auditory‐perceptual theories contrast, for instance, with articulatory or gestural theories, which Mesgarani et al. (2014) assert would have prioritized place‐of‐articulation features (Fowler, 1986).

The clustering analyses in Figure 3.7 (panels F and G) are doubly rich: they support a broadly auditory‐perceptual view of sound representations in the STG while also revealing limitations of that view. For instance, in the right‐side cluster (panel F), we find that the phonemes /v/ and /ð/ do not cluster with the other fricatives /ʃ z s f θ/.
Instead, these fricatives cluster with the sonorants. Moreover, /v/ and /ð/ are most closely clustered with a group of high front and central vowels and glides /j ɪ i ʉ/.

This odd grouping may reflect noise at some level of the experiment or analysis, but it raises the intriguing possibility that the STG actually groups /j ɪ i ʉ v ð/ together, and thus does not strictly follow established phonetic conventions. Therefore, in addition to articulatory, acoustic, and auditory phonetics, studies such as this on the cortical response to speech may pave the way to innovative neural feature analyses. However, we would like to emphasize that these are early results in the field. The use of discrete segmental phonemes may, for example, be considered a useful first approximation to analyses using more complex, overlapping feature representations.
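To give a concrete sense of what this kind of clustering analysis involves, the sketch below groups phonemes by the similarity of their electrode response profiles using standard scientific Python tools. It is only an illustrative sketch: the selectivity matrix is filled with invented numbers, and the variable names, electrode count, and cluster count are our own assumptions rather than anything taken from Mesgarani et al. (2014).

```python
# A minimal sketch of hierarchical clustering of phoneme responses.
# Rows of `psi` are phonemes, columns are electrodes; the values are
# invented stand-ins for phoneme selectivity indices (PSIs).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

phonemes = ["d", "b", "g", "p", "k", "t", "s", "z", "f", "m", "n"]
rng = np.random.default_rng(0)
psi = rng.random((len(phonemes), 16))  # 16 hypothetical electrodes

# Cluster phonemes by the similarity of their response profiles.
Z = linkage(psi, method="ward")

# Cut the dendrogram into three clusters and list their members.
labels = fcluster(Z, t=3, criterion="maxclust")
for k in sorted(set(labels)):
    print(k, [p for p, lab in zip(phonemes, labels) if lab == k])

# Clustering electrodes instead of phonemes amounts to transposing the matrix.
Z_electrodes = linkage(psi.T, method="ward")
```

With real selectivity values, the composition of the resulting clusters is what tells us whether manner, place, or some other grouping dominates.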

Auditory phonetic representations in the sensorimotor cortex

From the STG we turn now to a second cortical area. The ventral sensorimotor cortex (vSMC) is better known for its role in speech production than in speech comprehension (Bouchard et al., 2013). This part of the cortex, near the ventral end of the SMC (see Figure 3.6), contains the primary motor and somatosensory areas that send motor commands to, and receive touch and proprioceptive information from, the face, lips, jaw, tongue, velum, and pharynx. The vSMC plays a key role in controlling the muscles associated with these articulators, and it is further involved in monitoring feedback from the sensory nerves in these areas when we speak.

Less widely known is that the vSMC also plays a role in speech perception. We know, for example, that a network including frontal areas becomes more active when the conditions for perceiving speech become more difficult (Davis & Johnsrude, 2003), such as when there is background noise or when the voices of multiple speakers overlap, compared with easy listening conditions in which such distractions are absent. This context‐specific recruitment of speech‐production areas may signal that they play an auxiliary role in speech perception, providing additional computational resources when the STG is overburdened. We might ask how the vSMC, as an auxiliary auditory system that is primarily dedicated to coordinating the articulation of speech, represents heard speech. Does the vSMC represent overt and heard speech similarly or differently? Is its representation of heard speech similar to or different from that of the STG?

ECoG studies of speech production (Bouchard et al., 2013; Cheung et al., 2016) suggest that place‐of‐articulation features take primacy over manner‐of‐articulation features in the vSMC, which is the reverse of what we described for the STG (Mesgarani et al., 2014). Given that the vSMC contains a map of body parts like the lips and tongue, it makes sense that this region should represent speech in terms of place‐of‐articulation features rather than manner‐of‐articulation features. But does this representation in the vSMC hold during both speech production and comprehension? Our starting hypothesis might be that, yes, the feature representations in the vSMC will be the same regardless of task. There is even some theory to back this up: proposals such as the motor theory of speech perception (Liberman et al., 1967; Liberman & Mattingly, 1985) and the analysis‐by‐synthesis theory (Stevens, 1960) view speech perception as an active rather than a passive process.

Analysis by synthesis says that speech perception involves trying to match what you hear to what your own mouth, and other articulators, would have needed to do to produce what you heard. Speech comprehension would therefore involve an active process of covert speech production. Following this line of thought, we might suppose that what the vSMC does when it is engaged in deciphering what your friend is asking you at a noisy cocktail party is, in some sense, the same as what it does when you articulate your reply. Because we know that place‐of‐articulation features take priority over manner‐of‐articulation features in the vSMC during a speech‐production task (i.e. reading consonant–vowel syllables aloud), we might hypothesize that place‐of‐articulation features will similarly take primacy during passive listening.

Interestingly, this theoretically motivated prediction turns out to be wrong. When Cheung et al. (2016) examined neural response patterns in the vSMC while subjects listened to recordings of speech, they found that, as in the STG, it was the manner‐of‐articulation features that took precedence. In other words, representations in the vSMC were conditioned by task: during speech production the vSMC favored place‐of‐articulation features (Bouchard et al., 2013; Cheung et al., 2016), but during speech comprehension it favored manner‐of‐articulation features (Cheung et al., 2016). As we discussed earlier, the STG is also organized according to manner‐of‐articulation features when subjects listen to speech (Mesgarani et al., 2014). The representations in these two areas, STG and vSMC, therefore appear to use a similar type of code when they represent heard speech.

To be more concrete, Cheung et al. (2016) recorded ECoG from the STG and vSMC of subjects performing two tasks. One task involved reading aloud from a list of consonant–vowel syllables (e.g. 'ba,' 'da,' 'ga'), while the other involved listening to recordings of people producing these syllables. Instead of using hierarchical clustering, as Mesgarani et al. (2014) did in their study of the STG, Cheung et al. (2016) used a dimensionality‐reduction technique called multidimensional scaling (MDS), but with the similar goal of describing the structure of phoneme representations in the brain during each task (Figure 3.8). For the speaking task, the dimensionality‐reduced vSMC representations for eight sounds could be linearly separated into three place‐of‐articulation classes: labial /p b/, alveolar /t d s ʃ/, and velar /k g/ (see Figure 3.8, panel D). The same phonemes could not be linearly separated by place of articulation in the listening task (Figure 3.8, panel E); however, they could be linearly separated into another set of classes (Figure 3.8, panel G): voiced plosives /d g b/, voiceless plosives /k t p/, and fricatives /ʃ s/. These are the same manner‐of‐articulation and voicing features that characterize the neural responses to heard speech in the STG (Figure 3.8, panel F). Again, the implication is that the vSMC has two codes for representing speech, suggesting either that there are two distinct but anatomically intermingled neural populations in the vSMC, or that the same population of neurons is capable of operating in two very different representational modes. Unfortunately, the spatial resolution of ECoG electrodes is still too coarse to resolve this ambiguity, so other experimental techniques will be needed.
For now, we can only say that during speech production the vSMC uses a feature analysis that emphasizes place‐of‐articulation features, but during speech comprehension the vSMC uses a feature analysis that instead emphasizes manner features and voicing.

Figure 3.8  Feature‐based representations in the human sensorimotor cortex. (a) and (b) show the most significant electrodes (gray dots) for listening and speaking tasks. (c) presents a feature analysis of the consonant phonemes used in the experiments. The left phoneme in each pair is unvoiced and the right phoneme is voiced (e.g. /p/ is unvoiced and /b/ is voiced). (d–g) are discussed in the main text; each panel shows a low‐dimensional projection of the neural data where distance between phoneme representations is meaningful (i.e. phonemes that are close to each other are represented similarly in the neural data). The dotted lines show how groups of phonemes can be linearly separated (or not) according to place of articulation, manner of articulation, and voicing features. Source: Cheung et al., 2016. Licensed under CC BY 4.0.

An intriguing possibility is that the existence of similar representations for heard speech in the STG and the vSMC may play an important role in the communication, or connectivity, between distinct cortical regions – a topic we touch on in the next section.
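For readers curious about the multidimensional scaling step itself, the sketch below shows the bare mechanics using scikit‐learn. The dissimilarity matrix here is invented for illustration; in the actual study the dissimilarities were derived from cortical response patterns.

```python
# A minimal sketch of multidimensional scaling (MDS) over phonemes.
# The dissimilarity matrix below is toy data, not neural recordings.
import numpy as np
from sklearn.manifold import MDS

phonemes = ["p", "b", "t", "d", "k", "g", "s", "sh"]
rng = np.random.default_rng(1)

d = rng.random((len(phonemes), len(phonemes)))
dissim = (d + d.T) / 2            # make it symmetric
np.fill_diagonal(dissim, 0.0)     # zero self-dissimilarity

# Project the phonemes into two dimensions (cf. MDS1/MDS2 in Figure 3.8).
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)

for p, (x, y) in zip(phonemes, coords):
    print(f"{p}: MDS1 = {x:+.2f}, MDS2 = {y:+.2f}")
```

Whether the resulting two‐dimensional points can be separated by straight lines into place, manner, or voicing groups is then an empirical question about the data, as in panels D–G of Figure 3.8.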

Systems‐level representations and temporal prediction

Our journey through the auditory system has focused on specific regions and on the auditory representation of speech in these regions. However, representations in the brain are not limited to isolated islands of cells, but also rely upon constellations of regions that relay information within a network.

In this section, we touch briefly on the topic of systems‐level representations of speech perception and on the related topic of temporal prediction, which is at the heart of why we have brains in the first place.

Auditory feedback networks

One way to appreciate the dynamic interconnectedness of the auditory brain is to consider the phenomenon of auditory suppression. Auditory suppression manifests, for example, in a comparison of STG responses when we listen to another person speak and when we speak ourselves, and thus hear the sounds we produce. Electrophysiological studies have shown that auditory neurons in monkeys are suppressed during self‐vocalization (Müller‐Preuss & Ploog, 1981; Eliades & Wang, 2008; Flinker et al., 2010). This finding is consistent with fMRI and ECoG results in humans, showing that activity in the STG is suppressed during speech production compared to speech comprehension (Eliades & Wang, 2008; Flinker et al., 2010). The reason for this auditory suppression is thought to be an internal signal (an efference copy) received from another part of the brain, such as the motor or premotor cortex, which has inside information about external stimuli when those stimuli are self‐produced (Holst & Mittelstaedt, 1950). The brain's use of this kind of inside information is not, incidentally, limited to the auditory system: anyone who has tried and failed to tickle themselves has experienced another kind of sensory suppression, again thought to be based on internally generated expectations (Blakemore, Wolpert, & Frith, 2000).

Auditory suppression in the STG is also a function of language proficiency. As an example, Parker Jones et al. (2013) explored the interactions between the premotor cortex (PMC) and two temporal areas (aSTG and pSTG) when native and nonnative English speakers performed speech‐production tasks, such as reading and picture naming, in an MRI scanner. The fMRI data were then subjected to a kind of connectivity analysis, which can tell us which regions influenced which other regions of the brain. Technically, the observed signals were deconvolved to model the effect of the hemodynamic response, and the underlying neural dynamics were inferred by inverting a generative model based on a set of differential equations (Friston, Harrison, & Penny, 2003; Daunizeau, David, & Stephan, 2011). A positive connection between two regions, A and B, means that, when the response in A is strong, the response in B will increase (i.e. B will have a positive derivative). Likewise, a negative connection means that, when the response in A is strong, the response in B will decrease (B will have a negative derivative).
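For readers who want to see how such positive and negative connections are expressed formally, the neural part of this kind of generative model (dynamic causal modelling; Friston, Harrison, & Penny, 2003) can be written, in simplified form, as a differential equation in which the activity x of the modelled regions evolves under a coupling matrix A, input‐dependent modulations B(j), and direct inputs u:

\[
\dot{x} \;=\; \Bigl(A + \sum_j u_j B^{(j)}\Bigr)\,x \;+\; C\,u .
\]

The signs of the coupling parameters are what the text calls positive and negative connections: a negative entry coupling PMC to pSTG means that strong PMC activity pushes the rate of change of pSTG activity downward, which is how auditory suppression appears at the level of the model. (This simplified statement leaves out the hemodynamic model that links the neural states to the measured fMRI signal.)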

Between the PMC and the temporal auditory areas, Parker Jones et al. (2013) observed significant negative connections, implying that brain activity in the PMC caused a decrease in auditory temporal activity, consistent with auditory suppression. However, auditory suppression was observed only in the native English speakers. In nonnative speakers there was no significant auditory suppression, but there was a positive effect between pSTG and PMC, consistent with the idea of error feedback. The results suggest that the PMC sends signal‐canceling, top‐down predictions to aSTG and pSTG. These top‐down predictions are stronger if you are a native speaker and more confident about what speech sounds you produce. In nonnative speakers, the top‐down predictions canceled less of the auditory input, and a bottom‐up learning signal ("error") was fed back from the pSTG to the PMC. Interestingly, as the nonnative speakers became more proficient, the learning signals were observed to decrease, so that the most highly proficient nonnative speakers were indistinguishable from native speakers in terms of error feedback.

The example of auditory suppression argues for a systems‐level view of speech comprehension that includes both auditory and premotor regions of the brain. Theoretically, we might think of these regions as being arranged in a functional hierarchy, with the PMC located above both aSTG and pSTG. Top‐down predictions may thus be said to descend from the PMC to aSTG and pSTG, while bottom‐up errors percolate in the opposite direction, from pSTG to PMC. We note that the framework used to interpret the auditory‐suppression results, predictive coding, subtly inverts the view that perceptual systems in the brain passively extract knowledge from the environment; instead, it proposes that these systems actively try to predict their sense experiences (Ballard, Hinton, & Sejnowski, 1983; Mumford, 1992; Kawato, Hayakawa, & Inui, 1993; Dayan et al., 1995; Rao & Ballard, 1999; Friston & Kiebel, 2009). In a foundational sense, predictive coding frames the brain as a forecasting machine, one that has evolved to minimize surprises and to anticipate, rather than merely react to, events in the world (Wolpert, Ghahramani, & Flanagan, 2001). This is not necessarily to say that what it means to be a person is to be a prediction machine, but rather to conjecture that perceptual systems in our brains, at least sometimes, predict sense experiences.

Temporal prediction

The importance of prediction as a theme, and as a hypothetical explanation for neural function, also goes beyond explicit modeling in neural networks. We can invoke the idea of temporal prediction even when we do not know about the underlying connectivity patterns. Speech, for example, does not consist of a static set of phonemes; rather, it is a continuous sequence of events, such that hearing part of the sequence gives you information about other parts that you have yet to hear. In phonology, the sequential dependencies between phonemes are called phonotactics, and they can be viewed as a kind of prediction. That is, if the sequence /st/ is more common than /sd/, because /st/ (unlike /sd/) occurs in syllabic onsets, then it can be said that /s/ predicts /t/ more than /s/ predicts /d/. This use of phonotactics for prediction is made explicit in machine learning, where predictive models (e.g. bigram and trigram models historically or, more recently, recurrent neural networks) have played an important role in the development and commercial use of speech‐recognition technologies (Jurafsky & Martin, 2014; Graves & Jaitly, 2014).
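To make the idea of phonotactic prediction concrete, the sketch below estimates bigram probabilities over phone strings and uses them to compare how strongly /s/ predicts a following /t/ versus a following /d/. The tiny word list is invented purely for illustration; a real model would be estimated from a large transcribed corpus such as TIMIT.

```python
# A minimal bigram phonotactic model: P(next phone | current phone),
# estimated by counting phone pairs in a toy set of transcriptions.
from collections import Counter, defaultdict

words = ["s t o p", "f æ s t", "s t r i t", "m æ d", "s æ d", "b ɛ s t"]

bigram_counts = defaultdict(Counter)
for w in words:
    phones = w.split()
    for a, b in zip(phones, phones[1:]):
        bigram_counts[a][b] += 1

def prob(next_phone, current_phone):
    """Conditional probability P(next_phone | current_phone)."""
    total = sum(bigram_counts[current_phone].values())
    return bigram_counts[current_phone][next_phone] / total if total else 0.0

print("P(t | s) =", prob("t", "s"))  # high: /s/ is usually followed by /t/ here
print("P(d | s) =", prob("d", "s"))  # zero: /sd/ never occurs in this toy data
```

The same counting logic, scaled up to millions of utterances and generalized to longer contexts, is what historical n‐gram recognizers relied on, and it is the kind of pattern that recurrent neural networks now learn implicitly.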

In neuroscience, the theme of prediction comes up in masking and perceptual restoration experiments. One remarkable ECoG study, by Leonard et al. (2016), played subjects recordings of words in which key phonemes were masked by noise. For example, a subject might have heard /fæ#tr/, where the /#/ symbol represents a brief noise burst masking the underlying phoneme. In this example, the intended word is ambiguous: it could have been /fæstr/ 'faster' or /fæktr/ 'factor.' By controlling the context in which the stimulus was presented, Leonard et al. (2016) were able to lead subjects to hear one word or the other. In the sentence 'On the highway he drives his car much /fæ#tr/,' we expect the listener to perceive the word 'faster' /fæstr/. In another sentence, that expectation was modified so that subjects perceived the same noisy segment of speech as 'factor' /fæktr/. Leonard et al. (2016) then used a technique called stimulus reconstruction, by which it is possible to infer rather good speech spectrograms from intracranial recordings (Mesgarani et al., 2008; Pasley et al., 2012). Spectrograms reconstructed from the masked stimuli showed that the STG had filled in the missing auditory representations (Figure 3.9). For example, when the context led subjects to perceive the ambiguous stimulus as 'faster' /fæstr/, the reconstructed spectrogram contained an imagined fricative [s] (Figure 3.9, panel E). When subjects perceived the word as 'factor' /fæktr/, the reconstructed spectrogram contained an imagined stop [k] (Figure 3.9, panel F). In this way, Leonard et al. (2016) demonstrated that auditory representations of speech are sensitive to their temporal context.

In addition to filling in missing phonemes, the idea of temporal prediction can be invoked as an explanation of how the auditory system accomplishes one of its most difficult feats: selective attention. Selective attention is often called the cocktail party problem, because many people have experienced using selective attention at a busy, noisy party to isolate one speaker's voice from the cacophonous mixture of many. Mesgarani and Chang (2012) simulated this cocktail party experience (unfortunately without the cocktails) by simultaneously playing two speech recordings to their subjects, one in each ear. The subjects were asked to attend to the recording presented to a specific ear, and ECoG was used to record neural responses from the STG. Using the same stimulus‐reconstruction technique as Leonard et al. (2016), Mesgarani and Chang (2012) took turns reconstructing the speech that was played to each ear. Despite the fact that acoustic energy entered both ears and presumably propagated up the subcortical pathway, Mesgarani and Chang (2012) found that, once the neural processing of the speech streams had reached the STG, only the attended speech stream could be reconstructed; to the STG, it was as if the unattended stream did not exist.

We know from a second cocktail party experiment (which again did not include any actual cocktails) that selective attention is sensitive to how familiar the hearer is with each speaker. In their behavioral study, Johnsrude et al. (2013) recruited a group of subjects that included married couples. If you were a subject in the study, your partner's voice was sometimes the target (i.e. the attended speech), sometimes the distractor (i.e. the unattended speech), and sometimes both target and distractor voices belonged to other subjects' spouses. Johnsrude et al. (2013) found that not only were subjects better at recalling semantic details of the attended speech when the target speaker was their partner, but they also performed better when their spouse played the role of distractor, compared to when both target and distractor roles were played by strangers.

Figure 3.9  The human brain reinstates missing auditory representations. (a) and (b) show spectrograms for two words, faster /fæstr/ and factor /fæktr/. The segments of the spectrograms for /s/ and /k/ are indicated by dashed lines. The arrow in (a) points to aperiodic energy in higher‐frequency bands associated with fricative sounds like [s], which is absent in (b). (c) and (d) show neural reconstructions when subjects heard (a) and (b). (e) and (f) show neural reconstructions when subjects heard the masked stimulus /fæ#tr/. In (e), subjects heard On the highway he drives his car much /fæ#tr/, which caused them to interpret the masked segment as /s/. In (f), the context suggested that the masked segment should be /k/. Source: Leonard et al., 2016. Licensed under CC BY 4.0.
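As an aside for readers who want to picture what stimulus reconstruction involves computationally, the sketch below fits a regularized linear map from neural responses back to spectrogram bins. Everything here is a random placeholder; published reconstruction models also typically include time‐lagged copies of the neural signals, which we omit for brevity.

```python
# A minimal sketch of a decoding (stimulus-reconstruction) model:
# learn a linear map from neural activity to spectrogram bins.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_times, n_electrodes, n_freq_bins = 2000, 64, 32

neural = rng.standard_normal((n_times, n_electrodes))       # stand-in ECoG features
spectrogram = rng.standard_normal((n_times, n_freq_bins))   # stand-in target spectrogram

split = int(0.8 * n_times)            # train on the first 80% of time points
decoder = Ridge(alpha=1.0)
decoder.fit(neural[:split], spectrogram[:split])

reconstruction = decoder.predict(neural[split:])
print(reconstruction.shape)           # (400, 32): a reconstructed spectrogram segment
```

With real data, comparing such reconstructions for masked versus unmasked stimuli is what reveals whether the cortex has filled in the missing segment.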

In effect, Johnsrude et al. (2013) amusingly showed that people are better at ignoring their own spouses than they are at ignoring strangers. Given that hearers can fill in missing information when it can be predicted from context (Leonard et al., 2016), it makes sense that subjects should comprehend the speech of someone familiar, whom they are better at predicting, more easily than the speech of a stranger.

Given that native speakers are better than nonnative speakers at suppressing the sound of their own voices (Parker Jones et al., 2013), it also makes sense that subjects should be better able to suppress the voice of their spouse – again assuming that their spouse's voice is more predictable to them than a stranger's. Taken together, these findings suggest that the mechanism behind selective attention is, again, prediction. So, while Mesgarani and Chang (2012) may be unable to reconstruct the speech of a distractor voice from ECoG recordings in the STG, it may be that higher brain regions will nonetheless contain a representation of the distractor voice for the purpose of suppressing it. An as yet unproven hypothesis is that the increased neural activity in frontal areas, observed during noisy listening conditions (Davis & Johnsrude, 2003), may be busy representing background noise or distractor voices, so that these sources may be filtered out of the mixed input signal. One way to test this may be to replicate Mesgarani and Chang's (2012) cocktail party study, but with the focus on reconstructing speech from ECoG recordings taken from the auxiliary speech comprehension areas described by Davis and Johnsrude (2003) rather than from the STG. In the next and final section, we turn from sounds to semantics and to the representation of meaning in the brain.

Semantic representations

Following a long tradition in linguistics that goes back to Saussure (1989), speech may be thought of as a pairing of sound and meaning. In this chapter, our plan has been to follow the so‐called chain of speech (linking articulation, acoustics, and audition) deep into the brain systems involved in comprehending speech (cochlea, subcortical pathways, primary auditory cortex, and beyond). We have asked how the brain represents speech at each stage and even how speech representations are dynamically linked in a network of brain regions, but we have not talked yet about meaning. This was largely dictated by necessity: much more is known about how the brain represents sound than meaning. Indeed, it can even be difficult to pin down what meaning means.

In this section, we will focus on a rather narrow kind of meaning, which linguists refer to as semantics and which should be kept distinct from another kind of meaning called pragmatics. Broadly speaking, semantics refers to literal meaning (e.g. 'It is cold in here' as a comment on the temperature of the room), whereas pragmatics refers to meaning in context ('It is cold in here' as an indirect request that someone close the window). It may be true that much of what is interesting about human communication is contextual (we are social animals after all), but we shall have our hands full trying to come to grips with even a little bit of how the brain represents the literal meaning of words (lexical semantics). Moreover, the presentation we give here views lexical semantics from a relatively new perspective grounded in the recent neuroscience and machine‐learning literatures, rather than in the linguistic (and philosophical) tradition of formal semantics (e.g. Aloni & Dekker, 2016). This is important because many established results in formal semantics have yet to be explained neurobiologically.

For future neurobiologists of meaning, there will be many important discoveries to be made.

Embodied meaning

Despite the difficulty of comprehending the totality of what an example of speech might mean to your brain, there are some relatively easy places to begin. One kind of meaning a word might have, for instance, relates to the ways in which you experience that word. Take the word 'strawberry.' Part of the meaning of this word is the shape and vibrant color of strawberries that you have seen. Another is how a strawberry smells, and how it feels in your mouth when you eat one. To a first approximation, we can think of the meaning of the word 'strawberry' as the set of associated images, colors, smells, tastes, and other sensations that it can evoke. This is a very useful operational definition of "meaning" because it is, to an extent, possible to decode brain responses in sensory and motor areas and test whether these areas are indeed activated by words in the ways that we might expect, given the words' meanings.

To take a concrete example of how this approach can be used to distinguish the meanings of two words, consider 'kick' and 'lick': they differ by only one phoneme, /k/ versus /l/. Semantically, however, the words differ substantially, including, for example, in the part of the body that they are associated with: the foot for 'kick' and the tongue for 'lick.' Since, as we know, the sensorimotor cortex contains a map of the body, the so‐called homunculus (Penfield & Boldrey, 1937), with the foot and tongue areas at opposite ends, the embodied view of meaning would predict that hearing the word 'kick' should activate the foot area, which is located near the very top of the head, along the central sulcus on the medial surface of the brain, whereas the word 'lick' should activate the tongue area, on the lateral surface almost all the way down the central sulcus to the Sylvian fissure. And indeed, these predictions have now been verified over a series of experiments (Pulvermüller, 2005): when you hear a word like 'kick' or 'lick,' not only does your brain represent the sounds of these words through the progression of acoustic, phonetic, and phonological representations in the hierarchy of auditory‐processing centers discussed in this chapter, but it also represents the meaning of these words across a network of associations that certainly engages your sensory and motor cortices and, as we shall see, many other cortical regions too. The 'kick'/'lick' result is of fundamental importance because it gives us a leg up, so to speak, on the very difficult problem of trying to understand the representation of semantics in the brain.

Of course, not all words are grounded in embodied semantics in the same way. For example, some words are abstract. Consider the word 'society.' Questions like "What does a society taste like?" or even "What does a society look like?" are difficult to answer, because societies are not the kinds of things that we taste or see. Societies are not like strawberries. But even abstract words like 'society' may contain embodied semantics that become apparent when we consider the ways in which metaphors link abstract concepts with concretely experienced objects (Lakoff & Johnson, 1980). One feature of societies, we might assert, is that they have insides and outsides.

In this respect, they are like a great many objects that we experience directly: cups, bowls, and rooms. Therefore, it may be hypothesized that even abstract words such as 'society' could have predictable effects on the sensorimotor system. Brain areas such as the insula that respond to the physical disgust of fetid smells, for example, also respond to the social disgust of seeing an appalled look on someone else's face (Wicker et al., 2003). There are limits, however, to the embodied view of meaning. Function words such as conjunctions and prepositions are more difficult to associate with concrete experiences. As we have described it, the approach is also limited to finding meaning in the sensorimotor systems, which is unsatisfying because it ignores large swathes of the brain. In the next subsection, we turn to a more ambitious, if abstract, way of mapping the meanings of words, one that is not confined to the sensorimotor systems.

Vector representations and encoding models

One difficulty in studying meaning is that "meaning" can be challenging to define. If you ask what the word 'strawberry' means, we might point at a strawberry. If we know the activity in your visual system that is triggered by looking at a strawberry, then we can point to similar activity patterns in your visual system, evoked when you think of the word 'strawberry,' as another kind of meaning. You might imagine that it is harder to point to just any part of the brain and ask of its current state, "Is this a representation of 'strawberry'?" But it is not impossible. In this subsection, we will, in as informal a way as possible, introduce the ideas of vector representations of words and of encoding models for identifying the neural representations of those vectors.

Generally speaking, an encoding model aims to predict how the brain will respond to a stimulus. Encoding models contrast with decoding models, which aim to do the opposite: to guess which stimulus caused a given brain response. The spectrogram‐reconstruction method mentioned in a previous section is an example of a decoding model (Mesgarani et al., 2008). An encoding model of sound, by contrast, would try to predict the neural response to an audio recording. In a landmark study of semantic encoding, Mitchell et al. (2008) were able to predict fMRI responses to the meanings of concrete nouns, like 'celery' and 'airplane.' Unlike studies of embodied meaning, Mitchell et al. (2008) were able to predict neural responses that were not limited to the sensorimotor systems. For instance, they predicted accurate word‐specific neural responses across the bilateral occipital and parietal lobes, the fusiform and middle frontal gyri, and sensory cortex; the left inferior frontal gyrus; and the medial frontal gyrus and anterior cingulate (see Figure 3.6 for reference; Mitchell et al., 2008). These encoding results expand the set of regions over which the meaning of a word might be distributed, to nonsensory systems like the anterior cingulate. An even greater expansion of these semantic regions can be found in more recent work (Huth et al., 2016).

So how does an encoding model work? The model uses linear regression to map from a vector representation of a word to the intensity of a single voxel measured during an fMRI scan (and representing the activity in a bit of brain).

This approach can be generalized to fit multiple voxels (representing the whole brain), and the model can be trained on a subset of word embeddings and brain scans before being tested on unseen data, in order to evaluate its ability to generalize beyond the words it was trained on. But what do vector representation and word embedding mean? This field is rather technical and jargon rich, but the key ideas are relatively easy to grasp. Vector representations, or word embeddings, represent each word by a vector: effectively, a list of numbers. Similarly, brain states can be quantified by vectors, or lists of numbers, that represent the amount of activity seen in each voxel. Once we have these vectors, using linear‐regression methods to try to identify relationships that map one onto the other is mathematically quite straightforward. So the maths is not difficult and the brain‐activity vectors are measurable by experiment, but how do we obtain suitable vector representations for each word that we are interested in?

Let us assume a vocabulary of exactly four words:

1. airplane
2. boat
3. celery
4. strawberry

One way to encode each of these as a list of numbers is to simply assign one number to each word: 'airplane' = [1], 'boat' = [2], 'celery' = [3], and 'strawberry' = [4]. We have enclosed the numbers in square brackets to mean that these are lists (note that it is possible to have only one item in a list). A good thing about this encoding of the words, as lists of numbers, is that the resulting lists are short and easy to decode: we only have to look them up in our memory or in a table. But this encoding does not do a very good job of capturing the differences in meaning between the words. For example, 'airplane' and 'boat' are both manufactured vehicles that you could ride inside, whereas 'celery' and 'strawberry' are both edible parts of plants. A more involved semantic coding might make use of all of these descriptive features to produce the representations shown in Table 3.1, where a 1 has been placed under a semantic description if the word along the row satisfies it. For example, an airplane is manufactured, so the first number in its list is 1, but 'celery,' even if grown by humans, is not manufactured, so the first number in its list is 0. The full list for the word 'boat' is [1, 1, 1, 0, 0], which is five numbers long.

Is this a good encoding? It is certainly longer than the previous encoding (boat = [2]), and unlike the previous code it no longer distinguishes 'airplane' from 'boat' (both have identical five‐number codes). Finally, the codes are redundant in the sense that, as far as a linear‐regression model is concerned, representing the word 'boat' as [1, 1, 1, 0, 0] is no more expressive than representing it as [1, 0]. Still, we might like the more verbose listing, since we can interpret the meaning of each number, and we can solve the problem of 'airplane' not differing from 'boat' by adding another number to the list. That is, if we represented the words with six‐number lists, then 'airplane' and 'boat' could be distinguished: airplane = [1, 1, 1, 0, 0, 0] and boat = [1, 1, 1, 0, 0, 1]. Now the last number of airplane is a 0 and the last number of boat is a 1.

Table 3.1  Semantic‐field encodings for four words.

Word          Manufactured   Vehicle   Ride inside   Edible   Plant part
airplane      1              1         1             0        0
boat          1              1         1             0        0
celery        0              0         0             1        1
strawberry    0              0         0             1        1

So far, our example may seem tedious and somewhat arbitrary: we had to come up with attributes such as "manufactured" or "edible," and then consider their merit as semantic feature dimensions without any obvious objective criteria. However, there are many ways to search automatically for word embeddings without needing to dream up a large set of semantic fields. An incrementally more complex way is to rely on the context words with which each of our target words occurs in a corpus of sentences. Consider a corpus that contains exactly four sentences:

1. The boy rode on the airplane.
2. The boy also rode on the boat.
3. The celery tasted good.
4. The strawberry tasted better.

Our target words are, again, 'airplane,' 'boat,' 'celery,' and 'strawberry.' The context words are 'also,' 'better,' 'boy,' 'good,' 'on,' 'rode,' 'tasted,' and 'the' (ignoring capitalization). If we create a table with target words in rows and context words in columns, we can count how many times each context word occurs in a sentence with each target word. This will produce a new set of word embeddings (Table 3.2). Unlike the previous semantic‐field embeddings, which were constructed using our "expert opinions," these context‐word embeddings were learned from data (a corpus of four sentences). Learning a set of word embeddings from data can be very powerful. Indeed, we can automate the procedure, and even a modest computer can process very large corpora of text to produce embeddings for hundreds of thousands of words in seconds. Another strength of creating word embeddings like these is that the procedure is not limited to concrete nouns, since context words can be found for any target word – whether an abstract noun, a verb, or even a function word. You may be wondering how context words are able to represent meaning, but notice that words with similar meanings are bound to co‐occur with similar context words. For example, an 'airplane' and a 'boat' are both vehicles that you ride in, so they will both occur quite frequently in sentences with the word 'rode'; however, one will rarely find sentences that contain both 'celery' and 'rode.' Compared to 'airplane' and 'boat,' 'celery' is more likely to occur in sentences containing the word 'tasted.' As the English phonetician Firth (1957, p. 11) wrote: "You shall know a word by the company it keeps."

Table 3.2  Context‐word encodings of four words.

Word          also   better   boy   good   on   rode   tasted   the
airplane      0      0        1     0      1    1      0        2
boat          1      0        1     0      1    1      0        2
celery        0      0        0     1      0    0      1        1
strawberry    0      1        0     0      0    0      1        1
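The counting procedure behind Table 3.2 is easy to automate; the short sketch below reproduces the table's counts from the four‐sentence corpus (only the variable names are our own).

```python
# Build context-word count vectors (word embeddings) from the toy corpus.
from collections import Counter

corpus = [
    "The boy rode on the airplane.",
    "The boy also rode on the boat.",
    "The celery tasted good.",
    "The strawberry tasted better.",
]
targets = ["airplane", "boat", "celery", "strawberry"]

embeddings = {}
for target in targets:
    counts = Counter()
    for sentence in corpus:
        words = sentence.lower().rstrip(".").split()
        if target in words:
            counts.update(w for w in words if w != target)  # count the other words
    embeddings[target] = counts

print(embeddings["boat"])
# Counter({'the': 2, 'boy': 1, 'also': 1, 'rode': 1, 'on': 1})
```

Scaling the same loop up to a much larger corpus (and normalizing the counts) is, in essence, how classical count‐based word embeddings are built.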

With a reasonable vector representation for words like these, one can begin to see how it may be possible to predict the brain activation for word meanings (Mitchell et al., 2008). Start with a fairly large set of words and their vector representations, and record the brain activity they evoke. Put aside some of the words (including perhaps the word 'strawberry') and use the remainder as a training set in order to find the best linear equation that maps from word vectors to patterns of brain activation. Finally, use that equation to predict what the brain activation should have been for the words you held back, and test how similar the predicted brain activation is to the one that is actually observed, and whether the activation pattern for 'strawberry' is indeed more similar to that of 'celery' than to that of 'boat.' One similarity measure commonly used for this sort of problem is the cosine similarity, which can be defined for two vectors p and q according to the following formula:

\[
\mathrm{similarity}(\vec{p},\vec{q}) \;=\; \frac{\vec{p}\cdot\vec{q}}{\sqrt{\sum_i p_i^2}\,\sqrt{\sum_i q_i^2}}
\]

Now, if we plug the context‐word embeddings for each pair of words from our four‐word set into this equation, we end up with the similarity scores shown in Table 3.3. Note that numbers closer to 1 mean more similar and numbers closer to 0 mean more dissimilar; a score of exactly 1 arises when we compare any word embedding with itself. Note also that we have only populated the diagonal and upper triangle of the table, because the lower triangle is a mirror image of the upper one, and therefore redundant. As expected, the words 'airplane' and 'boat' received a very high similarity score (0.94), whereas 'airplane' and 'celery,' for example, received a lower similarity score (0.44). The score for 'celery' and 'strawberry' (0.67) is, in turn, higher than any of the scores between a food word and a vehicle word. Summary statistics such as these are quick and easy to compute, even for very long lists of numbers. Exploring them also helps to build an intuition about how encoding models, such as those of Mitchell et al. (2008), represent the meanings of words, and thus about what the brain maps they discover represent.

Table 3.3  Cosine similarities between four words.

              airplane   boat   celery   strawberry
airplane      1          0.94   0.44     0.44
boat          –          1      0.41     0.41
celery        –          –      1        0.67
strawberry    –          –      –        1
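The values in Table 3.3 can be checked directly from the vectors in Table 3.2 with a few lines of Python (the variable names are ours):

```python
# Cosine similarities between the context-word vectors of Table 3.2.
import numpy as np

# Vector components follow the column order:
# also, better, boy, good, on, rode, tasted, the.
vectors = {
    "airplane":   np.array([0, 0, 1, 0, 1, 1, 0, 2]),
    "boat":       np.array([1, 0, 1, 0, 1, 1, 0, 2]),
    "celery":     np.array([0, 0, 0, 1, 0, 0, 1, 1]),
    "strawberry": np.array([0, 1, 0, 0, 0, 0, 1, 1]),
}

def cosine(p, q):
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

print(round(cosine(vectors["airplane"], vectors["boat"]), 2))      # 0.94
print(round(cosine(vectors["airplane"], vectors["celery"]), 2))    # 0.44
print(round(cosine(vectors["boat"], vectors["celery"]), 2))        # 0.41
print(round(cosine(vectors["celery"], vectors["strawberry"]), 2))  # 0.67
```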

Specifically, Firth's (1957) idea that the company a word keeps can be used to build up a semantic representation of that word has recently had a profound impact on the study of semantics, especially in the computational fields of natural language processing and machine learning (including deep learning). Mitchell et al.'s (2008) landmark study bridged natural language processing and neuroscience in a way that, at the time of writing, still provides common ground for the two fields. Not only do we expect words that belong to similar semantic domains to co‐occur with similar context words, but, if the brain is capable of statistical learning, as many believe, then this is exactly the kind of pattern we should expect to find encoded in neural representations.

To summarize, we have only begun to scratch the surface of how linguistic meaning is represented in the brain. But figuring out what the brain is doing when it interprets speech is so important, and so mysterious, that we have tried to illustrate a few recent innovations in enough detail that the reader may begin to imagine how to go further. Embodied meaning, vector representations, and encoding models are not the only ways to study semantics in the brain. They do, however, benefit from engaging with other areas of neuroscience, touching for example on the homunculus map in the somatosensory cortex (Penfield & Boldrey, 1937). It is less clear, at the moment, how to extend these results from lexical to compositional semantics, and a more complete neural understanding of pragmatics will also be needed. Much work remains to be done. Because spoken language combines both sound and meaning, a full account of speech comprehension should explain how meaning is coded by the brain. We hope that readers will feel inspired to contribute the next exciting chapters in this endeavor.
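Finally, as a parting illustration of the encoding‐model recipe described in this section (fit a linear map from word vectors to voxel responses on most of the words, then test it on held‐out words), here is a minimal sketch. The word vectors and brain responses are random placeholders, and the evaluation is deliberately simplified; it mirrors the spirit, not the exact procedure, of Mitchell et al. (2008).

```python
# A minimal encoding-model sketch: predict voxel activity from word vectors.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_words, n_features, n_voxels = 60, 25, 500

word_vectors = rng.standard_normal((n_words, n_features))    # one row per word
brain_responses = rng.standard_normal((n_words, n_voxels))   # one activity pattern per word

# Hold out the last two words; train a linear map on the rest.
train, test = slice(0, n_words - 2), slice(n_words - 2, n_words)
model = Ridge(alpha=1.0)
model.fit(word_vectors[train], brain_responses[train])

predicted = model.predict(word_vectors[test])

# Score each held-out word by the cosine similarity between its predicted
# and observed activity patterns (the same measure used above for words).
observed = brain_responses[test]
for i in range(predicted.shape[0]):
    p, o = predicted[i], observed[i]
    sim = p @ o / (np.linalg.norm(p) * np.linalg.norm(o))
    print(f"held-out word {i}: cosine(predicted, observed) = {sim:+.2f}")
```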

Conclusion

Our journey through the auditory pathway has finally reached the end. It was a substantial trip, through the ear and auditory nerve, brainstem and midbrain, and many layers of cortical processing. We have seen how, along that path, speech information is initially encoded by some 30,000 auditory nerve fibers firing hundreds of thousands of impulses a second, and how their activity patterns across the tonotopic array encode formants, while their temporal firing patterns encode temporal fine structure cues to pitch and voicing.

We have learned how, as these activity patterns then propagate and fan out over the millions of neurons of the auditory brainstem and midbrain, information from both ears is combined to add cues to sound‐source direction. Furthermore, temporal fine structure information gets recoded, so that temporal firing patterns at higher levels of the auditory brain no longer need to be read out with sub‐millisecond precision, and information about the pitch and timbre of speech sounds is instead encoded by a distributed and multiplexed firing‐rate code. We have seen that the neural activity patterns at levels up to and including the primary auditory cortex are generally thought to represent predominantly physical acoustic, or relatively low‐level psychoacoustic, features of speech sounds, and that this representation is then transformed into increasingly phonetic representations at the level of the STG, and into semantic representations as we move beyond the STG into frontal and parietal brain areas. Finally, we have seen how notions of embodied meaning, as well as of statistical learning, are shaping our thinking about how the brain represents the meaning of speech.

By the time they reach these meaning‐representing levels of the brain, the waves of neural activity racing up the auditory pathway will have passed through at least a dozen anatomical processing stations, each composed of between a few hundred thousand and hundreds of millions of neurons, and each richly and reciprocally interconnected both internally and with the previous and the next levels in the processing hierarchy. We hope readers will share our sense of awe when we consider that it takes a spoken word only a modest fraction of a second to travel through this entire stunningly intricate network and be transformed from sound wave to meaning.

Remember that the picture painted here, of a feed‐forward hierarchical network that transforms acoustics to phonetics to semantics, is a highly simplified one. It is well grounded in scientific evidence, but it is necessarily a rather selective telling of the story as we understand it to date. Recent years have been a particularly productive time in auditory neuroscience, as insights from animal research, human brain imaging, human patient data and ECoG studies, and artificial intelligence have begun to come together to provide the framework of understanding we have attempted to outline here. But many important details remain unknown, and, while we feel fairly confident that the insights and ideas presented here will stand the test of time, we must be aware that future work may not just complement and refine but even overturn some of the ideas that we currently put forward as our best approximations to the truth. One thing we are absolutely certain of, though, is that studying how human brains speak to each other will remain a profoundly rewarding intellectual pursuit for many years to come.

REFERENCES

Aloni, M., & Dekker, P. (2016). The Cambridge handbook of formal semantics. Cambridge: Cambridge University Press.

Ballard, D. H., Hinton, G. E., & Sejnowski, T. J. (1983). Parallel visual computation. Nature, 306, 21–26.

Baumann, S., Griffiths, T. D., Sun, L., et al. (2011). Orthogonal representation of sound dimensions in the primate midbrain. Nature Neuroscience, 14, 423–425. Belin, P., Zatorre, R. J., Lafaille, P., et al. (2000). Voice‐selective areas in human auditory cortex. Nature, 403, 309–312. Bizley, J. K., Walker, K. M., Silverman, B. W., et al. (2009). Interdependent encoding of pitch, timbre, and spatial location in auditory cortex. Journal of Neuroscience, 29, 2064–2075. Blakemore, S.‐J., Wolpert, D., & Frith, C. (2000). Why can't you tickle yourself? NeuroReport, 11, R11–R16. Bogen, J. E., & Bogen, G. (1976). Wernicke's region – where is it? Annals of the New York Academy of Sciences, 280, 834–843. Bouchard, K. E., Mesgarani, N., Johnson, K., & Chang, E. F. (2013). Functional organization of human sensorimotor cortex for speech articulation. Nature, 495, 327–332. Brosch, M., Selezneva, E., & Scheich, H. (2005). Nonauditory events of a behavioral procedure activate auditory cortex of highly trained monkeys. Journal of Neuroscience, 25, 6797–6806. Cheung, C., Hamilton, L. S., Johnson, K., & Chang, E. F. (2016). The auditory representation of speech sounds in human motor cortex. Elife, 5, e12577. Clements, G. N. (1985). The geometry of phonological features. Phonology, 2, 225–252. Clements, G. N. (1990). The role of the sonority cycle in core syllabification. Papers in Laboratory Phonology, 1, 283–333. Da Costa, S., van der Zwaag, W., Marques, J. P., et al. (2011). Human primary auditory cortex follows the shape of Heschl's gyrus. Journal of Neuroscience, 31, 14067–14075. Daunizeau, J., David, O., & Stephan, K. E. (2011). Dynamic causal modelling: A critical review of the biophysical and statistical foundations. NeuroImage, 58, 312–322.

Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23, 3423–3431. Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904. Dean, I., Harper, N., & McAlpine, D. (2005). Neural population coding of sound level adapts to stimulus statistics. Nature Neuroscience, 8, 1684–1689. Delgutte, B. (1997). Auditory neural processing of speech. In W. J. Hardcastle & J. Laver (Eds.), The handbook of phonetic sciences (pp. 507–538). Oxford: Blackwell. Edeline, J. M., Pham, P., & Weinberger, N. M. (1993). Rapid development of learning‐induced receptive field plasticity in the auditory cortex. Behavioral Neuroscience, 107, 539–551. Eliades, S. J., & Wang, X. (2008). Neural substrates of vocalization feedback monitoring in primate auditory cortex. Nature, 453, 1102–1106. Engineer, C. T., Perez, C. A., Chen, Y. H., et al. (2008). Cortical activity patterns predict speech discrimination ability. Nature Neuroscience, 11, 603–608. Ferry, R. T., & Meddis, R. (2007). A computer model of medial efferent suppression in the mammalian auditory system. Journal of the Acoustical Society of America, 122, 3519–3526. Firth, J. (1957). Papers in linguistics, 1934– 1951. Oxford: Oxford University Press. Flinker, A., Chang, E. F., Kirsch, H. E., et al. (2010). Single‐trial speech suppression of auditory cortex activity in humans. Journal of Neuroscience, 30, 16643–16650. Fowler, C. A. (1986). An event approach to the study of speech perception from a direct‐realist perspective. Journal of Phonetics, 14, 3–28. Frisina, R. D. (2001). Subcortical neural coding mechanisms for auditory temporal processing. Hearing Research, 158(1–2), 1–27.

Friston, K. J., Harrison, L., & Penny, W. (2003). Dynamic causal modelling. NeuroImage, 19(4), 1273–1302. Friston, K., & Kiebel, S. (2009). Predictive coding under the free‐energy principle. Philosophical Transactions of the Royal Society B: Biological Sciences, 364, 1211–1221. Fritz, J., Shamma, S., Elhilali, M., & Klein, D. (2003). Rapid task‐related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nature Neuroscience, 6, 1216–1223. Garofolo, J. S., Lamel, L. F., Fisher, W. M., et al. (1993). TIMIT Acoustic‐Phonetic Continuous Speech Corpus. Linguistic Data Consortium, from https://catalog.ldc.upenn.edu/LDC93S1. Golding, N. L., & Oertel, D. (2012). Synaptic integration in dendrites: Exceptional need for speed. Journal of Physiology, 590, 5563–5569. Graves, A., & Jaitly, N. (2014). Towards end‐to‐end speech recognition with recurrent neural networks. In ICML'14: Proceedings of the 31st International Conference on Machine Learning, 32(2), 1764–1772. Greenberg, S. (2006). A multi‐tier framework for understanding spoken language. In S. Greenberg & W. A. Ainsworth (Ed.), Listening to speech: An auditory perspective (pp. 411–430). Mahwah, NJ: Lawrence Erlbaum. Grinn, S. K., Wiseman, K. B., Baker, J. A., & Le Prell, C. G. (2017). Hidden hearing loss? No effect of common recreational noise exposure on cochlear nerve response amplitude in humans. Frontiers in Neuroscience, 11, 465. Grose, J. H., Buss, E., & Hall, J. W. (2017). Loud music exposure and cochlear synaptopathy in young adults: Isolated auditory brainstem response effects but no perceptual consequences. Trends in Hearing, 21, 1–18. Heinz, M. G., Colburn, H. S., & Carney, L. H. (2002). Quantifying the implications of nonlinear cochlear tuning for

auditory‐filter estimates. Journal of the Acoustical Society of America, 111, 996–1011. Hickok, G., & Poeppel, D. (2004). Dorsal and ventral streams: A framework for understanding aspects of the functional anatomy of language. Cognition, 92, 67–99. Holst, E. von, & Mittelstaedt, H. (1950). Das Reafferenzprinzip. Naturwissenschaften, 37, 464–476. Humphries, C., Liebenthal, E., & Binder, J. R. (2010). Tonotopic organization of human auditory cortex. NeuroImage, 50, 1202–1211. Huth, A. G., De Heer, W. A., Griffiths, T. L., et al. (2016). Natural speech reveals the semantic maps that tile the human cerebral cortex. Nature, 532, 453–458. Jerison, H. J. (1973). Evolution of the brain and intelligence. New York: Academic Press. Johnsrude, I. S., Mackey, A., Hakyemez, H., et al. (2013). Swinging at a cocktail party: Voice familiarity aids speech perception in the presence of a competing voice. Psychological Science, 24, 1995–2004. Joris, P. X., Smith, P. H., & Yin, T. C. T. (1998). Coincidence detection in the auditory system: 50 years after Jeffress. Neuron, 21, 1235–1238. Jurafsky, D., & Martin, J. H. (2014). Speech and language processing. London: Pearson. Kawato, M., Hayakawa, H., & Inui, T. (1993). A forward‐inverse optics model of reciprocal connections between visual cortical areas. Network: Computation in Neural Systems, 4, 415–422. Kujawa, S. G., & Liberman, M. C. (2015). Synaptopathy in the noise‐exposed and aging cochlea: Primary neural degeneration in acquired sensorineural hearing loss. Hearing Research, 330, 191–199. Lakoff, G., & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago Press.

Leonard, M. K., Baud, M. O., Sjerps, M. J., & Chang, E. F. (2016). Perceptual restoration of masked speech in human cortex. Nature Communications, 7, 13619. Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert‐Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431–461. Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36. Meddis, R., & O'Mard, L. P. (2005). A computer model of the auditory‐nerve response to forward‐masking stimuli. Journal of the Acoustical Society of America, 117, 3787–3798. Mesgarani, N., & Chang, E. F. (2012). Selective cortical representation of attended speaker in multitalker speech perception. Nature, 485, 233–236. Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343, 1006–1010. Mesgarani, N., David, S. V., Fritz, J. B., & Shamma, S. A. (2008). Phoneme representation and classification in primary auditory cortex. Journal of the Acoustical Society of America, 123, 899–909. Mitchell, T. M., Shinkareva, S. V., Carlson, A., et al. (2008). Predicting human brain activity associated with the meanings of nouns. Science, 320, 1191–1195. Müller‐Preuss, P., & Ploog, D. (1981). Inhibition of auditory cortical neurons during phonation. Brain Research, 215, 61–76. Mumford, D. (1992). On the computational architecture of the neocortex. Biological Cybernetics, 66, 241–251. Nelken, I., Bizley, J. K., Nodal, F. R., et al. (2008). Responses of auditory cortex to complex stimuli: Functional organization revealed using intrinsic optical signals. Journal of Neurophysiology, 99, 1928–1941. Parker Jones, O., Seghier, M. L., Duncan, K. J. K., et al. (2013). Auditory–motor interactions for the production of native

and nonnative speech. Journal of Neuroscience, 33, 2376–2387. Pasley, B. N., David, S. V., Mesgarani, N., et al. (2012). Reconstructing speech from human auditory cortex. PLOS Biology, 10, e1001251. Penfield, W., & Boldrey, E. (1937). Somatic motor and sensory representation in the cerebral cortex of man as studied by electrical stimulation. Brain, 60, 389–443. Price, C. J. (2012). A review and synthesis of the first 20years of PET and fMRI studies of heard speech, spoken language and reading. NeuroImage, 62, 816–847. Prothero, J. W., & Sundsten, J. W. (1984). Folding of the cerebral cortex in mammals. Brain, Behavior and Evolution, 24, 152–167. Pulvermüller, F. (2005). Brain mechanisms linking language and action. Nature Reviews Neuroscience, 6, 576–582. Rabinowitz, N. C., Willmore, B. D. B., King, A. J., & Schnupp, J. W. H. (2013). Constructing noise‐invariant representations of sound in the auditory pathway. PLOS biology, 11, e1001710. Rao, R. P., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra‐classical receptive‐field effects. Nature Neuroscience, 2, 79–87. Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12, 718–724. Saussure, F. (1989). Cours de linguistique générale. Wiesbaden: Otto Harrassowitz. Schnupp, J. W., & Carr, C. E. (2009). On hearing with more than one ear: Lessons from evolution. Nature Neuroscience, 12, 692–697. Schnupp, J. W. H., Garcia‐Lazaro, J. A., & Lesica, N. A. (2015). Periodotopy in the gerbil inferior colliculus: Local clustering rather than a gradient map. Frontiers in Neural Circuits, 9, 37. Schreiner, C. E., & Langner, G. (1988). Periodicity coding in the inferior

96  Sensing Speech colliculus of the cat II: Topographical organization. Journal of Neurophysiology, 60, 1823–1840. Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123, 2400–2406. Stevens, K. N. (1960). Toward a model for speech recognition. Journal of the Acoustical Society of America, 32, 47–55. Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111, 1872–1891. Stuart, A., & Phillips, D. P. (1996). Word recognition in continuous and interrupted broadband noise by young normal‐hearing, older normal‐hearing, and presbyacusic listeners. Ear and Hearing, 17, 478–489. Sumner, C. J., Lopez‐Poveda, E. A., O’Mard, L. P., & Meddis, R. (2002). A revised model of the inner‐hair cell and auditory‐nerve complex. Journal of the Acoustical Society of America, 111, 2178–2188. Tremblay, P., & Dick, A. S. (2016). Broca and Wernicke are dead, or moving past the classic model of language neurobiology. Brain and Language, 162, 60–71. Walker, K. M., Bizley, J. K., King, A. J., & Schnupp, J. W. (2011). Multiplexed and robust representations of sound features in auditory cortex. Journal of Neuroscience, 31, 14565–14576.

Wernicke, C. (1874). Der aphasische Symptomencomplex: eine psychologische Studie auf anatomischer Basis. Breslau: Max Cohn & Weigert. Wicker, B., Keysers, C., Plailly, J., et al. (2003). Both of us disgusted in my insula: The common neural basis of seeing and feeling disgust. Neuron, 40, 655–664. Willmore, B. D. B., Schoppe, O., King, A. J., et al. (2016). Incorporating midbrain adaptation to mean sound level improves models of auditory cortical processing. Journal of Neuroscience, 36, 280–289. Wolpert, D. M., Ghahramani, Z., & Flanagan, J. R. (2001). Perspectives and problems in motor learning. Trends in Cognitive Sciences, 5, 487–494. Zhang, X., & Carney, L. H. (2005). Analysis of models for the synapse between the inner hair cell and the auditory nerve. Journal of the Acoustical Society of America, 118, 1540–1553. Zhang, X., Heinz, M. G., Bruce, I. C., & Carney, L. H. (2001). A phenomenological model for the responses of auditory-nerve fibers: I. Nonlinear tuning with compression and suppression. Journal of the Acoustical Society of America, 109, 648–670. Zilany, M. S. A., Bruce, I. C., & Carney, L. H. (2014). Updated parameters and expanded simulation options for a model of the auditory periphery. Journal of the Acoustical Society of America, 135, 283–286.

4 Perceptual Control of Speech

K. G. MUNHALL,1 ANJA-XIAOXING CUI,2 ELLEN O'DONOGHUE,3 STEVEN LAMONTAGNE,1 AND DAVID LUTES1

1 Queen's University, Canada; 2 University of British Columbia, Canada; 3 University of Iowa, United States

There is broad agreement that the American socialite Florence Foster Jenkins was a terrible singer. Her voice was frequently off‐key and her vocal range did not match the pieces she performed. The mystery is how she could not have known this. However, many – including her depiction in the eponymous film directed by Stephen Frears – think it likely that she was unaware of how poorly she sang. The American mezzosoprano Marilyn Horne offered this explanation. “I would say that she maybe didn’t know. First of all, we can’t hear ourselves as others hear us. We have to go by a series of sensations. We have to feel where it is” (Huizenga, 2016). This story about Jenkins contains many of the key questions about the topic of this chapter, the perceptual control of speech. Like singing, speech is governed by a control system that requires sensory information about the effects of its actions, and the major source of this sensory feedback is the auditory system. However, the speech we hear is not what others hear and yet we are able to control our speech motor system in order to produce what others need or expect to hear. For both speech and singing, much is unknown about the auditory‐motor control system that accomplishes this. What role does hearing your voice play in error detection and correction? How does this auditory feedback processing differ from how others hear you? What role does hearing your voice play in learning to speak? Human spoken language has traditionally been studied by two separate com­ munities (Meyer, Huettig, & Levelt, 2016): those including the majority of contrib­ utors to this volume who study the perception of speech signals produced by others and those who study the production of the speech signal itself. It is the latter that is the focus of this chapter. More specifically, the chapter focuses on the


98  Sensing Speech processing of the rich sensory input accompanying talking, particularly hearing your own voice. As Marilyn Horne suggests, perceiving this auditory feedback is not the same as hearing others. Airborne speech sound certainly arrives at the speaker’s ear as it does at the ears of others, but for the speaker it is mixed with sound transmitted through the body (e.g. Békésy, 1949). A second difference bet­ ween hearing yourself and hearing others is neural rather than physical. The gen­ eration of action in speech and other movements is accompanied by information about the motor commands that is transmitted from the motor system to other parts of the brain that might need to know about the movement. One consequence of this distribution of copies of motor commands is that the sensory processing of the effects of a movement is different from the processing of externally generated sensory information (see Bridgeman, 2007, for a historical review). This chapter addresses a number of issues related to the perceptual control of speech production. We first examine the importance of hearing yourself speak through the study of natural and experimental deafening in humans and birds. This work is complemented by recent work involving real‐time manipulations of auditory feedback through rapid signal processing. Next, we review what is known about the neural processing of self‐produced sound. This includes work on corollary discharge or efference copy, as well as studies showing cortical sup­ pression during vocalizing. Finally, we address the topic of vocal learning and the general question about the relationship between speech perception and speech production. A small number of species including humans learn their vocal reper­ toire. It is important to understand the conditions that promote this learning and also to understand why this learning is so rare. Through all of our review, we will touch base with research on birdsong. Birdsong is the animal model of human vocal production. The literature on birdsong provides exciting new research direc­ tions as extensive projects on the genetic and neural underpinnings of vocal learning are carried out demonstrating remarkable similarity to human vocal behavior (Pfenning et al., 2014).

Perceptual feedback processing

The study of the perceptual control of spoken language is an investigation of how behavior is monitored by the actor. Feedback can be viewed as a general process wherein online performance is referenced to a target, goal, or intention. When deviations from these targets are detected, these errors are ideally corrected by the speaker. In language, such errors can take numerous forms. A speaker's meaning might be poorly formulated and misinterpreted by a listener; a single word might be substituted for another or words might be spoken out of order; single syllables or sounds might be dropped, emphasized incorrectly, or mispronounced. Monitoring of such language behavior is often broadly conceptualized according to a perceptual loop (Levelt, 1983). The perceptual loop model was designed to account for perceived errors at various levels of language production, and consists

of three phases: (1) self-interrupting when an error is detected, (2) pausing or introducing "editing" terms (um, uh, like), and (3) fixing the error. Two broad features differentiate such high-level error detection from other forms of target-based correction, as in speech production. First, language-error correction often interrupts the flow of output, while the same is not always true of compensation in response to auditory speech feedback perturbations. Second, language-error correction typically involves conscious awareness. This is inconsistent with speech feedback processing. Two bodies of literature – clinical studies of hearing loss and artificial laboratory perturbation studies – shed light on these unique features of speech feedback processing.

Deafness and perturbations of auditory feedback

Loss of hearing has a drastic impact on the acquisition of speech (Borden, 1979). From the first stage of babbling to adult articulation, speech in those who are profoundly hearing impaired has distinct acoustic and temporal speech characteristics. Canonical babbling is delayed in its onset and the number of well-formed syllables is markedly reduced even after clinical intervention through amplification (Oller & Eilers, 1988). Beyond babbling, Osberger and McGarr (1982) have summarized the patterns of speech errors in children who have significant hearing impairments. While the frequencies of errors (and hearing levels) varied between children, there were consistent atypical segmental productions including sound omissions, anomalous timing, and distortions of phonemes. These phonetic patterns are accompanied by inconsistent interarticulator coordination (McGarr & Harris, 1980). In addition, there are consistent suprasegmental issues in the population including anomalies of vocal pitch and vocal-quality control and inadequate intonation contours (Osberger & McGarr, 1982). These patterns of deficit most likely arise from the effects of deafness on both the perceptual learning of speech in general and the loss of auditory feedback in vocal learning. Data characterizing speech-production behavior at different ages of deafness onset could shed some light on the extent to which learning to perceive the sound system or learning to hear yourself produce sounds contributes to the reported deficits. However, there are minimal data on humans that provide a window onto the importance of hearing at different stages of vocal learning. Binnie, Daniloff, and Buckingham (1982) provide a case study of a five-year-old who abruptly lost hearing. The child showed modest changes immediately after deafness onset but, over the course of a year, the intelligibility of his speech declined due to distortions in segmental production and prosody. Notably, the child rarely deleted sounds and tended to prolong vowels perhaps to enhance kinesthetic feedback. While this case study is not strong evidence for the development of auditory feedback, it is noteworthy that the speech representations that govern fluent speech are well developed even at this young age. Speech quality does not immediately degrade.

100  Sensing Speech Age of onset of deafness of other postlingually deafened individuals, those who lost their hearing following the acquisition of speech, indicates that those deaf­ ened earlier in life deviated more from the general population than those deafened later in both suprasegmental and segmental characteristics (Waldstein,  1990). However, the patterns of speech deviation differed considerably from individual to individual in the small samples. The study of postlingually deafened individuals represents the best window onto the role played by auditory feedback in a well‐developed human control system. While the effects of hearing loss on speech are not immediate, both consonant and vowel errors emerge over time (Zimmermann & Rettaliata, 1981; Osberger & McGarr,  1982; Waldstein,  1990; Lane & Webster,  1991; Cowie & Douglas‐Cowie,  1992). Other effects on speech caused by a long‐term lack of auditory feedback are of a suprasegmental nature. These include a slower overall rate of speech, higher and more inconsistent pitch, overstressing syllables and words, and a greater mean intensity (Cowie, Douglas‐Cowie, & Kerr,  1982; Plant, 1984; Leder, Spitzer, & Kirchner, 1987; Waldstein, 1990; Lane & Webster, 1991; Cowie & Douglas‐Cowie, 1992). All of these factors contribute to the overall loss of intelligibility of speech. The study of vocal learning in birds has permitted more systematic studies of the effects of deafening at different ages of development. In a classic study by Lombardino and Nottebohm (2000) groups of zebra finches were deafened at intervals ranging from 81 days to six years. Changes in song were strongly corre­ lated with age of deafening. The songs of birds deafened earlier (e.g. at three months) deteriorated much more quickly (approximately a week) compared to birds deafened between two and five years. The latter took more than a year to show quantifiable deficits. In birds, invasive ablation studies have shown that the relationship between acquired song and auditory feedback is not simple. Anatomical studies have revealed at least two distinct pathways for the vocal control of song that con­ verge on the song motor cortex. One pathway is strongly influenced by the cor­ tical premotor area HVC and the other has strong influences from the lateral magnocellular nucleus of the anterior nidopallium (LMAN). An oversimplifica­ tion of the contributions of these two anatomical regions is that one controls the memorized song (HVC) and the other influences the variability of the pitch and amplitude of productions (LMAN). The role of auditory feedback in mediating the influence of these two systems is intriguing. It has been suggested that auditory feedback may influence the gain of the variability system (Bertram et  al., 2014). When birds are deafened, the production of structured song is impaired (e.g. dropped syllables and deteriorated structure of syllables), but later, when LMAN is ablated, the effects of deafening are reversed, at least in those with moderate decline (Nordeen & Nordeen,  2010), and variability is reduced. The authors conclude that deafening induces song deterioration and LMAN activity contributes to that degradation. Thus, in birds, there are neural systems such as LMAN that play a direct role in determining the amount of vocal variability.

Overall, the data indicate that hearing is vital for vocal learning but also that the maintenance of accurate vocal production relies on auditory feedback. Without this auditory feedback, sound production changes across the full range of attributes (e.g. precision of frequency, timing, consistency).

Real-time manipulations of auditory feedback

Separate from clinical evidence, behavioral studies of auditory feedback in speech have been carried out for more than a century. In 1911 the otolaryngologist Étienne Lombard published "Le signe de l'élévation de la voix" ("The symptom of the raised voice"; Lombard, 1911), in which he noted a patient's tendency to speak more loudly when a loud noise was transmitted to one ear. This became the first published evidence for a feedback mechanism by which real-time speech perception could influence speech production (Brumm & Zollinger, 2011) and, more than 100 years later, the Lombard effect remains the most persistent and robust feedback phenomenon within psycholinguistic speech production research. A notable feature of real-time speech corrections is that they appear to be largely involuntary and often occur without awareness. In one study, speakers who wore headphones persisted in raising their volume when loud noises were played, even when informed by an interviewer that they were doing so (Mahl, 1972). While learned inhibition of the Lombard effect in humans is possible (Pick et al., 1989), it remains persistent in spontaneous speech and has been observed in young children (Siegel et al., 1976) as well as Old World monkeys (Sinnott, Stebbins, & Moody, 1975), whales (Parks et al., 2011), and a multitude of songbird species (see Cynx et al., 1998; Kobayasi & Okanoya, 2003; Leonard & Horn, 2005). Other types of speech feedback distortion also show compensatory responses. In a common paradigm, speakers' formants (resonances of the vocal tract) are adjusted away from what is actually being produced – for example, a speaker might produce the vowel /ε/, and hear themselves say the vowel /æ/. In response, the speaker may compensate by shifting their own production in the opposite direction in frequency. In this example, in compensation they might produce a vowel closer to /ɪ/ (Houde & Jordan, 1998; Purcell & Munhall, 2006). Interestingly, such compensation is often incomplete, such that the relative magnitude of the response is less than the magnitude of the perturbation and individual variability is considerable (MacDonald, Purcell, & Munhall, 2011). In Figure 4.1, perturbations of the first formant (F1) and average compensations (dots) are shown (MacDonald, Goldberg, & Munhall, 2010). Three perturbations to F1 are introduced in steps over a series of trials. The dots show that, on average, subjects responded in a manner that counteracted the perturbation. However, as can be seen, even for the smallest perturbation of 50 Hz, the compensation is incomplete. Subjects make changes less than this even though they are capable of making a compensation large enough to correct this error, as evidenced by their response to the 200 Hz perturbation at the end of the series. Vocal-pitch perturbations produce the same pattern of partial compensation and individual variability. When the fundamental frequency (F0) is raised or

Figure 4.1  Perturbation (solid line) and average compensation (dots) of first formant frequency in hertz. The frequencies have been normalized to the mean of the baseline phase (Source: Adapted from MacDonald, Goldberg, & Munhall, 2010).

lowered, talkers tend to compensate by producing speech with F0 shifts in the opposite direction in frequency to the perturbation (Burnett et al., 1998; Jones & Keough, 2008). Such pitch compensations can be reduced but not eliminated with specific instruction in conjunction with intensive training (Zarate & Zatorre, 2008). As in the Lombard effect, compensation in response to formant and pitch pertur­ bations appears to be largely automatic (Munhall et al., 2009). In birdsong, feedback perturbations result in similar responses. Pitch shifting single notes yields a compensatory response wherein vocal output shifts in the direction opposite to the perturbation (Sober & Brainard, 2009). As with humans, this response is often incomplete. Sober and Brainard (2009) found that a 100 percent pitch shift yielded a 50 percent change in response on average; however, contrary to the pattern observed in humans, this compensation is not immediate. In the same experiment, Sober and Brainard (2009) found that pitch shifts developed across a two‐week period, and that, once the pitch shift stimulus was removed, return to baseline was gradual. In humans, compensations in response to feedback perturba­ tions are observed within single testing sessions (see Purcell & Munhall,  2006;

Perceptual Control of Speech  103 Terband, van Brenk, & van Doornik‐van der Zee, 2014; Zarate & Zatorre, 2008) and even single trials (Tourville, Reilly, & Guenther, 2008), and speech acoustics return to baseline slowly within a session after removal of the perturbation stimulus (Purcell & Munhall, 2006). The reasons for these interspecies differences are unclear; how­ ever, the evidence overwhelmingly supports the notion that both humans and song­ birds actively correct for “errors” in vocal production, comparing vocal output to some form of target in real time. A notable exception to direct compensation occurs in response to delayed auditory feedback (DAF), wherein time delays are introduced between speech production and audition. DAF is nearly always followed by errors and interrupted flow of speech. In unaltered speech, the delay between speaking and hearing one’s own speech is about 1 millisecond (Yates, 1963). When this interval is artificially lengthened, numerous speech changes are introduced: vocal intensity rises, pro­ duction speed slows, and stuttering or word repetitions are common (Chase et al., 1961). In birdsong, DAF yields similar errors as in humans: zebra finches produce more frequent stuttering (more repetitions of introductory notes) and more syl­ labic omissions when feedback is delayed (Cynx & von Rad, 2001). One of the unique aspects of DAF is that it is not something that can be readily compensated for. Unlike feedback for vocal pitch, loudness, spectral detail, or even the detailed timing of the utterances (e.g. Mitsuya, MacDonald, & Munhall,  2014), all of which define the intentional characteristics of the signal, DAF is an indicator of the transmission speed of the sensorimotor organization. As such, feedback timing acts as a constraint on the use of speech motor feedback. Recently, Mitsuya, Munhall, and Purcell (2017) showed that the amount of compensation for perturbed formant frequency decreased linearly with delay in feedback. In this study a 200 Hz perturbation to F1 auditory feedback was intro­ duced with 100 ms delay in feedback. Every 10 trials the delay was reduced by 10 ms though the magnitude of the frequency perturbation remained constant. The magnitude of F1 compensation grew as the delay was reduced. These findings demonstrate that auditory feedback beyond a temporal window ceases to play its role as an effective control signal for speech production. Collectively, these findings provide consistent support for the importance of auditory feedback for the development and maintenance of spoken language. This feedback processing is evident for a variety of attributes of spoken language and the data imply the existence of some form of articulatory/acoustic goals that are supported by perceptual feedback. However, the mechanisms underlying this process remain unclear.
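The behavioral logic shared by these perturbation studies can be made concrete with a toy simulation. The sketch below is not any published model of speech motor control: it simply treats each production as a compromise between the talker's habitual target and a correction that opposes the perceived formant error, with a gain below 1 (so compensation is partial, as in Figure 4.1) and an effective gain that is assumed to fall off linearly with feedback delay (echoing Mitsuya, Munhall, & Purcell, 2017). The baseline F1, gain, delay limit, and noise level are illustrative values, not estimates from data.

```python
import numpy as np

def simulate_compensation(perturb_hz=50.0, gain=0.5, delay_ms=0.0,
                          delay_limit_ms=100.0, n_trials=120,
                          noise_sd=8.0, seed=1):
    """Toy trial-by-trial account of formant compensation (illustrative only).

    gain     : strength of the corrective response; values below 1 give the
               partial compensation observed empirically
    delay_ms : auditory feedback delay; the effective gain is assumed to
               fall linearly to zero at delay_limit_ms
    """
    rng = np.random.default_rng(seed)
    effective_gain = gain * max(0.0, 1.0 - delay_ms / delay_limit_ms)
    baseline_f1 = 700.0                    # habitual F1 in Hz (illustrative)
    f1 = baseline_f1
    produced = np.empty(n_trials)
    for t in range(n_trials):
        heard = f1 + perturb_hz            # altered auditory feedback
        error = heard - baseline_f1        # mismatch with the auditory target
        # next production: anchored at baseline, pushed against the error
        f1 = baseline_f1 - effective_gain * error + rng.normal(0.0, noise_sd)
        produced[t] = f1
    return produced

if __name__ == "__main__":
    for delay in (0.0, 50.0, 100.0):
        trace = simulate_compensation(delay_ms=delay)
        compensation = 700.0 - trace[-20:].mean()   # shift opposing the perturbation
        print(f"feedback delay {delay:5.1f} ms -> compensation ~ {compensation:4.1f} Hz")
```

With these settings the simulated talker opposes roughly a third of a 50 Hz perturbation when feedback is immediate, and progressively less as the delay approaches the assumed 100 ms limit.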

Models of feedback processing

Computational processing of feedback

In functional terms, there are a number of models that could account for these data but also some classes of model that are clearly inadequate. A servomechanism in which

104  Sensing Speech behavior is controlled directly by feedback is too slow to modify the rapid movements of articulation (Lashley, 1951) and such control systems are known to have stability issues. An opposite theoretical approach places more reliance on memory‐based movement control. The idea of a motor program (Schmidt, 1980) is that movements are driven by a detailed motor representation and unfold without reference to sensa­ tion. Such programs cannot account for the adaptive timing and the flexibility of movement, and thus are viewed as too rigid to account for skilled movement data. More recent models suggest a more intricate role for sensory feedback. In such frame­ works, auditory feedback is used to establish auditory target regions and to learn and maintain “forward models” that predict the consequences of behavior (e.g. Kawato, 1990). In part, these more computational models were anticipated by earlier physiological ideas about efference copy and corollary discharge. The term efference copy is a direct translation of the German Efferenzkopie, intro­ duced by von Holst and Mittelstaedt in 1950 to explain how we might distinguish changes in visual sensations due to our own movement and changes in visual sen­ sations due to movement of the world. Crapse and Sommer (2008) consider corollary discharge (coined by Sperry in the same year, 1950) to be the more general term. Corollary discharges are viewed as copies of motor commands sent to any sensory structures, while efference copies were thought to be sent only to early or primary sensory structures. Two current types of neurocomputational models of speech production differ­ entiate how such corollary discharges and sensory feedback could influence speech. The Directions into Velocities of Articulators (DIVA) model and its extension, the Gradient Order DIVA (GODIVA) model, use the comparison of overt auditory feedback to auditory target maps as the mechanism to control speech errors (Guenther & Hickok, 2015). The auditory target maps can be under­ stood as the predictions of the sensory state following a motor program. These predictions are also the goals represented in the speech‐sound map, where a speech sound is defined as a phonetic segment with its own motor program. This model requires two sensory‐to‐movement mappings to be learned in development. The speech‐sound map must be mapped to appropriate movements in what is considered a forward model. When errors are detected by mismatches between feedback and predicted sensory information, a correction must be generated. The sensorimotor mapping responsible for such corrective movements is considered an inverse model. In contrast, the state feedback control model of speech production (SFC), or its extension, the hierarchical state feedback control model (HSFC) assumes an addi­ tional internal feedback loop (Hickok, 2012; Houde & Nagarajan, 2011; Houde & Chang, 2015). Similar to the DIVA models, the SFC models incorporate a form of corollary discharge. One critical difference is that the corollary discharge in SFC models is checked against an internal target map rather than overt auditory feedback (i.e. a prediction of speech errors is generated and thus provides a mech­ anism to prevent such errors). Overt auditory feedback is included in the model through its influence on how the speech‐error predictions are converted into cor­ rections (Houde & Nagarajan, 2011).
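The difference between the two architectures can be shown schematically. The fragment below is not an implementation of DIVA or of the SFC model; the linear forward model, the target value, and the perturbation are stand-ins chosen only to mark where each account locates its error signal: a DIVA-style correction compares overt auditory feedback with an auditory target, whereas an SFC-style correction first compares a corollary-discharge prediction with an internal target and then uses overt feedback to correct that prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_model(motor_command):
    """Predicted auditory consequence (an F1 value in Hz) of a motor command.
    A learned mapping in the real models; a linear stand-in here."""
    return 300.0 + 500.0 * motor_command

def vocal_tract(motor_command, perturbation=0.0):
    """The actual plant: what the talker hears, possibly perturbed."""
    return forward_model(motor_command) + perturbation + rng.normal(0.0, 5.0)

auditory_target = 550.0   # target F1 for the intended vowel (illustrative)
motor_command = 0.5       # arbitrary units
perturbation = 100.0      # experimenter-imposed F1 shift in the headphones

# DIVA-style error: overt auditory feedback compared with the target map.
heard = vocal_tract(motor_command, perturbation)
diva_error = heard - auditory_target           # drives a corrective movement

# SFC-style error: the corollary discharge of the command is run through the
# forward model and checked against an internal target before feedback
# arrives; overt feedback then corrects the internal estimate.
predicted = forward_model(motor_command)       # corollary-discharge prediction
internal_error = predicted - auditory_target   # can pre-empt a predicted error
prediction_error = heard - predicted           # feedback updates the estimate

print(f"DIVA-style auditory error:      {diva_error:6.1f} Hz")
print(f"SFC internal (predicted) error: {internal_error:6.1f} Hz")
print(f"SFC feedback prediction error:  {prediction_error:6.1f} Hz")
```

In this contrived example the internal loop sees no error (the command is well tuned), while the overt-feedback terms register the experimenter's perturbation; the two accounts differ mainly in which of these signals gates the correction and when.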

Perceptual Control of Speech  105 Both of these models incorporate major but slightly different roles for auditory feedback in speech production. Such models play an important role in advancing our understanding of articulation but also have inherent problems that will make it difficult to unravel the exact form of the speech‐control system. Strengths include systematic frameworks for summarizing a large body of findings in the field and the ability to make novel predictions that lead to specific test experiments. However, all models make assumptions about the structure and processes involved in behavior. Auditory target maps and internal feedback loops, for example, are hypothetical constructs that are far from being rigorously supported. In addition, there is the challenging issue of how many levels of description (e.g. Marr’s com­ putational, representational, and implementation levels) are needed and how the relationship between these levels can best be studied (see Peebles & Cooper, 2015, and other papers in the same issue). One reductionist approach is to look for neural evidence that might correspond with the computational architectures or constrain the behavior of the models. Both groups of modelers have pursued this approach.

Neural processing of feedback

There is an extensive literature on the neural substrates supporting speech production (see Guenther, 2016, for a review). Much of this is based on mapping the speech-production network using fMRI (Guenther, Ghosh, & Tourville, 2006). Our focus here is more narrow – how speech sounds produced by the talker are dealt with in the nervous system. The neural processing of self-produced sound necessitates mechanisms that allow the differentiation between sound produced by oneself and sound produced by others. Two coexisting processes may play a role in this: (1) a perceptual suppression of external sound and voices, and (2) specialized processing of one's own speech (Eliades & Wang, 2008). Cortical suppression has unique adaptive functions depending on the species. In nonhuman primates, for example, the ability to discern self-vocalization from external sound serves to promote antiphonal calling whereby the animal must recognize their species-specific call and respond by producing the same call (Miller & Wang, 2006). Takahashi, Fenley, and Ghazanfar (2016) have invoked the development of self-monitoring and self-recognition as essential in developing coordinated turn taking in marmoset monkeys. Vocal production in nonhuman primates has long been associated with suppressed neural firing in the auditory cortex (Muller-Preuss & Ploog, 1981), which occurs just prior to the onset of vocalization (Eliades & Wang, 2003). The same effect has been shown in humans, whereby vocalization led to suppression of one third of superior temporal gyrus neurons during open brain surgery (Creutzfeldt, Ojemann, & Lettich, 1989). This suppression preceded vocalization by approximately 100 ms and subsided about 1 second post-vocalization (Creutzfeldt, Ojemann, & Lettich, 1989). In contrast, when another person spoke in the absence of self-vocalization by the recorded individual, temporal gyrus activity was not suppressed. Therefore, it is postulated that the same cortical regions that suppress auditory stimuli are responsible for the production and control of speech.

106  Sensing Speech In terms of auditory feedback, studies using unanesthetized marmoset monkeys have confirmed that specific neuron populations in the auditory cortex are sensitive to vocal feedback whereas others are not (Eliades & Wang, 2008). Neurons that are suppressed during speech production show increased firing in response to altered feedback, and thus appear to be more sensitive to errors during speech production (Eliades & Wang, 2008). At the same time, a smaller proportion of neu­ rons that are generally excited during production show reduced firing in response to altered feedback (Eliades & Wang,  2008). Although these neural response changes could in principle be due to changes in the vocal signal as a result of feedback perturbations, playing recordings of the vocalizations and altered vocal­ izations does not change the differential neuronal firing pattern in response to altering the sound (Eliades & Wang, 2008). Muller‐Preuss and Ploog (1981) found that most neurons in the primary auditory cortex of unanesthetized squirrel monkeys that were excited in response to a playback of self‐vocalization were either weakened or completely inhibited during phonation. However, approximately half of superior temporal gyrus (pri­ mary auditory cortex) neurons do not demonstrate that distinction (Muller‐Preuss & Ploog,  1981). This ultimately reflects phonation‐dependent suppression in specific populations of auditory cortical neurons. Electrocorticography data in humans has also supported the idea that specific portions of the auditory cortex are supporting auditory feedback processing (Chang et al., 2013). In a magnetoencephalography (MEG) study, Houde and colleagues (2002) investigated directly whether vocalization‐induced auditory cortex suppression resulted from a neurological comparison between an incoming signal (auditory feedback) and an internal “prediction” of that signal. They created a discrepancy, or “mismatch,” between the signal and expectation by altering the auditory feedback. Specifically, participants heard a sum of their speech and white noise that lasted the duration of their utterance. The authors found that, when feedback was altered using the gated noise (speech plus white noise), self‐produced speech no longer suppressed M100 amplitude in the auditory cortex. Suppression was observed during normal self‐produced speech. Therefore, these findings support a forward model whereby expected auditory feedback during talking produces cortical suppression of the auditory cortex. In order to determine whether a forward model system truly regulates cortical suppression of the human auditory cortex during speech production, Heinks‐ Maldonado and colleagues (2005) examined event‐related potentials (N100) during speech production. Like Houde et al. (2002), they found that the amplitude of N100 was reduced in response to unaltered vocalization relative to both pitch shifted and speech from a different voice. Furthermore, during passive listening, neither perturbation produced any N100 amplitude differences. This suggests that suppression of the auditory cortex is greatest when afferent sensory feedback matches an expected outcome specifically during speech production (Heinks‐ Maldonado et al., 2005). Functional magnetic resonance imaging (fMRI) studies have broadly concurred with the electrophysiological evidence. For example, Tourville, Reilly, and Guenther

Perceptual Control of Speech  107 (2008) compared the blood‐oxygen‐level‐dependent (BOLD) response in trials when there was a first formant feedback shift to trials in which there was no modi­ fication of the auditory feedback. This comparison showed activation in posterior temporal regions consistent with previous findings in which noise masked the speech (Christoffels, Formisano, & Schiller, 2007), auditory feedback was delayed (Hashimoto & Sakai, 2003), vocal pitch feedback was altered (Zarate & Zatorre, 2008 and MEG studies of pitch shifts: Franken et al., 2018). Tourville, Reilly, and Guenther (2008) also reported greater activity in the shift–no shift comparison in the right‐ hemisphere ventral premotor cortex for trials in which the first formant was shifted. They interpreted their findings as support of the DIVA model components that per­ form auditory‐error detection and compensatory motor responses. These studies and many others support the existence of neural mechanisms that use the auditory character of the talker’s speech output to control articulation. However, the challenge of mapping high‐level computational models to behavioral and neural data remains. The necessity of different levels of description and the units within the levels are difficult to determine. In short, while there may be only a single neural architecture that manages fluent speech, many abstract cognitive models could be consistent with this architecture (see Griffiths, Lieder, & Goodman, 2015, for a discussion of constraints on cognitive modeling). An addi­ tional approach is to examine ontogeny for relationships between perception and production.

Auditory feedback and vocal learning

Much is made about the uniqueness of human language and, at the speech level, the frequent focus of these uniqueness claims is on the perceptual skills of the developing infant. However, the less emphasized side of communication, speaking, is clearly a specialized behavior. Humans are part of a small cohort of species that are classified as vocal learners and who acquire the sounds in their adult repertoire through social learning (Petkov & Jarvis, 2012). This trait seems to be an example of convergent evolution in a few mammalian (humans, dolphins, whales, seals, sea lions, elephants, and bats) and bird (songbirds, parrots, and hummingbirds) species. The behavioral similarities shown by these disparate species are mirrored by their neuroanatomy and gene expression. In a triumph of behavioral, genetic, and neuroanatomic research, a consortium of scientists has shown similarities in brain pathways for vocal learners that are not observed in species that do not learn their vocal repertoires (Pfenning et al., 2014). As shown by the studies of deafness summarized earlier, hearing is vital for vocal learning. However, the role that auditory feedback plays in speech development is unclear. There are few human developmental studies that manipulate feedback in the early stages of speech development. Jones and colleagues (Scheerer, Liu, & Jones, 2013; Scheerer, Jacobson, & Jones, 2016, 2019) have shown that children as young as two years of age show compensation for F0 perturbations. However, feedback perturbation of segmental properties such as vowel

108  Sensing Speech formant frequency shows a different pattern of results. MacDonald et  al. (2012) tested children at two and four years of age, as well as adults in a formant feedback perturbation paradigm. By the age of four, young children acted like adults and partially compensated in response to F1 and F2 perturbations. At the age of two, however, the toddlers showed two significant patterns in response to feedback perturbations (see Figure 4.2). On average there was no evidence of compensatory behavior when the two‐year‐old children were presented with altered feedback of vowel formants. Further, they produced utterances that were remarkably variable. Variability is one of the hallmarks of early speech and birdsong development. But what role does this variability play in development and does feedback processing have a developmental profile? We will return to this question. The DIVA model proposes that early vocal development involves a closed‐loop imitation process driven initially by two stages of babbling and later a vocal learning stage that directly involves corrective feedback processing. In the first babbling phase, random articulatory motions generate somatosensory and acoustic consequences, and a mapping of the developing vocal tract as a speech device takes place. Separately, infants learn the perceptual categories of their native lan­ guage. This is crucial to the model. As Guenther (1995, p. 599) states, “the model starts out with the ability to perceive all of the sounds that it will eventually learn to produce.” When the second stage of babbling begins, it involves the mapping between phonetic categories developed through perceptual learning and articula­ tion. The babbling during this period tunes the feedback system to permit corrective responses to detected errors. In the next imitation phase, infants hear adult syllables that they try to reproduce. In a cyclical process involving sensory feedback from actual productions and better feedforward commands, the system shapes the early utterances toward the native language. Simulations by Guenther and his students support the logic of this account. However, there are significant concerns. First among these is that the data supporting this process are weak or, in the case of MacDonald et al.’s (2012) results, contradict the hypothesis. Early speech feedback processing and the shaping of speech production targets is not well attested. Second, the proposal relies on a strong relationship between the representations of speech perception and speech production. Surprisingly, this relationship is controversial. Models like DIVA predict a phenomenon that is frequently assumed but is not strongly supported by actual data – babbling drift. The hypothesis that the sounds of babbling drift over time was first proposed by Roger Brown (1958). Brown suggested that the phonetic repertoire in the babbling of infants slowly begins to resemble the phonetics of the language environment that they are exposed to and begins to not include sounds that are absent from the native language. As the review by Best et al. (2016) indicates, the support for this idea is mixed, particu­ larly from transcription studies and perceptual studies where naive listeners attempted to identify the language environment of the infant’s babbling. There is broad agreement that early babbling has common characteristics across languages and a somewhat limited phonetic repertoire. The evidence from later babbling lacks this broad consensus. 
Older transcription studies of late babbling

Figure 4.2  Average F1 (circles) and F2 (triangles) frequencies estimates across time for adults (top panel), young children (middle panel), and toddlers (bottom panel). The formant frequencies have been normalized to the average baseline frequencies. The shaded area indicates when subjects were given altered auditory feedback (from MacDonald et al., 2012). Source: MacDonald et al., © 2012, Elsevier.

110  Sensing Speech were often plagued by small sample sizes and bias issues inherent in transcription. Studies with larger sample sizes, however, still show conflicting patterns of results. For example, de Boysson‐Bardies and Vihman (1991) reported that the prevalence of consonants of different manners and places of articulation in the babbling of 12‐month old infants from English, French, Japanese, and Swedish homes corresponded to the distributions of consonants in their language environments. In contrast, a number of other transcription studies have failed to find such differ­ ences (e.g. Kern, Davis, & Zink, 2009; Lee, Davis, & MacNeilage, 2010). Another approach has been to use recordings of infants babbling as perceptual stimuli and ask adult listeners to categorize what native language the infants have. These studies have also shown mixed results, with some experiments reporting that listeners can discriminate the home language of the infants (e.g. de Boysson‐ Bardies, Sagart, & Durand,  1984) while others showed no perceptual difference (e.g. Thevenin et al., 1985). The more serious concern about these studies is that listeners were likely tuning into prosodic differences in the babbling rather than the segmental differences that would be predicted by babbling drift. The ability to perceptually distinguish the language of babbling has been shown for low‐pass filtered stimuli (e.g. Whalen, Levitt, & Wang, 1991) and this supports the idea that it is prosodic differences that are driving these results. A recent controlled study (Lee et al., 2017) with a large number of stimuli found that perceptual categoriza­ tions of Chinese‐ and English‐learning babies’ utterances at 8, 10, and 12 months of age were only reliable for a small subset of the stimuli (words or canonical syl­ lables that resembled words). These effects were modest and suggest that early lexical development rather than babbling may be where the home language shows its earliest influence. Direct measurements of babbling acoustics have shown evidence for babbling drift, albeit only small effects. For example, Whalen, Levitt, and Goldstein (2007) measured voice onset time (VOT) in French‐ and English‐learning infants at ages 9 and 12 months. There were no differences in VOT or in the duration of prevoicing that was observed. However, there was a greater incidence of prevoicing in the French babies which corresponds to adult French–English differences. The most serious concern from the existing data is that there is no evidence for speech‐production tuning of targets based on production errors. MacDonald et al. (2012) data suggest that young children do not correct errors. However, there are several caveats to that conclusion. First, the magnitude of the perturbation may have to be within a critical range and the perturbations for all ages in MacDonald et al. were the same in hertz. It is possible that younger children require larger perturbations to elicit compensations. A related issue is that the perturbations may have been within the noisy categories that the children were producing. The variability of production may be an indicator of the category status. However, even if this were true, it begs the question: How could an organism learn to produce adult targets under these conditions? The challenges are enormous. Juveniles in all species have vocal tracts that do not match their parents’ vocal tracts. Birds and other species show marked production variance as juveniles (e.g. Bertram et al., 2014). 
There is no obvious feedback-based mechanism

Perceptual Control of Speech  111 that permits the mapping from adult targets to young productions (see Messum & Howard, 2015). Error correction as normally envisioned in motor control may not be engaged.

Perception–production interaction

This puzzle reflects the general problem of understanding the relationship between the processes of listening to speech and producing it. Liberman (1996, p. 247) stated:

In all communication, sender and receiver must be bound by a common understanding about what counts; what counts for the sender must count for the receiver, else communication does not occur. Moreover, the processes of production and perception must somehow be linked; their representation must, at some point, be the same.

This is certainly true in a very general sense but the roles played in communica­ tion by the auditory signal that reaches the listener and by the signal that reaches the speaker are dramatically different. For the listener, the signal is involved in categorical discrimination and information transmission, while for the talker the signal is primarily thought to influence motor precision and error correction. These two issues are not independent but are far from equivalent. The problem for researchers is that the perception and production of speech are so intrinsically intertwined in communication that it is difficult to distinguish the influence of these “two solitudes” of speech research on spoken language. While historically the relationship between speech perception and production has been implicated as explanations of language change, patterns of language dis­ order, and the developmental time course of speech acquisition, there has been little comprehensive theorizing about how speech input and output interact (Levelt, 2013). Recently, Kittredge and Dell (2016) outlined three stark hypotheses about the relationship between speech perception and production. In their view, the representations for perception and production could be completely separate, absolutely inseparable, or separable under some if not many conditions. A number of different types of experimental evidence might distinguish these possibilities, including (1) data that examine whether learning/adaptation changes in perception influence production and vice versa; (2) correlational data showing individual differences in the processing of speech perception and production (e.g. perceptual precision and production variability); and (3) data showing interfer­ ence between the two processes of perception and production.

Learning/adaptation changes

In speech perception, selective adaptation for both consonants and vowels results in changes to category boundaries after exposure to a repetitive adapting stimulus.

Cooper (1974; Cooper & Lauritsen, 1974) reported production changes in produced VOT following repeated presentation of a voiceless adapting stimulus. In a manner similar to the effects of selective adaptation on perceptual category boundaries, talkers produced shorter VOTs after adaptation. The effect was attributed to a perceptuomotor mechanism that mediates both the perception and the production of speech. More recently, Shiller et al. (2009) found that, when subjects produced fricatives with frequency-altered feedback, they produced fricatives that compensated for the perturbation. Most interestingly, the subjects shifted their perceptual boundary for /s/–/sh/ identification following this production perturbation. However, as Perkell (2012) cautions, the segment durations of the fricatives were far beyond the natural range, raising the possibility that the effect was acoustic rather than phonetic. Lametti et al. (2014) showed the opposite direction of influence. A perceptual training task designed to alter perceptual boundaries between vowels preceded a production task. No shift was observed in baseline vowel formant values but a difference was observed in the magnitude of compensation to F1 perturbations. Oddly, this difference was observed in a follow-up days later. The persistence is surprising for a number of reasons. First, the speech adaptation effects produced by formant shifts themselves drift away relatively quickly within an experimental session following return to normal feedback. Second, the perceptual training did not influence baseline vowel production immediately after training or days later. The influence of perceptual change on production is shown only in the magnitude of compensation (i.e. in the behavior of the auditory feedback processing system). Finally, the length of effect is noteworthy. While it is not unheard of for perceptual effects to persist across many days, it is not common; the McCollough effect in vision has been shown to last for months after a 15-minute training period (Jones & Holding, 1975). However, the reinforcement learning paradigm used by Lametti et al. (2014) is considerably different from the adaptation approach used in other studies and suggests a more selective influence on the perception–production linkage. The published data suggest modest effects from speech-perception training on speech production and vice versa. As Kittredge and Dell (2016) suggest, the pathway for exchange between the input and output systems may be restricted to a small set of special conditions. Kittredge and Dell suggest that one possibility is that perceptual behavior that involves prediction invokes the motor system and this directly influences production. A separate line of research has suggested this influence may exist but has shown similar, small effect sizes in experiments. In the study of face-to-face conversations, considerable theoretical proposals support the idea that interlocutors align their language at many levels (Garrod & Pickering, 2004). At the phonetic level, the findings have been weak but consistent. Few acoustic findings support alignment but small perceptual effects have been frequently reported (Pardo et al., 2012; Kim, Horton, & Bradlow, 2011). The surprising aspect of these findings is the small effect size. Given the proposed importance of alignment in communication (and the proposed linkage between perception and production; Pickering & Garrod, 2013), the small influence is problematic.


Correlational data

The data linking perception and production within individuals are also surprisingly sparse. Most of the data show that talkers' perception and production categories are somewhat similar. For example, Newman (2003) found small correlations between the VOT prototypes of listeners and their production VOT values (accounting for approximately 27 percent of the variance). However, Frieda et al. (2000) did not find such a correlation for the perceptual prototype for the vowel /i/ and production values. Fox (1982) showed that the factor analysis dimensions derived from listeners' judgments of similarity between vowels could be predicted by the acoustics of vowels produced by the participants, but only by the corner vowels /i, u, ɑ/. Bell-Berti et al. (1979) categorized the manner in which participants produced the tense/lax distinction in front vowels based on their examination of electromyographic recordings. They later found that those participants who used a tongue-height production strategy showed larger boundary shifts in an anchoring condition in a vowel-perception test than those who used a muscle tension implementation of tense/lax. Perkell et al. (2004) also grouped participants on the basis of measurements of production data, and found that these groups performed differently in perception tests. The more distinct the production contrasts between two vowels that talkers produced, the more likely those subjects were able to distinguish tokens in a continuum of those vowels. The most recent evidence in support of this correlational relation between perception and production abilities comes from Franken et al. (2017). In this study, production variability for vowel formant values was measured and the ability to discriminate between vowel tokens assessed. These two variables were found to correlate in their data. However, the correlations are modest and smaller than those reported by Perkell et al. (2004). The argument put forward in Franken et al. (2017) is that talkers with better perceptual acuity are less variable in production and that these talkers are more sensitive to feedback discrepancies. Indeed, Villacorta, Perkell, and Guenther (2007) showed a greater response to formant perturbation in subjects who had greater acoustic acuity. However, this finding is inconsistent with MacDonald, Purcell, and Munhall's (2011) meta-analysis of the variability of production and compensation magnitude in F1 and F2 for 116 subjects. The lack of relationship between variability and compensation observed by MacDonald et al. is important given the large sample size considered in their analysis.
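The scale of these effects is easiest to appreciate as shared variance. The short sketch below runs the kind of correlational analysis these studies report, but on synthetic data: the sample size echoes MacDonald, Purcell, and Munhall (2011), while the units, the coefficients, and therefore the resulting correlation are fabricated for illustration and imply nothing about the real populations.

```python
import numpy as np

rng = np.random.default_rng(42)
n_talkers = 116   # sample size echoing MacDonald, Purcell, & Munhall (2011)

# Synthetic per-talker measures (units and effect sizes are illustrative):
# perceptual acuity as a formant discrimination threshold (Hz; lower = sharper)
acuity = rng.normal(20.0, 5.0, n_talkers)
# production variability as the SD of produced F1 (Hz), weakly tied to acuity
production_sd = 25.0 + 0.4 * (acuity - 20.0) + rng.normal(0.0, 6.0, n_talkers)

r = np.corrcoef(acuity, production_sd)[0, 1]
print(f"Pearson r = {r:.2f}, shared variance r^2 = {r**2:.2f}")
# With coefficients of this size, r comes out modest (around 0.2-0.4), leaving
# most of the variance in production variability unexplained -- the sense in
# which even the positive findings reviewed above are weak.
```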

Interference effects

If the two processes of perception and production of speech share a common representation or resource, there should be evidence that the performance of one process can interfere with or in some cases enhance the performance of the other process. There are tantalizing findings of this kind in the speech perception–production literature. At the adult level, there are a series of imaging studies and transcranial magnetic stimulation (TMS) studies that document possible interference or enhancement effects (see Skipper, Devlin, & Lametti, 2017, for a review; cf.

114  Sensing Speech Hickok, 2014, for a critical review of findings that relate to the mirror neuron hy­ pothesis; Hickok, Holt, & Lotto, 2009, for criticism of motor influences on percep­ tion). Janet Werker and colleagues have reported a series of intriguing studies on early speech development that relate production to perceptual abilities (e.g. Bruderer et al., 2015). As has previously been shown by Werker and others, young infants are capable of perceptually making nonnative phonetic category distinc­ tions. Using unique methods, they have shown that the ability to make these per­ ceptual distinctions is reduced when the effector responsible for producing the distinction (i.e. the tongue in Bruderer et al.’s 2015 study) is interfered with by a custom‐designed soother. When a different soother, which did not interfere with the tongue’s movement, was used during the perceptual phase of the study, no perceptual interference was observed. Infants could perceive the lingual nonna­ tive contrast if the tongue was not constrained by a soother. These six‐month‐old infants had not learned their perceptual or productive native sound inventory at the point of testing. Yet, there appears to be a sensorimotor linkage between phonetic perception and the articulatory system. In a follow‐up study, Choi, Bruderer, and Werker (2019) showed that the perception of nonnative contrasts as shown by Bruderer et al., and native contrasts (e.g. a /b/–/d/ distinction) can be interfered with by soothers that block the movement of key articulators in the pro­ duction of the distinction (e.g. tongue tip, lips). These studies suggest a direct linkage between the speech motor and perception systems (Bruderer et al., 2015). Further, the results suggest an auditory–effector mapping that is available very early in development. Influences in early infant speech behavior have also been shown from the other direction. Speech‐production tendencies can be correlated with developing per­ ceptual abilities (e.g. Majorano, Vihman, & DePaolis, 2014; DePaolis, Vihman, & Keren‐Portnoy,  2011). Majorano, Vihman, and DePaolis (2014) tested children learning Italian at 6, 12, and 18 months. At the end of the first year, children whose production favored a single vocal motor pattern, showed a perceptual preference for sounds resulting from those speech movement patterns.

Conclusion
None of these findings, however, elucidate how auditory feedback processing develops, nor what role hearing your own productions plays in learning to pronounce words. The findings of MacDonald et al. (2012) raise the possibility that early productions by the child are not tuning the word-formation system on the basis of auditory-error corrections. Instead, the early focus may be on the adult models. Cooper, Fecher, and Johnson (2018) recently showed that two-and-a-half-year-old children preferred recordings of adults over recordings of their own and other toddlers' speech. In fact, the children showed no familiarity effect and thus did not prefer their mother's speech over another adult's. As Cooper, Fecher, and Johnson (2018) suggest, the driving force in lexical acquisition may be the adult targets and not the successive shaping of children's targets by corrective auditory feedback.

Howard and Messum (2011, 2014; Messum & Howard, 2015) have proposed a model of early word pronunciation that is consistent with this idea. According to their view, infants develop a set of vocal motor schemes (Vihman, 1991, 1993, 1996) during babbling. When early word learning takes place, infants are producing sounds that parents reinforce in context. By this account, it is the adult who initially solves the correspondence problem of perceiving the match between the child's production and the adult target. Adult modeling and reinforcement guide the infants' early lexical acquisition. Using their computational model, called Elija, Howard and Messum have shown that early productions can be acquired by the model learning the associations between "infant" motor patterns and the acoustics produced by the adult in that context.
One of the key requirements for successful reinforcement learning is exploration. Sampling the control space allows the organism to learn the value of a range of different actions. In vocal learning, variability of productions allows the learner to produce a broad range of activities that can be selectively reinforced. In adult movement, it may be desirable to reduce or control peripheral or execution variability (Harris & Wolpert, 1998). However, variability such as that generated in the birdsong circuitry indicates that this type of "noise" is an essential part of the sensorimotor system (Bertram et al., 2014). Motor learning may require this variability (Dhawale, Smith, & Ölveczky, 2017), and the variable behavior of toddlers learning to produce speech may reflect exactly the kind of exploratory practice required of vocal learners.
In summary, it is clear that hearing yourself speak is important to speech motor control. The precision and regularity of articulation seem to depend on auditory feedback. However, how this process works is less clear. The considerable hysteresis of degradation shown when profound hearing loss occurs, or when artificial modifications of speech feedback are introduced, suggests that auditory feedback may be an important stabilizing force in the long term but is not essential for moment-to-moment control. Further, the data show that hearing others speak involves a different process from the perception of the sound of your own speech. The loose relationship shown between perceptual and production individual differences is consistent with the hypothesis that perception and production are largely separate but can influence each other at times (Dell & Chang, 2014; Kittredge & Dell, 2016).

REFERENCES Békésy, G. v. (1949). The structure of the middle ear and the hearing of one’s own voice by bone conduction. Journal of the Acoustical Society of America, 21(3), 217–232. Bell‐Berti, F., Raphael, L. J., Pisoni, D. B., & Sawusch, J. R. (1979). Some relationships

between speech production and perception. Phonetica, 36(6), 373–383. Bertram, R., Daou, A., Hyson, R. L., et al. (2014). Two neural streams, one voice: Pathways for theme and variation in the songbird brain. Neuroscience, 277, 806–817.

116  Sensing Speech Best, C. T., Goldstein, L. M., Nam, H., & Tyler, M. D. (2016). Articulating what infants attune to in native speech. Ecological Psychology, 28(4), 216–261. Binnie, C. A., Daniloff, R. G., & Buckingham, H. W., Jr. (1982). Phonetic disintegration in a five‐year‐old following sudden hearing loss. Journal of Speech and Hearing Disorders, 47(2), 181–189. Borden, G. J. (1979). An interpretation of research on feedback interruption in speech. Brain and Language, 7(3), 307–319. Bridgeman, B. (2007). Efference copy and its limitations. Computers in Biology and Medicine, 37(7), 924–929. Brown, R. (1958). Words and things. New York: Free Press. Bruderer, A. G., Danielson, D. K., Kandhadai, P., & Werker, J. F. (2015). Sensorimotor influences on speech perception in infancy. Proceedings of the National Academy of Sciences of the United States of America, 112(44), 13531–13536. Brumm, H., & Zollinger, S. A. (2011). The evolution of the Lombard effect: 100 years of psychoacoustic research. Behaviour, 148(11–13), 1173–1198. Burnett, T. A., Freedland, M. B., Larson, C. R., & Hain, T. C. (1998). Voice F0 responses to manipulations in pitch feedback. Journal of the Acoustical Society of America, 103(6), 3153–3161. Chang, E. F., Niziolek, C. A., Knight, R. T., et al. (2013). Human cortical sensorimotor network underlying feedback control of vocal pitch. Proceedings of the National Academy of Sciences of the United States of America, 110(7), 2653–2658. Chase, R. A., Sutton, S., First, D., & Zubin, J. (1961). A developmental study of changes in behavior under delayed auditory feedback. Journal of Genetic Psychology, 99(1), 101–112. Choi, D., Bruderer, A. & Werker, J. (2019). Sensorimotor influences on speech perception in pre‐babbling infants: Replication and extension of Bruderer et al. (2015). Psychonomic Bulletin & Review, 26, 1388–1399.

Christoffels, I. K., Formisano, E., & Schiller, N. O. (2007). Neural correlates of verbal feedback processing: An fMRI study employing overt speech. Human Brain Mapping, 28(9), 868–879. Cooper, A., Fecher, N., & Johnson, E. K. (2018). Toddlers’ comprehension of adult and child talkers: Adult targets versus vocal tract similarity. Cognition, 173, 16–20. Cooper, W. E. (1974). Selective adaptation for acoustic cues of voicing in initial stops. Journal of Phonetics, 2(4), 303–313. Cooper, W. E., & Lauritsen, M. R. (1974). Feature processing in the perception and production of speech. Nature, 252(5479), 121–123. Cowie, R., & Douglas‐Cowie, E. (1992). Postlingually acquired deafness. New York: Mouton de Gruyter. Cowie, R., Douglas‐Cowie, E., & Kerr, A. G. (1982). A study of speech deterioration in post‐lingually deafened adults. Journal of Laryngology & Otology, 96(2), 101–112. Crapse, T. B., & Sommer, M. A. (2008). Corollary discharge across the animal kingdom. Nature Reviews Neuroscience, 9(8), 587–600. Creutzfeldt, O., Ojemann, G., & Lettich, E. (1989). Neuronal activity in the human lateral temporal lobe. Experimental Brain Research, 77(3), 451–475. Cynx, J., Lewis, R., Tavel, B., & Tse, H. (1998). Amplitude regulation of vocalizations in noise by a songbird, Taeniopygia guttata. Animal Behaviour, 56(1), 107–113. Cynx, J., & von Rad, U. (2001). Immediate and transitory effects of delayed auditory feedback on bird song production. Animal Behaviour, 62(2), 305–312. de Boysson‐Bardies, B., Sagart, L., & Durand, C. (1984). Discernible differences in the babbling of infants according to target language. Journal of Child Language, 11(1), 1–15. de Boysson‐Bardies, B., & Vihman, M. M. (1991). Adaptation to language: Evidence from babbling and first words in four languages. Language, 67(2), 297–319.

Perceptual Control of Speech  117 Dell, G. S., & Chang, F. (2014). The P‐chain: Relating sentence production and its disorders to comprehension and acquisition. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1634), 20120394. DePaolis, R. A., Vihman, M. M., & Keren‐ Portnoy, T. (2011). Do production patterns influence the processing of speech in prelinguistic infants? Infant Behavior & Development, 34(4), 590–601. Dhawale, A. K., Smith, M., & Ölveczky, B. (2017). The role of variability in motor learning. Annual Review of Neuroscience, 40, 479–498. Eliades, S. J., & Wang, X. (2003). Sensory‐ motor interaction in the primate auditory cortex during self‐initiated vocalizations. Journal of Neurophysiology, 89(4), 2194–2207. Eliades, S. J., & Wang, X. (2008). Neural substrates of vocalization feedback monitoring in primate auditory cortex. Nature, 453(7198), 1102–1106. Fox, R. A. (1982). Individual variation in the perception of vowels: Implications for a perception–production link. Phonetica, 39(1), 1–22. Franken, M. K., Acheson, D. J., McQueen, J. M., et al. (2017). Individual variability as a window on production–perception interactions in speech motor control. Journal of the Acoustical Society of America, 142(4), 2007–2018. Franken, M. K., Acheson, D. J., McQueen, J. M., et al. (2018). Opposing and following responses in sensorimotor speech control: Why responses go both ways. Psychonomic Bulletin & Review, 25(4), 1458–1467. Frieda, E. M., Walley, A. C., Flege, J. E., & Sloane, M. E. (2000). Adults’ perception and production of the English vowel /i/. Journal of Speech, Language, and Hearing Research, 43(1), 129–143. Garrod, S., & Pickering, M. J. (2004). Why is conversation so easy?. Trends in cognitive sciences, 8(1), 8–11. Griffiths, T. L., Lieder, F., & Goodman, N. D. (2015). Rational use of cognitive resources: Levels of analysis between the

computational and the algorithmic. Topics in Cognitive Science, 7(2), 217–229. Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102(3), 594–621. Guenther, F. H. (2016). Neural control of speech. Cambridge, MA: MIT Press. Guenther, F. H., Ghosh, S. S., & Tourville, J. A. (2006). Neural modeling and imaging of the cortical interactions underlying syllable production. Brain and Language, 96(3), 280–301. Guenther, F. H., & Hickok, G. (2015). Role of the auditory system in speech production. In M. J. Aminoff, F. Boller, & D. F. Swaab (Eds), Handbook of clinical neurology: Vol. 129. The human auditory system: Fundamental organization and clinical disorders (pp. 16–175). Amsterdam: Elsevier. Harris, C. & Wolpert, D. (1998). Signal‐ dependent noise determines motor planning. Nature, 394, 780–784. Hashimoto, Y., & Sakai, K. L. (2003). Brain activations during conscious self‐ monitoring of speech production with delayed auditory feedback: An fMRI study. Human Brain Mapping, 20(1), 22–28. Heinks‐Maldonado, T. H., Mathalon, D. H., Gray, M., & Ford, J. M. (2005). Fine‐ tuning of auditory cortex during speech production. Psychophysiology, 42(2), 180–190. Hickok, G. (2012). Computational neuroanatomy of speech production. Nature Reviews Neuroscience, 13(2), 135–145. Hickok, G. (2014). The myth of mirror neurons: The real neuroscience of communication and cognition. New York: W. W. Norton. Hickok, G., Holt, L. L., & Lotto, A. J. (2009). Response to Wilson: What does motor cortex contribute to speech perception? Trends in Cognitive Sciences, 13(8), 330–331. Houde, J. F., & Chang, E. F. (2015). The cortical computations underlying

118  Sensing Speech feedback control in vocal production. Current Opinion in Neurobiology, 33, 174–181. Houde, J. F., & Jordan, M. I. (1998). Sensorimotor adaptation in speech production. Science, 279(5354), 1213–1216. Houde, J. F., & Nagarajan, S. S. (2011). Speech production as state feedback control. Frontiers in Human Neuroscience, 5, 82. Houde, J. F., Nagarajan, S. S., Sekihara, K., & Merzenich, M. M. (2002). Modulation of the auditory cortex during speech: an MEG study. Journal of Cognitive Neuroscience, 14(8), 1125–1138. Howard, I. S., & Messum, P. (2011). Modeling the development of pronunciation in infant speech acquisition. Motor Control, 15(1), 85–117. Howard, I. S., & Messum, P. (2014). Learning to pronounce first words in three languages: An investigation of caregiver and infant behavior using a computational model of an infant. PLOS One, 9(10), e110334. Huizenga, T. (2016). Killing me sharply with her song”: The improbable story of Florence Foster Jenkins. NPR, August 10. Retrieved August 4, 2020, from www.npr. org/sections/deceptivecade nce/2016/08/10/488724807/ killing‐me‐sharply‐with‐her‐song‐the‐ improbable‐story‐of‐florence‐foster‐ jenkins. Jones, P. D., & Holding, D. H. (1975). Extremely long‐term persistence of the McCollough effect. Journal of Experimental Psychology: Human Perception and Performance, 1(4), 323–327. Jones, J. A., & Keough, D. (2008). Auditory‐ motor mapping for pitch control in singers and nonsingers. Experimental Brain Research, 190(3), 279–287. Kawato, M. (1990) Computational schemes and neural network models for formation and control of multijoint arm trajectory. In W. T. Miller III, R. S. Sutton, & P. J. Werbos (Eds.), Neural networks for

control (pp. 197–228). Cambridge, MA: MIT Press. Kern, S., Davis, B., & Zink, I. (2009). From babbling to first words in four languages. In F. d’Errico & J. M. Hombert (Eds), Becoming eloquent: Advances in the emergence of language, human cognition, and modern cultures. Philadelphia: John Benjamins. Kim, M., Horton, W. S., & Bradlow, A. R. (2011). Phonetic convergence in spontaneous conversations as a function of interlocutor language distance. Laboratory Phonology, 2(1), 125–156. Kittredge, A. K., & Dell, G. S. (2016). Learning to speak by listening: Transfer of phonotactics from perception to production. Journal of Memory and Language, 89, 8–22. Kobayasi, K. I., & Okanoya, K. (2003). Context‐dependent song amplitude control in Bengalese finches. Neuroreport, 14(3), 521–524. Lametti, D. R., Krol, S. A., Shiller, D. M., & Ostry, D. J. (2014). Brief periods of auditory perceptual training can determine the sensory targets of speech motor learning. Psychological Science, 25(7), 1325–1336. Lane, H., & Webster, J. W. (1991). Speech deterioration in postlingually deafened adults. Journal of the Acoustical Society of America, 89(2), 859–866. Lashley, K. S. (1951). The problem of serial order in behavior. In L. A. Jeffress (Ed.), Cerebral mechanisms in behavior (pp. 112–131). New York: Wiley. Leder, S. B., Spitzer, J. B., & Kirchner, J. C. (1987). Speaking fundamental frequency of postlingually profoundly deaf adult men. Annals of Otology, Rhinology & Laryngology, 96(3), 322–324. Lee, C. C., Jhang, Y., Chen, L. M., et al. (2017). Subtlety of ambient‐language effects in babbling: A study of English‐ and Chinese‐learning infants at 8, 10, and 12 months. Language Learning and Development, 13(1), 100–126.

Perceptual Control of Speech  119 Lee, S. A. S., Davis, B., & MacNeilage, P. (2010). Universal production patterns and ambient language influences in babbling: A cross‐linguistic study of Korean‐ and English‐learning infants. Journal of Child Language, 37(2), 293–318. Leonard, M. L., & Horn, A. G. (2005). Ambient noise and the design of begging signals. Proceedings of the Royal Society B: Biological Sciences, 272(1563), 651–656. Levelt, W. J. (1983). Monitoring and self‐ repair in speech. Cognition, 14(1), 41–104. Levelt, W. J. (2013). A history of psycholinguistics: The pre‐Chomskyan era. Oxford: Oxford University Press. Liberman, A. M. (1996). Speech: A special code. Cambridge, MA: MIT press. Lombard, E. (1911). Le signe de l’élévation de la voix. Annales des Maladies de l’Oreille et du Larynx, 37, 101–119. Lombardino, A. J., & Nottebohm, F. (2000). Age at deafening affects the stability of learned song in adult male zebra finches. Journal of Neuroscience, 20(13), 5054–5064. MacDonald, E. N., Goldberg, R., & Munhall, K. G. (2010) Compensation in response to real‐time formant perturbations of different magnitude. Journal of the Acoustical Society of America, 127, 1059–1068. MacDonald, E. N., Johnson, E. K., Forsythe, J., et al. (2012). Children’s development of self‐regulation in speech production. Current Biology, 22(2), 113–117. MacDonald, E. N., Purcell, D. W., & Munhall, K. G. (2011). Probing the independence of formant control using altered auditory feedback. Journal of the Acoustical Society of America, 129(2), 955–965. Mahl, G. F. (1972). People talking when they can’t hear their voices. In A. W. Siegman & B. Pope (Eds), Studies in Dyadic Communication (pp. 211–264). Oxford: Pergamon. Majorano, M., Vihman, M. M., & DePaolis, R. A. (2014). The relationship between infants’ production experience and their

processing of speech. Language Learning and Development, 10(2), 179–204. McGarr, N. S., & Harris, K. S. (1980). Articulatory control in a deaf speaker. Haskins Laboratories Status Report on Speech Research, 307–330. Retrieved August 4, 2020, from http://www. haskins.yale.edu/sr/SR063/SR063_18. pdf. Messum, P., & Howard, I. S. (2015). Creating the cognitive form of phonological units: The speech sound correspondence problem in infancy could be solved by mirrored vocal interactions rather than by imitation. Journal of Phonetics, 53, 125–140. Meyer, A. S., Huettig, F., & Levelt, W. J. (2016). Same, different, or closely related: What is the relationship between language production and comprehension? Journal of Memory and Language, 89, 1–7. Miller, C. T., & Wang, X. (2006). Sensory‐ motor interactions modulate a primate vocal behavior: Antiphonal calling in common marmosets. Journal of Comparative Physiology A, 192(1), 27–38. Mitsuya, T., MacDonald, E. N., & Munhall, K. G. (2014). Temporal control and compensation for perturbed voicing feedback. Journal of the Acoustical Society of America, 135(5), 2986–2994. Mitsuya, T., Munhall, K. G., & Purcell, D. W. (2017). Modulation of auditory‐motor learning in response to formant perturbation as a function of delayed auditory feedback. Journal of the Acoustical Society of America, 141(4), 2758–2767. Muller‐Preuss, P., & Ploog, D. (1981). Inhibition of auditory cortical neurons during phonation. Brain Research, 215(1–2), 61–76. Munhall, K. G., MacDonald, E. N., Byrne, S. K., & Johnsrude, I. (2009). Talkers alter vowel production in response to real‐ time formant perturbation even when instructed not to compensate. Journal of

120  Sensing Speech the Acoustical Society of America, 125(1), 384–390. Newman, R. S. (2003). Using links between speech perception and speech production to evaluate different acoustic metrics: A preliminary report. Journal of the Acoustical Society of America, 113(5), 2850–2860. Nordeen, K. W., & Nordeen, E. J. (2010). Deafening‐induced vocal deterioration in adult songbirds is reversed by disrupting a basal ganglia‐forebrain circuit. Journal of Neuroscience, 30(21), 7392–7400. Oller, D. K., & Eilers, R. E. (1988). The role of audition in infant babbling. Child Development, 59(2), 441–449. Osberger, M. J., & McGarr, N. S. (1982). Speech production characteristics of the hearing impaired. In N. J. Lass (Ed.), Speech and language: Advances in basic research and practice (pp. 221–284). New York: Academic Press. Pardo, J. S., Gibbons, R., Suppes, A., & Krauss, R. M. (2012). Phonetic convergence in college roommates. Journal of Phonetics, 40(1), 190–197. Parks, S. E., Johnson, M., Nowacek, D., & Tyack, P. L. (2011). Individual right whales call louder in increased environmental noise. Biology Letters, 7(1), 33–35. Peebles, D., & Cooper, R. P. (2015). Thirty years after Marr’s vision: Levels of analysis in cognitive science. Topics in Cognitive Science, 7(2), 187–190. Perkell, J. S. (2012). Movement goals and feedback and feedforward control mechanisms in speech production. Journal of Neurolinguistics, 25(5), 382–407. Perkell, J. S., Guenther, F. H., Lane, H., et al. (2004). The distinctness of speakers’ productions of vowel contrasts is related to their discrimination of the contrasts. Journal of the Acoustical Society of America, 116(4), 2338–2344. Petkov, C. I., & Jarvis, E. (2012). Birds, primates, and spoken language origins: Behavioral phenotypes and neurobiological substrates. Frontiers in Evolutionary Neuroscience, 4, 12.

Pfenning, A. R., Hara, E., Whitney, O., et al. (2014). Convergent transcriptional specializations in the brains of humans and song‐learning birds. Science, 346(6215), 1256846. Pick, H. L., Siegel, G. M., Fox, P. W., et al. (1989). Inhibiting the Lombard effect. Journal of the Acoustical Society of America, 85(2), 894–900. Pickering, M. J., & Garrod, S. (2013). An integrated theory of language production and comprehension. Behavioral and Brain Sciences, 36(4), 329–347. Plant, G. (1984). The effects of an acquired profound hearing loss on speech production: A case study. British Journal of Audiology, 18(1), 39–48. Purcell, D. W., & Munhall, K. G. (2006). Compensation following real‐time manipulation of formants in isolated vowels. Journal of the Acoustical Society of America, 119(4), 2288–2297. Scheerer, N. E., Jacobson, D. S., & Jones, J. A. (2016). Sensorimotor learning in children and adults: Exposure to frequency‐altered auditory feedback during speech production. Neuroscience, 314, 106–115. Scheerer, N. E., Jacobson, D. S., & Jones, J. A. (2019). Sensorimotor control of vocal production in early childhood. Journal of Experimental Psychology: General, 149(6), 1071–1077. Scheerer, N. E., Liu, H., & Jones, J. A. (2013). The developmental trajectory of vocal and event‐related potential responses to frequency‐altered auditory feedback. European Journal of Neuroscience, 38(8), 3189–3200. Schmidt, R. A. (1980). Past and future issues in motor programming. Research Quarterly for Exercise and Sport, 51(1), 122–140. Shiller, D. M., Sato, M., Gracco, V. L., & Baum, S. R. (2009). Perceptual recalibration of speech sounds following speech motor learning. Journal of the Acoustical Society of America, 125(2), 1103–1113.

Perceptual Control of Speech  121 Siegel, G. M., Pick, H. L., Olsen, M. G., & Sawin, L. (1976). Auditory feedback on the regulation of vocal intensity of preschool children. Developmental Psychology, 12(3), 255–261. Sinnott, J. M., Stebbins, W. C., & Moody, D. B. (1975). Regulation of voice amplitude by the monkey. Journal of the Acoustical Society of America, 58(2), 412–414. Skipper, J. I., Devlin, J. T., & Lametti, D. R. (2017). The hearing ear is always found close to the speaking tongue: Review of the role of the motor system in speech perception. Brain and Language, 164, 77–105. Sober, S. J., & Brainard, M. S. (2009). Adult birdsong is actively maintained by error correction. Nature Neuroscience, 12(7), 927–931. Sperry, R. W. (1950). Neural basis of the spontaneous optokinetic response produced by visual inversion. Journal of Comparative and Physiological Psychology, 43(6), 482–489. Takahashi, D. Y., Fenley, A. R., & Ghazanfar, A. A. (2016). Early development of turn‐ taking with parents shapes vocal acoustics in infant marmoset monkeys. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1693), 20150370. Terband, H., Van Brenk, F., & van Doornik‐ van der Zee, A. (2014). Auditory feedback perturbation in children with developmental speech sound disorders. Journal of Communication Disorders, 51, 64–77. Thevenin, D. M., Eilers, R. E., Oller, D. K., & Lavoie, L. (1985). Where’s the drift in babbling drift? A cross‐linguistic study. Applied Psycholinguistics, 6(1), 3–15. Tourville, J. A., Reilly, K. J., & Guenther, F. H. (2008). Neural mechanisms underlying auditory feedback control of speech. NeuroImage, 39(3), 1429–1443. Vihman, M. M. (1991). Ontogeny of phonetic gestures: Speech production. In I. Mattingly & M. Studdert‐Kennedy (Eds), Modularity and the motor theory of speech perception: Proceedings of a

conference to honor Alvin M. Liberman (pp. 69–84). Hillsdale, NJ: Lawrence Erlbaum. Vihman, M. M. (1993). Variable paths to early word production. Journal of Phonetics, 21(1–2), 61–82. Vihman, M. M. (1996). Phonological development: The origins of language in the child. Oxford: Blackwell. Villacorta, V. M., Perkell, J. S., & Guenther, F. H. (2007). Sensorimotor adaptation to feedback perturbations of vowel acoustics and its relation to perception. Journal of the Acoustical Society of America, 122(4), 2306–2319. von Holst, E., & Mittelstaedt, H. (1950). Das Reafferenzprinzip. Naturwissenschaften, 37(20), 464–476. Waldstein, R. S. (1990). Effects of postlingual deafness on speech production: Implications for the role of auditory feedback. Journal of the Acoustical Society of America, 88(5), 2099–2114. Whalen, D. H., Levitt, A. G., & Goldstein, L. M. (2007). VOT in the babbling of French‐and English‐learning infants. Journal of Phonetics, 35(3), 341–352. Whalen, D. H., Levitt, A. G., & Wang, Q. (1991). Intonational differences between the reduplicative babbling of French‐ and English‐learning infants. Journal of Child Language, 18(3), 501–516. Yates, A. J. (1963). Recent empirical and theoretical approaches to the experimental manipulation of speech in normal subjects and in stammerers. Behaviour Research and Therapy, 1(2‐4), 95–119. Zarate, J. M., & Zatorre, R. J. (2008). Experience‐dependent neural substrates involved in vocal pitch regulation during singing. NeuroImage, 40(4), 1871–1887. Zimmermann, G., & Rettaliata, P. (1981). Articulatory patterns of an adventitiously deaf speaker: Implications for the role of auditory information in speech production. Journal of Speech, Language, and Hearing Research, 24(2), 169–178.

Part II

Perception of Linguistic Properties

5  Features in Speech Perception and Lexical Access

SHEILA E. BLUMSTEIN
Brown University, United States

One of the goals of speech research has been to characterize the defining properties of speech and to specify the processes and mechanisms used in speech perception and word recognition. A critical part of this research agenda has been to determine the nature of the representations that are used in perceiving speech and in lexical access. However, there is a lack of consensus in the field about the nature of these representations. This has been largely due to evidence showing tremendous variability in the speech signal: there are differences in vocal tract sizes; there is variability in production even within an individual from one utterance to another; speakers have different accents; contextual factors, including vowel quality and phonetic position, affect the ultimate acoustic output; and speech occurs in a noisy channel. This has led researchers to claim that there is a lack of stability in the mapping from acoustic input to phonetic categories (sound segments) and mapping from phonetic categories to the lexicon (words). In this view, there are no invariant or stable acoustic properties corresponding to the phonetic categories of speech, nor is there a one‐to‐one mapping between the representations of phonetic categories and lexical access. As a result, although there is general consensus that phonetic categories (sound segments) are critical units in perception and production, studies of word recognition generally bypass the mapping from the auditory input to phonetic categories (i.e. phonetic segments), and assume that abstract representations of phonetic categories and phonetic segments have been derived in some unspecified manner from the auditory input. Nonetheless, there are some who believe that stable speech representations can be derived from the auditory input. However, there is fundamental disagreement among these researchers about the nature of those representations. In one view,


the stability is inherent in motor or speech gestures; in the other, the stability is inherent in the acoustic properties of the input. In this chapter, we will use behavioral, psychoacoustic, and neural evidence to argue that features (properties of phonetic segments) are basic representational units in speech perception and in lexical access. We will also argue that these features are mapped onto phonetic categories of speech (phonetic segments), and subsequently onto lexical representations; that these features are represented in terms of invariant (stable) acoustic properties; and that, rather than being binary (either present or not), feature representations are graded, providing a mapping by degrees from sounds to words and their meanings during lexical access.

Preliminaries To set the stage for our discussion, it is necessary first to provide a theoretical framework of the functional architecture of the word recognition system. Here, we will briefly specify the various components and stages of processing, identify the proposed representations in each of these components, and describe the nature of the information flow between the components. It is within this framework that we will consider feature representations. As a starting point for the discussion of features as representational units, it is useful to provide motivation and evidence for the theoretical construct of features. We will then turn to the evidence that features are indeed representational units in speech perception and word recognition.

Functional architecture of word recognition It is assumed in nearly all models of word recognition that there are multiple components or stages of processing in the mapping from sound to words. The first stage of processing involves the transformation of the auditory input from the peripheral auditory system into a spectro‐temporal representation based on the extraction of auditory patterns or properties from the acoustic signal. This representation is in turn converted at the next stage to a more abstract phonetic‐phonological representation corresponding to the phonetic categories of speech. The representation units at this stage of processing are considered to include segments and (as we will claim) features. These units then interfaces with the lexical processing system where the segment and feature representations map onto the lexicon (words). Here, a particular lexical entry is ultimately selected from a potential set of lexical candidates or competitors. Each lexical entry in turn activates its lexical semantic network where the meaning of the lexical entry is ultimately contacted. What is critically important is the functional architecture of this hierarchical system. Current models consider that the system is characterized by a distributed, network‐like architecture in which representations at each level of processing are realized as patterns of activation with properties of activation, inhibition, and competition (McClelland & Elman, 1986; McClelland & Rumelhart, 1986; Gaskell

& Marslen-Wilson, 1999). Not only do the dynamic properties of the network influence the degree to which a particular representation (e.g. a feature, a segment, a word) is activated or inhibited, but patterns of activation also spread to other representations that share particular structural properties. It is also assumed that the system is interactive, with information flow being bidirectional; lower levels of representation may influence higher levels, and higher levels may influence lower levels. Thus, there is spreading activation not only within a level of representation (e.g. within the lexical network), but also between and within different levels of representation (e.g. phonological, lexical, and semantic levels). There are several consequences of such a functional architecture, as shown in Figure 5.1. First, there is graded activation throughout the speech-lexical processing system; that is, the extent to which a given representation is activated is a function of the "goodness" of the input. Thus, the activation of a potential candidate is not all-or-none but rather is graded or probabilistic. For example, the activation of a phonetic category such as [k] will be influenced by the extent to which the acoustic-phonetic input matches its representation. It is worth noting that graded activation is more complex than simply the extent to which a particular phonetic attribute matches its representation. Rather, the extent of activation reflects the totality of the acoustic properties giving rise to a particular phonetic category. Thus, the activation of the phonetic feature [voicing] would include the probabilities of voice onset time, burst amplitude and duration, and fundamental frequency, to name a few (see Lisker, 1986). Second, because the system is interactive, activation patterns at one level of processing will influence activation at another level of processing.

[Figure 5.1 appears here: two lexical-network panels, each containing the word nodes CAT, BAT, HAT, DOG, MOUSE, PURR, and SPOON; the phonetic input [kat] feeds the left panel and the poorer exemplar [k*at] feeds the right panel.]

Figure 5.1  Several properties of the functional architecture of auditory word recognition and lexical processing are shown, including graded activation, lexical competition, and interactivity (cascading activation). The left and right panels show how one level of processing influences activation of another downstream from it in an interactive system. The left panel shows the effect of a good phonetic exemplar [kat] on activation of the lexical representation cat and the graded activation of phonological and semantic competitors in its lexical network. Note that spoon, which is neither phonologically nor semantically related, is not activated. The right panel shows the cascading effects of a poor phonetic input [k*at] on the network. There is reduced activation of its lexical representation and even greater reduction of activation of competitors.

For example, a poor acoustic-phonetic exemplar of a phonetic category such as [k] will influence the activation of the lexical representation of a word target such as cat. Third, because of the network properties of the system and the structure of the representations, there is competition between potential candidates at each stage of processing (e.g. within the sound [segment and feature], lexical, and semantic levels of representation). The degree of competition is a function of the extent to which the candidate(s) share(s) properties with the particular target. This influences the time course and patterns of activation of the target and, ultimately, the performance of the network. Multiple competing representations may also influence activation and processing at other stages of processing (Gaskell & Marslen-Wilson, 1999; McClelland, 1979). These properties of the functional architecture of the system – graded activation, interactivity, and competition – influence lexical representations. As we will see, each of these properties provides evidence for features as representational units.
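To make these architectural properties concrete, the following sketch implements a toy interactive-activation update loop in the general spirit of TRACE-style models (McClelland & Elman, 1986). It is a minimal illustration rather than an implementation of any published model: the word set, weights, and update rule are assumptions chosen only to show graded, bounded activation and within-level competition of the kind depicted in Figure 5.1.

    import numpy as np

    words = ["cat", "bat", "hat", "dog", "spoon"]
    # Hypothetical bottom-up support from the phonetic level for the input [kat];
    # a poorer exemplar such as [k*at] would simply lower the first value.
    bottom_up = np.array([0.9, 0.4, 0.4, 0.1, 0.0])

    activation = np.zeros(len(words))
    for _ in range(20):
        inhibition = activation.sum() - activation                        # competition from all other words
        change = 0.3 * bottom_up - 0.2 * inhibition - 0.1 * activation    # input minus competition minus decay
        activation = np.clip(activation + change, 0.0, 1.0)               # graded, bounded activation

    for word, level in sorted(zip(words, activation), key=lambda pair: -pair[1]):
        print(f"{word}: {level:.2f}")

Lowering the first bottom_up value (simulating the poorer exemplar [k*at]) weakens and slows the rise of cat's activation, which is the graded, cascading behavior described above.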

Distinctive features as a theoretical construct The theoretical foundations for features come from linguistics, and in particular from phonology (the study of the sound systems of language). Phonological evidence for features as a theoretical construct is based on consideration of (1) the nature of the phonological inventories across languages; (2) the phonological processes of languages synchronically, that is, at a particular point in time; and (3) the processes by which sound inventories in language change historically, that is, over time. To begin, it is assumed that the sound segments of language are composed of a smaller set of features or parameters of sound (at this point, we will not focus on whether the features are acoustically or articulatorily based, nor will we describe them in detail). These parameters reflect the fact that the phonetic segments (phonetic categories) of language can be further broken down into general classes. These classes are attributes that characterize how a sound may be produced or perceived, including whether the phonetic segment is a consonant or vowel; its manner of articulation (e.g. stop, nasal, fricative); where in the mouth the articulation occurs, for example, at the lips (labial), behind the teeth (alveolar), or at the velum (velar); and the laryngeal state of the production (e.g. voiced or voiceless). Thus, segments (phonetic categories) that share features are closer in articulatory or acoustic space than those that do not. As we shall see, this has ramifications for the performance of listeners in the perception of speech. The basic notion that underlies the study of phonology is that sound inventories of language are not composed of a random selection of potential phonetic categories. Rather, sound segments tend to group into natural classes reflecting shared sets of feature parameters, for example, manner of articulation (obstruent, continuant, nasal), place of articulation (labial, alveolar, palatal, velar), or the state of the glottis (voicing). Speech inventories across languages also tend toward symmetry. For example, a language inventory may have voiced and voiceless stop labial, alveolar, and velar consonants, as in English, but would typically not have

an inventory consisting solely of a voiceless labial, voiced alveolar, and voiceless palatal stop consonant. Synchronically, languages have phonological processes (called morphophonemic rules) in which phonemic changes occur to morphemes (minimal units of sound–meaning relations) when they combine with other morphemes to form words. Such changes are typically systematic changes to particular features. For example, in English, a plural morpheme {s} is realized phonetically as an [s] when it is preceded by a word ending in a voiceless obstruent, for example, book [s]; it is realized as a [z] when it is preceded by a voiced obstruent, for example, dog [z]; and it is realized as the syllable [əz] when it is preceded by a sibilant, for example, horse [əz]. Interestingly, the same morphophonemic processes occur in English for possessives, for example, Dick's book, Doug's book, Gladys' book, and for the third person singular, for example, he kicks/jogs/kisses. Historically, language systems change over time. Again, such changes typically involve changes in features. Moreover, many changes involve changes not to one phonetic segment in the language inventory but to a class of sounds. Two such changes are the Great Vowel Shift and Grimm's law. In the Great Vowel Shift, as Modern English developed from Middle English, features of vowels systematically changed: low vowels became mid-vowels, mid-vowels became high vowels, and high vowels became diphthongs. Grimm's law affected features of consonants as Germanic developed from Proto-Indo-European: voiceless stops became voiceless fricatives (a change in the feature [continuant]); voiced stops became voiceless stops; and voiced aspirated stops became voiced stops or fricatives (both of the latter being changes of laryngeal features). Taken together, linguistic evidence suggests that features are the building blocks for the sounds of language and for characterizing its phonological processes (Jakobson, Fant, & Halle, 1951). That the speaker or hearer has internalized these properties in the daily use of language in knowing and in applying (even if unconsciously) the phonological rules of the language suggests that features are a part of the human cognitive apparatus. The next section will consider this claim by examining evidence that features are representational units in the functional architecture of language, and underlie speech perception and lexical access processes.
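The English plural pattern just described can be stated over features rather than over lists of individual segments. The sketch below is a deliberately simplified illustration: the feature table covers only a handful of stem-final consonants, and the symbols and feature names are assumptions made for the example rather than a full phonological analysis.

    # Toy feature specifications for a few stem-final consonants (illustration only).
    FEATURES = {
        "k": {"voiced": False, "sibilant": False},   # as in "book"
        "g": {"voiced": True,  "sibilant": False},   # as in "dog"
        "s": {"voiced": False, "sibilant": True},    # as in "horse"
        "d": {"voiced": True,  "sibilant": False},   # as in "bead"
    }

    def plural_allomorph(final_segment: str) -> str:
        """Choose the plural ending from two feature checks on the stem-final segment."""
        features = FEATURES[final_segment]
        if features["sibilant"]:
            return "[əz]"                                  # horse -> horse [əz]
        return "[z]" if features["voiced"] else "[s]"      # dog -> dog [z], book -> book [s]

    for segment in ("k", "g", "s"):
        print(segment, "->", plural_allomorph(segment))

The same two feature checks, unchanged, derive the possessive and third person singular forms noted above.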

Feature dimensions Speech perception As described earlier, phonetic segments are composed of a set of features. Comparing the feature composition of segments provides a means of specifying how many features they share. It has been assumed that the more features two segments share, the more similar they are to each other. Behavioral studies support this notion (Bailey & Hahn, 2005). In particular, studies examining ratings of similarity between syllable pairs differing in consonants have shown that listeners rate pairs of syllables that are distinguished by one feature as more similar to each

130  Perception of Linguistic Properties other than stimuli distinguished by two or more features (Greenberg & Jenkins, 1964). Discrimination studies using same/different or two auditory forced choice paradigms show that subjects take longer to discriminate and make more errors for syllable pairs distinguished by one feature (e.g. [p] and [b] differ by voicing or [p] and [t] differ by place of articulation) than stimuli distinguished by several features (e.g. [b] and [t] differ by both voicing and place of articulation) (Wickelgren, 1965; 1966; Wang & Bilger, 1973; Blumstein & Cooper, 1972). Thus, the number of features shared appears to reflect the psychological distance or space between and within phonetic segments. It is the case that not only the number of features but the particular features present also influence speech perception. Thus, differences also emerge in the perception of single feature contrasts. In general, there is a hierarchy of performance, with increasing numbers of errors occurring for place of articulation contrasts compared to voicing contrasts, and the fewest errors occurring for manner of articulation contrasts (Miller & Nicely, 1955; Blumstein & Cooper, 1972). Different neural responses have also been shown for features. Given a set of voiced and voiceless stop consonants and fricatives ([b d f p s t v z] in the context of the vowels [a i u]), neural regions encompassing the dorsal speech pathway have been identified that respond to the features place of articulation and to manner of articulation. In particular, Correia, Jansma, and Bonte (2015) trained a classifier to discriminate labial and alveolar place of articulation in the stop consonant syllables [pa] vs. [ta]. They demonstrated generalization in the same neural areas to fricatives distinguished by the same labial vs. alveolar contrast, [fa] vs. [sa]. Similarly, they showed generalization across manner of articulation. Training on a stop vs. fricative pair, [pa] vs. [fa], generalized to [ta] vs. [sa] (see also Guediche et al., 2018 for similar generalization results for the feature voicing). Aphasic patients show patterns consistent with these findings. In particular, they have more difficulties discriminating minimal pair nonwords and minimal pair words that differ by a single feature compared to stimulus pairs distinguished by two features. Additionally, they have greater difficulties discriminating stimuli distinguished by place of articulation compared to voicing (Blumstein, Baker, & Goodglass, 1977).
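The feature-counting logic behind these similarity and confusability results can be written down directly. In the sketch below, the feature values given to [p], [b], [t], and [d] are standard illustrative specifications (only voicing and place are represented), and the distance measure is simply the number of feature values on which two segments disagree.

    # Illustrative feature specifications for four English stop consonants.
    SEGMENTS = {
        "p": {"voiced": False, "place": "labial"},
        "b": {"voiced": True,  "place": "labial"},
        "t": {"voiced": False, "place": "alveolar"},
        "d": {"voiced": True,  "place": "alveolar"},
    }

    def feature_distance(seg1: str, seg2: str) -> int:
        """Number of features on which two segments differ."""
        f1, f2 = SEGMENTS[seg1], SEGMENTS[seg2]
        return sum(1 for feature in f1 if f1[feature] != f2[feature])

    print(feature_distance("p", "b"))   # 1: voicing only
    print(feature_distance("p", "t"))   # 1: place only
    print(feature_distance("b", "t"))   # 2: voicing and place

On this metric the one-feature pairs [p]–[b] and [p]–[t] are closer in feature space than the two-feature pair [b]–[t], matching the slower, more error-prone discrimination reported for single-feature contrasts.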

Lexical access In a seminal paper, Luce and Pisoni (1998) showed that access to an auditorily presented lexical target is affected by the word’s lexical neighborhood. A target word that is in a dense neighborhood, that is, where a large number of words share phonetic segments with the target word, is more difficult to access than a word in a sparse neighborhood, that is, where a few words share phonetic segments with the target word. These findings support a functional architecture in which the lexicon is a network where words compete for access with each other based on their phonological similarity. The more words are phonologically similar, the greater the competition, and the increased difficulty in ultimately selecting the lexical target word in word recognition.
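The neighborhood computation spelled out in the next paragraph (all words reachable from the target by substituting, adding, or deleting a single segment) can be sketched as follows. The toy lexicon and segment inventory are invented for the example; real calculations run over a phonemically transcribed lexicon and, in Luce and Pisoni's formulation, also weight neighbors by their frequency.

    # Toy phonemically transcribed lexicon (illustration only).
    LEXICON = {("k", "ae", "t"), ("b", "ae", "t"), ("h", "ae", "t"), ("k", "ae", "p"),
               ("k", "ae", "t", "s"), ("ae", "t"), ("s", "p", "u", "n")}
    SEGMENTS = {"k", "b", "h", "t", "p", "s", "n", "u", "ae"}

    def neighbors(word):
        """All forms one segment substitution, addition, or deletion away from word."""
        forms = set()
        for i in range(len(word)):
            forms.add(word[:i] + word[i + 1:])                   # deletion
            for seg in SEGMENTS:
                forms.add(word[:i] + (seg,) + word[i + 1:])      # substitution
        for i in range(len(word) + 1):
            for seg in SEGMENTS:
                forms.add(word[:i] + (seg,) + word[i:])          # addition
        forms.discard(word)
        return forms

    target = ("k", "ae", "t")                                    # "cat"
    print(len(neighbors(target) & LEXICON))                      # 5: bat, hat, cap, cats, at

In this toy lexicon cat sits in a denser neighborhood than spoon, and on the account above it is therefore the harder of the two to recognize.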

Features in Speech Perception and Lexical Access  131 Density is determined by computing the number of words that result from substituting, adding, or deleting a segment in the target word. Thus, these findings in themselves do not speak to whether features are a part of the lexical representation of words. However, several studies (Luce,  1986; Goldinger, Luce, & Pisoni, 1989; Luce et al., 2000) examined word recognition by investigating priming for word pairs that were maximally similar or maximally dissimilar phonologically. The metric used to determine similarity was based on subject judgments of the degree of similarity between individual consonants and vowels in the target and other words in the lexicon. Although the number of feature differences was not expressly determined, results showed that maximally similar prime target pairs resulted in slowed reaction times compared to maximally dissimilar word pairs. For example, a prime‐target pair that was maximally similar, for example, fawn [fɔn] and thumb [θəm] (see Luce et al., 2000) produced slowed reaction time latencies in a shadowing task compared to a prime‐target pair that was maximally dissimilar, for example, cheat and thumb. Comparing fawn and thumb, both consonant contrasts [f]–[θ] and [n]–[m] are distinguished by place of articulation, and [ɔ] and [ə] are distinguished by vowel height. In contrast, comparing cheat and thumb, [č]–[θ] are distinguished by manner and place of articulation, [t]–[m] are distinguished by manner, place, and voicing, and [i] and [ə] are distinguished by vowel height, tenseness, and frontness. Thus, in this case, the number of features shared (or the minimal number of features distinguished) appears to increase phonological competition between prime and target stimulus pairs. Perhaps more compelling is evidence that shows that features play a role not only in accessing the lexical (phonemic‐phonetic) representation of words but also in influencing access to the meaning of words. In particular, the magnitude of semantic priming in a lexical decision task is influenced by the feature distance between a nonword and a prime real word that is semantically related to a target (Connine, Blasko, & Titone, 1993; Milberg, Blumstein, & Dworetzky, 1988). For example, nonwords that are distinguished from a real word by a single feature, for example, gat is distinguished from cat by the feature voicing, show a greater magnitude of priming for semantically related words, for example, dog, than nonword primes that are distinguished by two or more features, for example, wat (Milberg, Blumstein, & Dworetzky, 1988). Using the visual world paradigm, Apfelbaum, Blumstein, and McMurray (2011) showed that greater semantic priming occurred for words semantically related to a visual target as a function of the lexical density of the target. Greater semantic priming occurred for low‐density compared to high‐density word targets presumably because the greater number of competitor words in high‐density neighborhoods ultimately reduced the activation of the lexical semantic network of the target word, resulting in less semantic priming compared to target words from low‐density neighborhoods. It is always possible that feature effects for adults could reflect learning and experience with the language rather than fundamental, intrinsic properties of language. However, compelling evidence that features serve as the building blocks for phonological and ultimately word representations comes from the developmental literature. 
In an auditory word discrimination task, Gerken, Murphy, and Aslin (1995) showed that three‐ to four‐year‐old children were sensitive to the

132  Perception of Linguistic Properties degree of mismatch between a target word and a nonword that varied in the extent of feature overlap. Either the stimuli differed by a single feature segment or by two feature segments. Such graded sensitivity to feature attributes in accessing words was shown in even younger children. Using a preference looking paradigm, White and Morgan (2008) presented 19‐month‐old toddlers with a visual presentation of two objects corresponding to a familiar word or an unfamiliar word. When an auditory stimulus was presented that named the familiar object or was one‐ (place), two‐ (place and voicing), or three‐feature (place, voicing, and manner) mispronunciations from the initial consonant of the familiar object, the toddlers showed, as do adults (White et al., 2013), graded sensitivity to the degree of mismatch, with progressively fewer looks to the familiar object as the feature distance between the correctly named familiar object and the mispronounced stimuli increased. Taken together, these findings provide further evidence that (1) features are used in mapping from sounds to words; (2) importantly, even nonwords access the lexicon, with activation of a lexical entry a function of the number of overlapping features shared; and (3) the activation of a lexical entry is graded, the extent of the activation being a function of the goodness of fit between the auditory input and its lexical phonological (feature) representation; and (4) the degree to which a lexical representation is activated has a cascading effect on the degree of activation of its lexical semantic network.

Features: Binary or graded In linguistic theory, features are binary; that is, they are either present (+) or absent (−). For example, the stop consonant [d] would be marked as [−nasal] to contrast it from the nasal consonant [n] which is marked [+nasal].The preceding section on “Feature dimensions” in speech perception and lexical access is consistent with this view. However, it has long been known that membership in phonetic categories is not all‐or‐none. Rather, there is a range of values associated with parameters or features of speech with some values appearing to be more representative than others. On the face of it, it would suggest then that, in speech perception, features may be graded. What is critical is whether such graded representations influence not only speech‐perception processes but also lexical access processes. We now turn to this question. As we shall see, experimental evidence challenges the view that features are binary and rather supports the view that features are graded. This has consequences not only for speech‐perception processes but also for the mapping from phonetic category representations to lexical representations and for ultimately accessing the lexical semantic network.

Speech perception Asking participants to categorize a continuum of speech stimuli varying in equal acoustic steps from one phonetic category to another results in a categorical‐like function, for example, [d] to [t] varying in 10 millisecond (ms) voice‐onset time

Features in Speech Perception and Lexical Access  133 steps. There is a range of stimuli consistently categorized as [d] and a range of stimuli consistently categorized as [t], with one or two stimuli at the edges of or on the boundary between the two categories less consistently categorized. Nonetheless, it turns out that not all members within a phonetic category are perceived equally. Using more sensitive measures than phonetic categorization including reaction time (Pisoni & Tash,  1974) and judgments of category goodness (Miller,  1997; Iverson & Kuhl, 1996), speech‐perception studies have shown that listener reaction times are slower in identifying a stimulus on a continuum, and their judgments of category goodness are reduced as stimuli approach the phonetic category boundary. Variations in task requirements support the robustness of these gradient effects (e.g. Carney, Widin, & Viemeister, 1977). Thus, phonetic categories are graded and have an internal structure to them (see Miller, 1997). In this sense, categories are not truly binary representations that are either present or absent, but rather some exemplars of a category are better representations of the category than others. Such findings support a functional architecture in which the degree of activation of a representation is itself graded and influences, as well, the degree of activation of potential competitors. As a stimulus approaches the phonetic category boundary, its activation decreases and there is a concomitant increase in the activation, and hence the extent of competition with the contrasting phonetic category ­representation. For example, assume a [da]–[ta] continuum ranging in 10 ms steps from 0–40 ms voice onset time (VOT) with a category boundary of 20 ms. As described earlier, there is competition between stimuli that share acoustic properties. Thus, presentation of a 40 ms stimulus (perceived as a [d]) would compete with the representation of the contrasting voiced phonetic category [t]. However, a stimulus with a VOT of 30 ms is a poorer exemplar of the voiceless phonetic category, and thus not only does it activate the phonetic representation of [t] more weakly, but there is an increase in the activation of the contrasting voiced phonetic category [d] (see Blumstein, Myers, & Rissman, 2005). Neural evidence also supports the gradient nature of phonetic categories. Both temporal and frontal areas show graded responses as a function of the goodness of the phonetic category input, with the least activation for the best exemplar of the phonetic category and increased activation as stimuli on a continuum approach the phonetic category boundary (Blumstein, Myers, & Rissman, 2005; Frye et al., 2007; Guenther et al., 2004). Importantly, other neural areas (middle frontal gyrus, supramarginal gyrus) fail to show such graded activation, displaying sensitivity only to between‐phonetic‐category and not to within‐phonetic‐category differences (Joanisse, Zevin, & McCandliss, 2007; Myers et al., 2009). That there is both graded and categorical perception of phonetic categories reflects two critical aspects of speech perception: the need for sensitivity to fine acoustic differences on the one hand, and sensitivity to category membership on the other. We will return to this point in the Conclusion of this chapter.

Lexical access Results showing that the perception of phonetic categories is graded and is influenced by the goodness of the stimulus input raises the question of potential effects

134  Perception of Linguistic Properties on lexical access. One possibility is that phonetic category membership is resolved at the phonetic‐phonological level and the fine acoustic differences leading to graded responses at this level are not mapped onto higher levels of processing. Alternatively, given the functional architecture of the speech and lexical processing system proposed earlier, graded phonetic category representations should influence the mapping from the phonetic‐phonological levels to lexical representations. Results of a series of experiments using various methodologies support this latter hypothesis. Using the visual world paradigm, it has been shown that access to lexical representations is affected by the fine acoustic structure of the auditory input. In particular, it has been shown that looks to a visual target are affected by within‐ phonetic‐category acoustic differences (McMurray, Tanenhaus, & Aslin, 2002; see also McMurray, Tanenhaus, & Aslin, 2009). In the 2002 study of McMurray and colleagues, eye movements were measured as participants identified a named target word (using a mouse click) from an array of four pictures. The pictures consisted of a target word whose name began with a labial stop consonant, for example pear, a minimal pair of the target bear, and two phonologically unrelated words, for example lamp and ship. The auditorily presented names varied along a [b]–[p] VOT continuum ranging from 0–40 ms in 5 ms steps. Results showed graded responses with increasing looks to the competitor minimal pair (i.e. bear) as the acoustic‐phonetic input (a VOT variant of [p]) approached the phonetic boundary. These findings show that lexical access is indeed graded. Perhaps stronger evidence of the effects of within‐phonetic‐category differences on access to higher levels of processing comes from studies showing that within‐ phonetic‐category effects not only influence access to lexical representations but also cascade to the lexical semantic network. Examining semantic priming in a lexical decision task, Andruski Blumstein, and Burton (1994) presented prime words semantically related to a target stimulus in which the initial voiceless stop consonant of the prime was an exemplar stimulus (spoken naturally and acoustically unmodified) or it was a poorer exemplar of the voiceless stop phonetic category (the VOT was reduced by one third or two thirds). Shortening the VOT of the stimuli rendered them closer to the voiced phonetic category boundary. Importantly, pilot work showed that stimuli presented alone were perceived correctly as beginning with voiceless stop consonants. However, the reduction of the VOT for the acoustically modified stimuli resulted in a reduction in the magnitude of semantic priming relative to the unmodified prime stimuli, particularly for the primes reduced by two thirds. In a later study, Misiurski and colleagues (2005) showed that, in addition to the effects described of reduced semantic priming when an initial voiceless stop consonant of a prime word is shortened, that is, reduced priming for t*ime‐clock compared to time‐clock, mediated semantic priming emerged for the minimal pair competitor, that is, t*ime primed penny via dime. Not surprisingly, the magnitude of the mediated semantic priming was less than that obtained for dime–penny. These mediated priming results provide further evidence for the cascading effects of the acoustic input on access to the lexical semantic network. In particular, it shows that

the acoustic modification of a prime stimulus not only influences the activation of the lexical semantic network of its semantically related target, but also partially activates the lexical representation of the contrasting voiced minimal pair competitor and subsequently its lexical semantic network.

Feature representations: Articulatory or acoustic The motor theory of speech perception We turn now to an unresolved question: What is the nature of feature representations? The crux of the problem turns on variability in the phonetic input. As indicated at the beginning of this chapter, there are many sources of variability that affect and influence the ultimate speech input that the listener receives. The question is whether, despite this variability, there are patterns (articulatory or acoustic) that provide a stable mapping from acoustic input to features and ultimately phonetic categories. At this point, no one has solved this invariance problem, that is, no one has solved the transformation of a variable input to a constant feature or phonetic category representation. Even if one were to assume that lexical representations are episodic, containing fine structure acoustic differences that are used by the listener, as has been proposed by Goldinger (1998) and others, such a view still begs the question. It does not elucidate the nature of the mapping from input to sublexical or lexical representations and thus fails to provide an explanation for how the listener knows that a given stimulus belongs to one phonetic category and not another; that is, what property of the signal tells the listener that the input maps onto the lexical representation of pear and not bear or that the initial consonant is a variant of [p] and not [b]. The pioneering research of Haskins Laboratories in the 1950s tried to solve the invariance problem. It is important to understand the historical context in which this research was conducted. At that time, state‐of‐the‐art speech technology consisted of the sound spectrograph and the pattern playback (see Cooper, 1955; Koenig, Dunn, & Lacy, 1946). The sound spectrograph provided a visual graph of the Fourier transform of the speech input, with time represented on the abscissa, frequency represented on the ordinate, and amplitude represented by the darkness of the various frequency bands. The pattern playback converted the visual pattern of a representation of the sound spectrogram to an auditory output (see Studdert‐ Kennedy & Whalen, 1999, for a review). Thus, examining the patterns of speech derived from sound spectrograms, it was possible to make hypotheses about particular portions of the signal or cues corresponding to particular features of sounds or segments (phonetic categories). Using the pattern playback, these potential cues were then systematically varied and presented to listeners for their perception. Results reported in their seminal paper (Liberman et  al.,  1967) showed clearly that phonetic segments occur in context, and cannot be defined as separate “beads on a string.” Indeed, the context ultimately influences the acoustic manifestation of the particular phonetic segment, resulting in acoustic differences for

136  Perception of Linguistic Properties the same features of sound. For example, sound spectrograms of stop consonants show a burst and formant transitions, which potentially serve as cues to place of articulation in stop consonants. Varying the onset frequency of the burst or the second formant transition and presenting them to listeners provided a means of systematically assessing the perceptual role these cues played. Results showed there was no systematic relation between a particular burst frequency or onset frequency of the second formant transition to place of articulation in stop consonants (Liberman, Delattre, & Cooper, 1952). For example, there was no constant burst frequency or formant transition onset that signaled [d] in the syllables [di] and [du]. Rather, the acoustic manifestation of sound segments (and the features that underlie them) is influenced by the acoustic parameters of the phonetic contexts in which they occur. Liberman et al. (1967) recognized that listener judgments were still consistent. What then allowed for the various acoustic patterns to be realized as the same consonant? They proposed the motor theory of speech perception, hypothesizing that what provided the stability in the variable acoustic input was the production of the sounds or the articulatory gestures giving rise to them (for reviews see Galantucci, Fowler, & Turvey, 2006; Liberman et  al.,  1967; Fowler,  1986; Fowler, Shankweiler, & Studdert‐Kennedy, 2016). In this view, despite their acoustic ­variability, constant articulatory gestures provided phonetic category stability – [p] and [b] are both produced with the stop closure at the lips, [t] and [d] with the stop closure at the alveolar ridge, and [k] and [g] are produced with the closure at the velum. It is worth noting, that even the motor theory fails to provide the nature of the mapping from the variable acoustic input to a particular articulatory gesture. That is, it is not specified what it is in the acoustic signal that allows for the transformation of the input to a particular motor pattern. In this sense, the motor theory of speech perception did not solve the invariance problem. That said, there are many proponents of the motor (gesture) theory of speech perception (see Fowler, Shankweiler, & Studdert‐Kennedy, 2016, for a review), and recently evidence from cognitive neuroscience has been used to provide support (see D’Ausilio, Craighero, & Fadiga, 2012 for a review). In particular, it has been shown in a number of studies that the perception of speech not only activates auditory areas of the brain (temporal structures) but also, under some circumstances, activates motor areas involved in speech production. For example, using fMRI, activation has been shown in motor areas during passive listening to syllables, areas activated in producing these syllables (Wilson et al., 2004), and greater activation has been shown in these areas for nonnative speech sounds compared to native speech sounds (Wilson & Iacoboni,  2006). Transmagnetic stimulation (TMS) studies showed a change in the perception of labial stimuli near the phonetic boundary of a labial– alveolar continuum after stimulation of motor areas involving the lips; no perceptual changes occurred for continua not involving labial stimuli, for example, alveolar–velar continua (Mottonen & Watkins,  2009; Fadiga et  al.,  2002). Nonetheless, activation of motor areas during speech perception in both the fMRI and TMS studies appears to occur under challenging listening conditions such as

Features in Speech Perception and Lexical Access  137 when the acoustic stimuli are of poor quality, for example, when sounds are not easily mapped to a native‐language inventory or during the perception of boundary stimuli, but not when the stimuli are good exemplars. These findings raise the possibility that frontal areas are recruited when additional neural resources are necessary, and thus are not core areas recruited in the perception of speech (see Schomers & Pulvermüller, 2016, for a contrasting view). It would not be surprising to see activation of motor areas during the perception of speech, as listeners are also speakers, and speakers perceive the acoustic realization of their productions. That there is a neural circuit bridging temporal and motor areas then would be expected (see Hickok & Poeppel, 2007). However, what needs to be shown in support of the motor (gesture) theory of speech is that the patterns of speech‐perception representations are motoric or gestural. It is, of course, possible that there are gestural as well as acoustic representations corresponding to the features of speech. However, at the minimum, to support the motor theory of speech, gestures need to be identified that provide a perceptual standard for mapping from auditory input to phonetic feature. As we will see shortly, the evidence to date does not support such a view (for a broad discussion challenging the motor theory of speech perception see Lotto, Hickok, & Holt, 2009).

The acoustic theory of speech perception Despite the variability in the speech input, there is the possibility that there are more generalized acoustic patterns that can be derived that are common to features of sounds, patterns that override the fine acoustic detail derived from analysis of individual components of the signal such as burst frequency or frequency of the onset of formant transitions. The question is where in the signal such properties might reside and how they can be identified. One hypothesis that became the focus of the renewed search for invariant acoustic cues was that more generalized patterns could be derived at points where there are rapid changes in the spectrum. These landmarks serve as points of ­stability between transitions from one articulatory state to another (Stevens, 2002). Once the landmarks were identified, it was necessary to identify the acoustic parameters that provided stable patterns associated with features and ultimately phonetic categories. To this end, research focused on the spectral patterns that emerged from the integration of amplitude and frequency parameters within a window of analysis rather than considering portions of the speech signal that had been identified on the sound spectrogram and considered to be distinct acoustic events. The first features examined in this way were place of articulation in stop consonants, the features that failed to show invariance in the Haskins research. In a series of papers, Stevens and Blumstein explored whether the shape of the spectrum in the 25‐odd ms at consonant release could independently characterize labial, alveolar, and velar stop consonants across speakers and vowel contexts. Here, labial consonants were defined in terms of a flat or falling spectral shape, alveolar consonants were defined in terms of a rising spectral shape, and velar consonants were defined in terms of a compact spectral shape with one peak

138  Perception of Linguistic Properties dominating the spectrum (Stevens & Blumstein, 1978). Results of acoustic analysis of productions by six speakers of the consonants [p t k b d g] produced in the context of the vowels [i e a o u] classified the place of articulation of the stimuli with 85 percent accuracy (Blumstein & Stevens  1979). Follow‐up perceptual experiments showed that listeners could identify place of articulation (as well as the following vowel) with presentation of only the first 20 ms at the onset of the burst, indicating that they were sensitive to the spectral shape at stop consonant onset (Blumstein & Stevens, 1980; see also Chang & Blumstein, 1981). Invariant properties were identified for additional phonetic features, giving rise to a theory of acoustic invariance hypothesizing that, despite the variability in the acoustic input, there were more generalized patterns that provided the listener with a stable framework for the perception of the phonetic features of language (Blumstein & Stevens,  1981; Stevens & Blumstein,  1981; see also Kewley‐Port, 1983; Nossair & Zahorian, 1991). These features include those signifying manner of articulation for [stops], [glides], [nasals], and [fricatives] (Kurowski & Blumstein,  1984; Mack & Blumstein,  1983; Shinn & Blumstein,  1984; Stevens & Blumstein,  1981). Additionally, research has shown that if the speech auditory input were normalized for speaker and vowel context, generalized patterns can be identified for both stop (Johnson, Reidy, & Edwards, 2018) and fricative place of articulation (McMurray & Jongman, 2011). A new approach to the question of invariance provides perhaps the strongest support for the notion that listeners extract global invariant acoustic properties in processing the phonetic categories of speech. Pioneering work from the lab of Eddie Chang is examining neural responses to speech using electrocorticography (ECoG). Here, intracranial electrophysiological recordings are made in patients with intractable seizures, with the goal of identifying the site of seizure activity. A grid of electrodes is placed on the surface of the brain and neural activity is recorded directly, with both good spatial and temporal resolution. In a recent study (Mesgarani et  al.,  2014), six participants listened to 500 natural speech sentences produced by 400 speakers. The sentences were segmented into sequences of phonemes. Results showed, not surprisingly, responses to speech in the posterior and mid‐superior temporal gyrus, consistent with fMRI studies showing that the perception of speech recruits temporal neural structures adjacent to the ­primary auditory areas (for reviews see Price,  2012; Scott & Johnsrude,  2003). Critically important were the patterns of activity that emerged. In particular, Mesgarani et  al. (2014) showed selective responses of individual electrodes to features defining natural classes in English. That is, selective responses occurred for stop consonants including [p t k b d g], fricative consonants [s z f š ϴ], and nasals [m n ƞ]. That these patterns emerged across speakers, vowel, and phonetic contexts indicate that the inherent variability in the speech stream was essentially averaged out, leaving generalized patterns common to those features representing manner of articulation (see also Arsenault & Buchsbaum,  2015). It is unclear whether the patterns extracted are the same as those identified in the Stevens and Blumstein studies described above. 
However, what is clear is that the basic representational units corresponding to these features are acoustic in nature.
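The gross onset‐spectrum properties described earlier in this section (a flat or falling spectrum for labials, a rising spectrum for alveolars, a compact spectrum for velars; Stevens & Blumstein, 1978; Blumstein & Stevens, 1979) can be given a rough computational illustration. The Python sketch below computes a windowed spectrum over roughly the first 25 ms after a stop burst and derives two crude summary measures, spectral tilt and peak compactness. It is only a schematic stand‐in for the onset‐spectrum templates used in that research; the window length, frequency band, and interpretive thresholds are illustrative assumptions, not values taken from those studies.

```python
import numpy as np

def onset_spectrum_summary(burst_samples, sr, win_ms=25.6):
    """Summarize the gross spectral shape at stop-consonant release.

    burst_samples: 1-D waveform starting at the burst; sr: sampling rate (Hz).
    Returns (tilt_db_per_khz, compactness): tilt is the slope of a line fit to
    the dB spectrum between 0 and 3.5 kHz; compactness is the proportion of
    spectral energy within +/-500 Hz of the largest spectral peak.
    """
    n = int(sr * win_ms / 1000)
    frame = burst_samples[:n] * np.hamming(n)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)

    band = (freqs > 0) & (freqs <= 3500)
    db = 20 * np.log10(spectrum[band] + 1e-12)
    tilt = np.polyfit(freqs[band] / 1000.0, db, 1)[0]   # dB per kHz

    peak_f = freqs[np.argmax(spectrum)]
    near_peak = np.abs(freqs - peak_f) <= 500
    compactness = np.sum(spectrum[near_peak] ** 2) / np.sum(spectrum ** 2)
    return tilt, compactness

# Crude reading of the two measures (thresholds are arbitrary examples):
#   rising tilt (> 0 dB/kHz)         -> alveolar-like, diffuse-rising shape
#   flat or falling tilt (<= 0)      -> labial-like, diffuse-falling shape
#   high compactness (e.g. > 0.5)    -> velar-like, compact shape
```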

That responses in the temporal lobe are acoustic in nature is not surprising. A more interesting question is: What are the patterns of response to speech perception in frontal areas? As discussed earlier, some fMRI and TMS studies showed frontal activation during the perception of speech. However, what is not clear is what the neural patterns of those responses were; that is, did they reflect sensitivity to the acoustic parameters of the signal or to the articulatory gestures giving rise to the acoustic patterns? In another notable study out of the Chang lab, Cheung and colleagues (2016) used ECoG to examine neural responses to speech perception in superior temporal gyrus sites, as they did in Mesgarani et al. (2014). Critically, they also examined neural responses to both speech perception and speech production in frontal areas, in particular in the motor cortex – the ventral half of lateral sensorimotor cortex (vSMC). Nine participants listened to and produced the consonant–vowel (CV) syllables [pa ta ka ba da ga sa ša] in separate tasks, and in a third task passively listened to portions of a natural speech corpus (TIMIT) consisting of 499 sentences spoken by a total of 400 male and female speakers. As expected, for production, responses in the vSMC reflected the somatotopic representation of the motor cortex, with distinct clustering as a function of place of articulation: separate clusters emerged reflecting the different motor gestures used to produce labial, alveolar, and velar consonants. Results of the passive listening task replicated Mesgarani et al.'s (2014) findings, showing selective responses in the superior temporal gyrus (STG) that clustered as a function of manner of articulation; that is, the stop consonants clustered together and the fricative consonants clustered together. Of importance, a similar pattern emerged in the vSMC: neural activity clustered in terms of manner of articulation, although interestingly the consonants within each cluster did not group as closely as the clusters that emerged in the STG. Thus, frontal areas are indeed activated in speech perception; however, this activation appears to correspond to the acoustic representation of speech extracted from the auditory input rather than being a transformation of the auditory input into articulatory, motor, or gestural representations. These neural findings suggest that the perceptual representation of features, even in motor areas, is acoustic or auditory in nature, not articulatory or motor. The results are preliminary but provocative. Additional research is required to examine neural responses in frontal areas to auditory speech input for the full consonant inventory across vowel contexts, phonetic position, and speakers. The question is: When consonant, vowel, or speaker variability is increased in the auditory input, will neural responses in frontal areas pattern with spectral and temporal features or with gestural features?

Conclusion

This chapter has examined the role of features in speech perception and auditory word recognition. As described, while features have generally been considered

140  Perception of Linguistic Properties representational units in speech perception, there has been a lack of consensus about the nature of the feature representations themselves. In our view, one of the major conflicts in current theories of speech has its roots in whether researchers have focused on identifying the attributes that define the phonetic categories of speech or, alternatively, have focused on characterizing the ways in which contextual factors can influence the boundaries between phonetic categories (see Samuel, 1982). In the former, the emphasis has been on describing the acoustic‐ articulatory structure of phonetic categories in the latter, the emphasis has been on characterizing the ways in which acoustic changes ultimately affect the perception of boundaries between phonetic categories. These different emphases have also resulted in different conclusions. Studies focusing on the boundaries between phonetic segments have documented the ease with which boundary shifts have been obtained consequent to any number of acoustic manipulations, and as such the conclusion has been that there is no stable pattern of acoustic information that corresponds to these categories. Analyses of the acoustic characteristics of speech have produced mixed results. Focusing on individual cues and considering them as distinct events failed to show stable acoustic patterns associated with these cues. In contrast, focusing on the integration of spectral‐temporal properties revealed more generalized patterns or properties which contribute to the identification of a phonetic segment or phonetic feature. So what is the story? Does acoustic invariance obviate variability? Does variability trump invariance? In both cases, we believe not. Both stable acoustic patterns and variability inherent in the speech stream play a critical role in speech perception and word recognition processes. Invariant acoustic patterns corresponding to features allow for stability in perception. As such, features serve as essential building blocks for the speaker‐hearer in processing the sounds of language. They provide the framework for the speaker‐hearer in processing speech and ultimately words, by allowing for acoustically variable manifestations of sound in different phonetic contexts to be realized as one and the same phonetic dimension. In short, they serve as a means of bootstrapping the perceptual system for the critical job of mapping the auditory input not only onto phonetic categories but also onto words. But variability plays a crucial role as well. It allows for graded activation within the language‐processing stream and hence provides the perceptual system with a richness and flexibility in accessing phonetic features, words, and even meanings that would be impossible were variability to be treated as “noise” and not be represented by the listener. Sensitivity to variability allows listeners to recognize differences that are crucial in language communication. For example, retaining fine structure information allows us to recognize the speaker of a message. And variability allows for the establishment and internalization of probability distributions. Presumably, acoustic inputs that are infrequently produced would require more processing and neural resources compared to acoustic inputs that are in the center of a category or that match more closely a word representation. 
As such, both processing and neural resources would be freed up when more frequent features and lexical items occur, and additional resources would be needed for less

frequent occurrences. In this way, the system is not only flexible but also plastic, affording a means for the basic stable structure of speech to be shaped and influenced by experience.

REFERENCES Andruski, J. E., Blumstein, S. E., & Burton, M. (1994). The effect of subphonetic differences on lexical access. Cognition, 52(3), 163–187. Apfelbaum, K. S., Blumstein, S. E., & McMurray, B. (2011). Semantic priming is affected by real‐time phonological competition: Evidence for continuous cascading systems. Psychonomic Bulletin & Review, 18(1), 141–149. Arsenault, J. S., & Buchsbaum, B. R. (2015). Distributed neural representations of phonological features during speech perception. Journal of Neuroscience, 35(2), 634–642. Bailey, T. M., & Hahn, U. (2005). Phoneme similarity and confusability. Journal of Memory and Language, 52(3), 339–362. Blumstein, S. E., Baker, E., & Goodglass, H. (1977). Phonological factors in auditory comprehension in aphasia. Neuropsychologia, 15(1), 19–30. Blumstein, S., & Cooper, W. (1972). Identification versus discrimination of distinctive features in speech perception. Quarterly Journal of Experimental Psychology, 24(2), 207–214. Blumstein, S. E., Myers, E. B., & Rissman, J. (2005). The perception of voice onset time: An fMRI investigation of phonetic category structure. Journal of Cognitive Neuroscience, 17(9), 1353–1366. Blumstein, S. E., & Stevens, K. N. (1979). Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants. Journal of the Acoustical Society of America, 66(4), 1001–1017.

Blumstein, S. E., & Stevens, K. N. (1980). Perceptual invariance and onset spectra for stop consonants in different vowel environments. Journal of the Acoustical Society of America, 67(2), 648–662. Blumstein, S. E., & Stevens, K. N. (1981). Phonetic features and acoustic invariance in speech. Cognition, 10(1), 25–32. Carney, A. E., Widin, G. P., & Viemeister, N. F. (1977). Noncategorical perception of stop consonants differing in VOT. Journal of the Acoustical Society of America, 62(4), 961–970. Chang, S., & Blumstein, S. E. (1981). The role of onsets in perception of stop place of articulation: Effects of spectral and temporal discontinuity. Journal of the Acoustical Society of America, 70(1), 39–44. Cheung, C., Hamilton, L. S., Johnson, K., & Chang, E. F. (2016). The auditory representation of speech sounds in human motor cortex. eLife, 5, e12577. Connine, C. M., Blasko, D. G., & Titone, D. (1993). Do the beginnings of spoken words have a special status in auditory word recognition? Journal of Memory and Language, 32(2), 193–210. Cooper, Franklin S. (1955). Some instrumental aids to research on speech. In Report of the Fourth Annual Round Table Meeting on Linguistics and Language Teaching (pp. 46–53). Washington, DC: Institute of Languages and Linguistics, Georgetown University. Correia, J. M., Jansma, B. M. B., & Bonte, M. (2015). Decoding articulatory features from fMRI responses in dorsal speech regions. Journal of Neuroscience, 35(45), 15015–15025.

142  Perception of Linguistic Properties D’Ausilio, A., Craighero, L., & Fadiga, L. (2012). The contribution of the frontal lobe to the perception of speech. Journal of Neurolinguistics, 25(5), 328–335. Fadiga, L., Craighero, L., Buccino, G., & Rizzolatti, G. (2002). Speech listening specifically modulates the excitability of tongue muscles: A TMS study. European Journal of Neuroscience, 15, 399–402. Fowler, C. A. (1986). An event approach to the study of speech perception from a direct‐realist perspective. Journal of Phonetics, 14(1), 3–28. Fowler, C. A., Shankweiler, D., & Studdert‐ Kennedy, M. (2016). Perception of the speech code revisited: Speech is alphabetic after all. Psychological Review, 123(2), 125–150. Frye, R. E., Fisher, J. M., Coty, A., et al. (2007). Linear coding of voice onset time. Journal of Cognitive Neuroscience, 19(9), 1476–1487. Galantucci, B., Fowler, C. A., & Turvey, M. T. (2006). The motor theory of speech perception reviewed. Psychonomic Bulletin & Review, 13(3), 361–377. Gaskell, M. G., & Marslen‐Wilson, W. D. (1999). Ambiguity, competition, and blending in spoken word recognition. Cognitive Science, 23, 439–462. Gerken, L., Murphy, W. D., & Aslin, R. N. (1995). Three‐and four‐year‐olds’ perceptual confusions for spoken words. Attention, Perception, & Psychophysics, 57(4), 475–486. Goldinger, S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251–279. Goldinger, S. D., Luce, P. A., & Pisoni, D. B. (1989). Priming lexical neighbors of spoken words: Effects of competition and inhibition. Journal of Memory and Language, 28, 501–518. Greenberg, J. H., & Jenkins, J. J. (1964). Studies in the psychological correlates of the sound system of American English. Word, 20(2), 157–177. Guediche, S., Minicucci, D., Shih, P., & Blumstein, S. E. (2018). The neural

system is sensitive to abstract properties of speech. Unpublished paper. Guenther, F. H., Nieto‐Castanon, A., Ghosh, S. S., & Tourville, J. A. (2004). Representation of sound categories in auditory cortical maps. Journal of Speech, Language, and Hearing Research, 47(1), 46–57. Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393–402. Iverson, P., & Kuhl, P. K. (1996). Influences of phonetic identification and category goodness on American listeners’ perception of/r/and/l/. Journal of the Acoustical Society of America, 99(2), 1130–1140. Jakobson, R., Fant, C. G., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, MA: MIT Press. Joanisse, M. F., Zevin, J. D., & McCandliss, B. D. (2007). Brain mechanisms implicated in the preattentive categorization of speech sounds revealed using fMRI and a short‐interval habituation trial paradigm. Cerebral Cortex, 17(9), 2084–2093. Johnson, A. A., Reidy, P. F., & Edwards, J. R. (2018). Quantifying robustness of the/t/–/k/contrast using a single, static spectral feature. Journal of the Acoustical Society of America, 144(2), EL105–111. Kewley‐Port, D. (1983). Time‐varying features as correlates of place of articulation in stop consonants. Journal of the Acoustical Society of America, 73(1), 322–335. Koenig, W., Dunn, H. K., & Lacy, L. Y. (1946). The sound spectrograph. Journal of the Acoustical Society of America, 17, 19–49. Kurowski, K., & Blumstein, S. E. (1984). Perceptual integration of the murmur and formant transitions for place of articulation in nasal consonants. Journal of the Acoustical Society of America, 76(2), 383–390.

Features in Speech Perception and Lexical Access  143 Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert‐Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74(6), 431–461. Liberman, A. M., Delattre, P., & Cooper, F. S. (1952). The role of selected stimulus‐ variables in the perception of the unvoiced stop consonants. American Journal of Psychology, 65, 497–516. Lisker, L. (1986). “Voicing” in English: A catalogue of acoustic features signaling/b/versus/p/in trochees. Language and Speech, 29(1), 3–11. Lotto, A. J., Hickok, G. S., & Holt, L. L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Sciences, 13(3), 110–114. Luce, P. A. (1986). Neighborhoods of words in the mental lexicon. Unpublished doctoral dissertation. Indiana University, Bloomington. Luce, P. A., Goldinger, S. D., Auer, E. T., & Vitevitch, M. S. (2000). Phonetic priming, neighborhood activation, and PARSYN. Attention, Perception, & Psychophysics, 62(3), 615–625. Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19, 1–36. Mack, M., & Blumstein, S. E. (1983). Further evidence of acoustic invariance in speech production: The stop–glide contrast. Journal of the Acoustical Society of America, 73(5), 1739–1750. McClelland, J. L. (1979). On the time relations of mental processes: An examination of systems of processes in cascade. Psychological Review, 86(4), 287–330. McClelland, J. L., & Elman, J. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86. McClelland, J. L., & Rumelhart, D. (1986). Parallel distributed processing: Vol. 2. Psychological and biological models. Cambridge, MA: MIT Press. McMurray, B., & Jongman, A. (2011). What information is necessary for speech

categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychological Review, 118(2), 219–246. McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within‐ category phonetic variation on lexical access. Cognition, 86(2), B33–B42. McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2009). Within‐category VOT affects recovery from “lexical” garden‐paths: Evidence against phoneme‐level inhibition. Journal of Memory and Language, 60(1), 65–91. Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006–1010. Milberg, W., Blumstein, S., & Dworetzky, B. (1988). Phonological factors in lexical access: Evidence from an auditory lexical decision task. Bulletin of the Psychonomic Society, 26(4), 305–308. Miller, G. A., & Nicely, P. E. (1955). An analysis of perceptual confusions among some English consonants. Journal of the Acoustical Society of America, 27(2), 338–352. Miller, J. L. (1997). Internal structure of phonetic categories. Language and Cognitive Processes, 12, 865–869. Misiurski, C., Blumstein, S. E., Rissman, J., & Berman, D. (2005). The role of lexical competition and acoustic–phonetic structure in lexical processing: Evidence from normal subjects and aphasic patients. Brain and Language, 93(1), 64–78. Mottonen, R., & Watkins, K. E. (2009). Motor representations of articulators contribute to categorical perception of speech sounds. Journal of Neuroscience, 29, 9819–9825. Myers, E. B., Blumstein, S. E., Walsh, E., & Eliassen, J. (2009). Inferior frontal regions underlie the perception of phonetic category invariance. Psychological Science, 20(7), 895–903. Nossair, Z. B., & Zahorian, S. A. (1991). Dynamic spectral shape features as

144  Perception of Linguistic Properties acoustic correlates for initial stop consonants. Journal of the Acoustical Society of America, 89(6), 2978–2991. Pisoni, D. B., & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception & Psychophysics, 15, 289–290. Price, C. J. (2012). A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. Neuroimage, 62(2), 816–847. Samuel, A. G. (1982). Phonetic prototypes. Attention, Perception, & Psychophysics, 31(4), 307–314. Schomers, M. R., & Pulvermüller, F. (2016). Is the sensorimotor cortex relevant for speech perception and understanding? An integrative view. Frontiers in Human Neuroscience, 10, art. 435. Scott, S. K., & Johnsrude, I. S. (2003). The neuroanatomical and functional organization of speech perception. Trends in Neurosciences, 26(2), 100–107. Shinn, P., & Blumstein, S. E. (1984). On the role of the amplitude envelope for the perception of [b] and [w]. Journal of the Acoustical Society of America, 75(4), 1243–1252. Stevens, K. N. (2002). Toward a model for lexical access based on acoustic landmarks and distinctive features. Journal of the Acoustical Society of America, 111(4), 1872–1891. Stevens, K. N., & Blumstein, S. E. (1978). Invariant cues for place of articulation in stop consonants. Journal of the Acoustical Society of America, 64(5), 1358–1368. Stevens, K. N., & Blumstein, S. E. (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas & J. L. Miller (Eds.), Perspectives on the study of speech (pp. 1–38). Hillsdale, NJ: Lawrence Erlbaum.

Studdert‐Kennedy, M., & Whalen, D. H. (1999). A brief history of speech perception research in the United States [1989]. In J. J. Ohala, A. J. Bronstein, M. Grazia Busà, et al. (Eds.), A guide to the history of the phonetic sciences in the United States (pp. 21–25). Berkeley: University of California Press. Wang, M. D., & Bilger, R. C. (1973). Consonant confusions in noise: A study of perceptual features. Journal of the Acoustical Society of America, 54(5), 1248–1266. White, K. S., & Morgan, J. L. (2008). Sub‐segmental detail in early lexical representations. Journal of Memory and Language, 59(1), 114–132. White, K. S., Yee, E., Blumstein, S. E., & Morgan, J. L. (2013). Adults show less sensitivity to phonetic detail in unfamiliar words, too. Journal of Memory and Language, 68(4), 362–378. Wickelgren, W. A. (1965). Distinctive features and errors in short‐term memory for English vowels. Journal of the Acoustical Society of America, 38(4), 583–588. Wickelgren, W. A. (1966). Distinctive features and errors in short‐term memory for English consonants. Journal of the Acoustical Society of America, 39(2), 388–398. Wilson, S. M., & Iacoboni, M. (2006). Neural responses to non‐native phonemes varying in producibility: Evidence for the sensorimotor nature of speech perception. Neuroimage, 33(1), 316–325. Wilson, S. M., Saygin, A. P., Sereno, M. I., & Iacoboni, M. (2004). Listening to speech activates motor areas involved in speech production. Nature Neuroscience, 7, 701–702.

6  Speaker Normalization in Speech Perception

KEITH JOHNSON¹ AND MATTHIAS J. SJERPS²
¹ University of California, Berkeley, United States
² Dutch Inspectorate of Education, The Netherlands

Introduction

Talkers differ from each other in a great many ways. Some of the difference is in the choice of linguistic variants for particular words, as immortalized in the song by George and Ira Gershwin "Let's call the whole thing off":

You say either [iðɚ] and I say either [aɪðɚ],
You say neither [niðɚ] and I say neither [naɪðɚ]
Either [iðɚ], either [aɪðɚ], neither [niðɚ], neither [naɪðɚ]
Let's call the whole thing off.
You like potato and I like potahto
You like tomato and I like tomahto
Potato, potahto, tomato, tomahto.
Let's call the whole thing off.

Listeners have experienced different pronunciations of words, and many of the variants that we know are tinged with social or personal nuance. This "multiple‐listing" notion, that listeners store more than one variant of each word in memory, is the dominant hypothesis among sociolinguists regarding the cognitive representation of social phonetic variation (Thomas, 2011), and has been proposed as a way to account for the listener's ability to "normalize" for talker differences in speech perception (Johnson, 1997). Beyond having experience and associations with particular variants of words, though, listeners are tolerant of unfamiliar variation. Unexperienced variants can


nonetheless be recognized. It was common to experience this in the early days of text‐to‐speech synthesis, when speech synthesizers could be counted on to pronounce some words in totally novel ways. For example, the hand‐tuned orthography‐to‐pronunciation rules of a synthesizer would incorrectly pronounce San Jose as [sænʤoz], by analogy with hose (Liberman & Church, 1992). Listeners can also be exposed to previously unexperienced variations when listening to an unfamiliar dialect, and, with a little exposure, be able to cope with a new speech pattern. Interestingly, perceptual learning of such variation occurs over as few as three sessions of 40 sentences each (Greenspan, Nusbaum, & Pisoni, 1988). Listeners can use semantic context to guess the identity of a word even though its pronunciation is unfamiliar, and rapidly develop the ability to recognize new pronunciation variants. Both a process involving multiple listing of variants and a top‐down parsing process that relies on perceptual learning can be seen as mechanisms to lend coherence to speech in the face of linguistic and phonetic variation. This chapter considers whether there are aspects of auditory processing that help remove talker differences before the signal enters a multilisting/top‐down guided word‐recognition system. We conclude that the answer is yes in two ways. First, auditory spectral analysis and encoding removes some talker differences; and, second, contrast coding in an auditory or phonetic frame of reference seems to apply before lexical processing begins. However, we will find that these mechanisms are partial, and that there is evidence both from behavioral studies and from neuroimaging that indicates a role for expectation‐guided coherence‐lending mechanisms in speech perception.

Physiological and acoustic differences between talkers

People have different acoustic voice signatures because of individual differences in the physiology of the jaw, tongue, lips, and throat. Beyond these physical talker differences, talkers may also differ in terms of their habits of articulation: these talker differences may be due to differences in dialect or social group, or even idiosyncratic habits (i.e. instances pronounced with the same linguistic variants). Thus, voice can be used in biometric identification (Nott, 2018), and listeners can recognize familiar talkers (Hollien, 2001). The largest acoustic difference between talkers is the difference between men, women, and children. The physiological property that underlies this is the larger size of the larynx, and its lower location in the neck, in men (Fitch & Giedd, 1999). The larger size of the larynx is accompanied by longer and thicker vocal folds in men, and thus the mean rate of vibration of the vocal folds in voiced speech (the fundamental frequency, F0) is lower in men than in women or children. The lower position of the larynx in the neck results in a longer vocal tract (and proportionally longer pharynx), and thus the resonant frequencies of the vocal tract (the vowel formant frequencies F1, F2, F3, F4, . . .) are generally lower in men than in women. There is evidence (Dabbs & Mallinger, 1999; Zimman, 2017)

of a tie between these voice features and testosterone, though this probably depends on the testosterone level (Glaser, York, & Dimitrakakis, 2016). Even within gender, vocal tract length (VTL) is recoverable from the acoustic signal. Lammert and Narayanan (2015) found that a multiple‐regression formula combining the first four vowel formant frequencies can predict the measured length of the vocal tract to within about a centimeter (about 6% error). Their observation is that F4 is more reliable as a source of information about VTL than the lower formants. [See also Reby & McComb's (2003) method of finding vocal tract length from formants.] Pisanski et al. (2014) found that VTL estimates are not particularly well correlated with height (r ≈ 0.3), though listeners do have quite consistent perceptual judgments about talker height, and these judgments are better correlated with VTL than with weight (Smith & Patterson, 2005). Clearly, there are phonetic details that distinguish speakers beyond just their gender and relative size of vocal tract. Among these, voice quality, the pattern of vocal fold vibration, has received some attention (Ferrand, 2002; Harnsberger et al., 2010), as has a possible role of palate shape on vowel acoustics (Johnson, Ladefoged, & Lindau, 1993), while acoustic differences due to nasal morphology (Guilherme et al., 2009; Subramaniam et al., 1998; Maddux et al., 2017) seem to be an unexplored area for fruitful research on talker differences. Much of the research done on talker normalization has focused on understanding how listeners must (it is assumed) map the acoustic properties of speech produced by men and women to talker‐independent linguistic representations. Within‐gender talker variation has not been a point of concern for most theorists, despite the fact that men do differ from other men, and women differ from other women, in terms of vocal tract and vocal folds, and also in terms of individual differences in dentition, palate, voice, and nasal cavity which are not relatable to gender.

The vowel‐normalization problem

In Peterson and Barney's (1952) seminal study of American English vowels (Figure 6.1, upper left panel) there is an impressive degree of overlap in the locations of the vowels in the two‐dimensional F1/F2 vowel space. And so we have a problem to explain: How can listeners correctly identify vowels that overlap so much in the vowel space? It turns out that, in a statistical sense, the answer to this question is to add more acoustic information about the vowel than just the F1 and F2 frequencies (Hillenbrand et al., 1995). Automatic classification of carefully produced vowels, using time‐varying trajectories of F0 together with F1–F4, can produce vowel classification that is as accurate as identification done by listeners (Hillenbrand et al., 1995). The theoretical challenge, then, is to understand the neurocognitive mechanisms that make use of these complex patterns of individual acoustic variation. Consideration of a practical problem in describing the vowel systems of languages and dialects will set the stage for our discussion of perception. The practical problem is that, in describing the vowels of a language or dialect, we feel that we

must "normalize" some of the differences between talkers so that a more general, shared, talker‐independent linguistic pattern of vowel production can be seen. This is a part of the theoretical discussion too, because children learn to "imitate" the speech of their speech community, despite the fact that they produce speech that is acoustically and auditorily very different from that of the adults in the community. In this section we review some practical vowel‐normalization methods, and in the following sections we will discuss the perceptual processes that listeners may use to accomplish speech recognition in the face of talker variation.

[Figure 6.1 appears here: nine panels plotting the Peterson and Barney (1952) vowels in raw and normalized F1/F2 spaces, with axes that vary by method (Hz, mel ratios, Bark differences, F/ΔF, log‐formant deviations, z‐scores). Panel labels: Peterson & Barney (1952); Peterson (1951), F3 anchored; Syrdal & Gopal (1986), ΔBark; Lammert & Narayanan (2015), VTLN; ΔF VTLN; Nearey (1978), centered log(F); Nearey (1978), centered log(all F); Lobanov (1971), z‐score normalization; Watt & Fabricius.]

Figure 6.1  Vowel formant measurements from Peterson and Barney's classic study (1952), and eight different ways to normalize them. Top row: vowel intrinsic information only. Middle row: vowel extrinsic information with a uniform scaling factor. Bottom row: vowel extrinsic information with nonuniform scaling. Source: Modified from Peterson & Barney, 1952.

As mentioned, much of the variation between talkers in vowel acoustics is due to the differences in vocal tract length between men, women, and children, and it is possible to use acoustic measurements to estimate the talker's VTL. We can then normalize vowel formant measurements relative to VTL, thus removing one of the main sources of the acoustic difference between talkers. In the study that follows, for each vowel‐normalization algorithm we ask three questions:
1.  Extrinsic versus intrinsic information: Is the information used to scale the vowel formants intrinsic to the vowel, or must we gather extrinsic information about the speaker from context in order to calculate the normalized values?
2.  Uniform versus non‐uniform scaling: Are separate scaling factors used for the different formants?
3.  What frequency scale is used in making the calculations?
For the purposes of this chapter, it is important to realize that vowel‐normalization algorithms succeed by shifting talkers onto the same center, and scaling their range of variation in some way. This is a logical description of the talker‐normalization process, the first of Marr's (1982) levels of analysis.
Nordström & Lindblom (1975) used the frequency of F3 in open vowels (where F1 is greater than 600 Hz) to estimate the speaker's vocal tract length and from this scaled all speakers onto a vocal tract of a "standard" length (i.e. male). Wakita (1977) also used higher formants to estimate VTL. Lammert and Narayanan (2015) found support for Nordström and Lindblom's intuition that higher formants may be more reliable indicators of VTL, compared to F1 and F2. However, they found that methods relying only on F3 and F4 were not as accurate as models that include F1–F4. Reby and McComb (2003) also used all of the available vowel formant measurements in their approach to measuring VTL, calculating ΔF, the average interval between formants, as the slope of a line relating formant frequency to formant number (see also Fitch, 1997). Using all available formant measurements for a talker, following Johnson (2020), we can calculate ΔF by (1) and then the VTL by (2). In formula (1) we take the average across all vowels j produced by a talker, of the scaled value of each formant i [F1, F2, F3, F4], where the scale factor (i − 0.5) is derived from the formant number. Thus, each value Fij/(i − 0.5) is an estimate of ΔF.

1.  ΔF = (1/mn) Σj=1…n Σi=1…m Fij/(i − 0.5), where i is formant number and j is token number

2.  VTL = c/(2ΔF), where c = 34,000 cm/s is the speed of sound

VTL normalization (Nordström & Lindblom, 1975) is a uniform scaling method. This means that formant measurements are mapped onto a reference scale by estimating the length of the speaker's vocal tract and then scaling the formant frequencies with a single scale factor based on vocal tract length. Thus, the ΔF approach simply expresses vowel formant frequencies in terms of an acoustic measure of VTL – the average interval between formants.
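To make formulas (1) and (2) concrete, the sketch below computes ΔF and VTL from a talker's formant measurements and then rescales the formants by ΔF, as in the ΔF VTLN panel of Figure 6.1. It is a minimal illustration in Python, assuming the formants are supplied as an array of shape (number of tokens × number of formants) in hertz; it is not code from this chapter or from the phonTools package, and the example formant values are hypothetical.

```python
import numpy as np

def delta_f(formants_hz):
    """Formula (1): estimate the average formant spacing (delta F) for one talker.

    formants_hz: array of shape (n_tokens, m_formants), e.g. F1-F4 in Hz for
    every vowel token produced by the talker.  Each value F_ij / (i - 0.5) is
    an estimate of delta F because, for an idealized uniform tube closed at
    the glottis, formant i lies at (i - 0.5) * c / (2L).
    """
    F = np.asarray(formants_hz, dtype=float)
    formant_number = np.arange(1, F.shape[1] + 1)      # i = 1 .. m
    return np.mean(F / (formant_number - 0.5))         # average over i and j

def vocal_tract_length(formants_hz, c=34000.0):
    """Formula (2): VTL = c / (2 * delta F), with c in cm/s giving VTL in cm."""
    return c / (2.0 * delta_f(formants_hz))

def normalize_by_delta_f(formants_hz):
    """Uniform scaling: express each formant as F / delta F (dimensionless)."""
    return np.asarray(formants_hz, dtype=float) / delta_f(formants_hz)

# Hypothetical example: four tokens of F1-F4 (Hz) from one adult male talker.
talker = np.array([
    [300, 2300, 3000, 3800],   # [i]-like
    [700, 1200, 2500, 3600],   # [a]-like
    [450, 1000, 2400, 3500],   # [o]-like
    [500, 1500, 2500, 3500],   # schwa-like
])
print(round(delta_f(talker)), "Hz")               # average formant interval
print(round(vocal_tract_length(talker), 1), "cm") # roughly adult-male VTL
print(np.round(normalize_by_delta_f(talker), 2))  # F/deltaF, as in Figure 6.1
```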

150  Perception of Linguistic Properties Fant (1975) observed that uniform VTL scaling may not be quite right because male and female vowel spaces can be made to match better if there are many scale factors (one for each formant of each vowel). His proposed nonuniform scaling method allowed for vowel‐specific and formant‐specific scaling factors. In Fant’s view, one motivation for nonuniform scaling is that, in addition to VTL differences, men and women differ in the relative lengths of their oral and pharyngeal cavities. So the same constriction locations may lead to different formant patterns depending on the vowel. It is interesting, though, that in Fant’s (1975) data, the scaling factors for different vowel/formant combinations are correlated with the formant frequency. This suggests a nonlinearity in talker differences which may be captured using a nonlinear auditory frequency scale. Figure  6.1 shows the vowel spaces that are obtained when various vowel‐­ normalization algorithms are applied to the Peterson and Barney (1952) vowel data (as distributed in Santiago Barreda’s R package phonTools). The plots in Figure 6.1 were produced by the plot_vowels() function in this package. There is a proliferation of normalization algorithms (and we have added one, based on Reby & McComb’s [2003] ΔF method of finding VTL) and it is apparent from the figure that many of the algorithms produce substantially similar results, so some discussion of their shared assumptions is worthwhile. Most of the algorithms illustrated in Figure 6.1 use vowel extrinsic information to perform vowel normalization (see also Table 6.1); the only ones that don’t are Syrdal and Gopal’s (1986) auditory formant distances method and Peterson’s (1951) F3 anchoring method. The vocal tract length normalization (VTLN) procedures by Lammert and Narayanan (2015), and the ΔF method use information that is extrinsic to the token being scaled. The only difference between these is in how VTL is found. The point is that data from many tokens produced by the speaker are used to calculate the scale factor. Similarly, methods that use statistics calculated over all (or selected) tokens from a talker to then normalize that talker’s vowel space are “extrinsic” vowel‐ normalization methods. This includes algorithms that use the mean values of formants (Lobanov,  1971; Fabricius, Watt, & Johnson, 2009; Nearey,  1978), or the formant range (Gerstman, 1968). Miller’s (1989) formant ratio model is an unusual hybrid because he uses an extrinsic measure of voice pitch (which he called the sensory reference [SR]) calculated over a span of prior speech, while the remaining parameters are intrinsic to the vowel being classified. For a ­database of isolated word reading such as the Peterson and Barney (1952) set, Miller’s SR is not very necessary, but in running speech, F0 variation should be  leveled out somehow (Johnson,  1990a) and the way SR is calculated ­accomplishes this. The middle row in Figure 6.1 shows methods that use uniform scaling. A single scale factor is applied equally to both F1 and F2. In VTLN this factor is related to the calculated length of the vocal tract. The Nearey uniform scaling method that is shown here uses the geometric mean of the log formants (F1, F2, and F3) as the scaling factor, so the interpretation of the units on the axis is not very straightforward. A variant of this method was used by Labov, Ash, and Boberg (2006).

Speaker Normalization in Speech Perception  151 The bottom row in Figure 6.1 shows methods of vowel normalization that use nonuniform scale factors (Fant, 1975). In each of these, a scale factor is calculated separately for each formant for each speaker. The scale factor in each of the nonuniform methods shown in Figure 6.1 is related to the mean value of the formant being scaled (which is marked with the cross‐hairs in the graph). In z‐score normalization and in Watt and Fabricius’s (2002) method the center is calculated in hertz, while in Nearey’s (1978) approach the center is calculated in log(Hz). The unit of measure in these methods is also different, being the standard deviation of the formant in z‐score normalization and the ratio between the formant and the mean formant in the other methods. Several researchers (Hindle,  1978; Disner,  1980; Adank, Smits, & van Hout, 2004) have compared vowel‐normalization methods using classification accuracy of vowels and reduction of speaker information as criteria for evaluating the different methods. As seems obvious in Figure 6.1, there is a great deal of similarity between many of the normalization methods. We tested this by implementing each normalization algorithm and then we built support vector machine (SVM) classification models using the normalized data to measure their classification accuracy – both in classifying vowels (which should be well separated in a talker‐ normalized representation) and in classifying talkers (which should be collapsed in a talker‐normalized representation). The top portion of Table  6.1 shows identification performance by algorithms that use two dimensions, centering and scaling F1 and F2 in various ways. The bottom half of the table compares algorithms that use three dimensions. It is noteworthy that the VTLN methods differ quite substantially from each other in vowel‐classification accuracy even though they use the same normalization method. This highlights the fact that an accurate estimate of VTL is crucial in this approach. It is also noteworthy that none of these normalization methods completely removes talker‐type information. Classification by type (e.g. man, woman, child) is always better than chance. The Watt and Fabricius method (Watt & Fabricius,  2002; Fabricius, Watt, & Johnson, 2009) was designed to be quite careful about the selection of the midpoint of the talker’s vowel space by selecting a subset of vowels that sample the extremes of the available formant space for each speaker. This care in selecting the midpoint of the vowel space is needed in comparing dialects (see, e.g., Bigham, 2008) but not necessary for the current task, so we estimated the center of the space as the mean values of F1 and F2 (as is done in z‐score normalization and in the Nearey 1 method). The nonuniform extrinsic methods essentially estimate the VTL separately with each formant – an estimate of VTL from F1, a separate one from F2, and so on. Table 6.1 also shows classification accuracy without any vowel normalization when three (F1, F2, F3) or four (F0, F1, F2, F3) acoustic vowel measurements are provided. The results largely agree with those reported by Hillenbrand et  al. (1995), who found that a linear discriminant function using only F1 and F2 could correctly classify only 68 percent of American English vowel tokens. But, when F3 was added to the function, the correct classification jumped to 81 percent, and 79 percent were correctly classified when the set of predictors was F0, F1, and F2.
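For readers who want to see what the extrinsic methods in Figure 6.1 amount to computationally, here is a minimal sketch, not the chapter's own code, of Lobanov z‐score normalization and the two Nearey log‐mean variants as the chapter labels them (Nearey 1, formant‐wise; Nearey 2, shared grand mean). It again assumes a per‐talker array of formant values in hertz, and the example values are hypothetical.

```python
import numpy as np

def lobanov(formants_hz):
    """Lobanov (1971): z-score each formant within a talker.

    Nonuniform, extrinsic: a separate center (mean) and scale (standard
    deviation) is computed for each formant column across the talker's tokens.
    """
    F = np.asarray(formants_hz, dtype=float)
    return (F - F.mean(axis=0)) / F.std(axis=0)

def nearey_formantwise(formants_hz):
    """'Nearey 1': subtract each formant's own mean log value (nonuniform)."""
    logF = np.log(np.asarray(formants_hz, dtype=float))
    return logF - logF.mean(axis=0)

def nearey_shared(formants_hz):
    """'Nearey 2': subtract one grand mean of the log formants (uniform).

    A single scale factor is shared by all formants, which simply slides the
    talker's log-frequency vowel space to a common location.
    """
    logF = np.log(np.asarray(formants_hz, dtype=float))
    return logF - logF.mean()

# Hypothetical F1/F2 values (Hz) for a handful of one talker's vowel tokens:
talker_f1f2 = np.array([[300, 2300], [700, 1200], [450, 1000], [500, 1500]])
print(np.round(lobanov(talker_f1f2), 2))
print(np.round(nearey_formantwise(talker_f1f2), 2))
```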

Table 6.1  Percentage of correct identification by support vector machine models of Peterson and Barney (1952; PB52) and Hillenbrand et al. (1995; H95) vowels and talkers (man, woman, child [MWC], or man, woman, boy, girl [MWBG]) by different vowel‐normalization methods. The best algorithms maximize vowel classification and eliminate talker information in the representation. If there is no talker information in the representation, then the percentage of correct identification for MWC in PB52 will be lower than 40% and for MWBG in H95 lower than 31% (estimated by a permutation test).

Method | Type | PB52 vowels (%) | H95 vowels (%) | PB52 MWC (%) | H95 MWBG (%)
Raw F1 and F2 | None | 77.3 | 62.9 | 66.7 | 53.2
Mean λ (Patterson & Irino, 2013) | Intrinsic | 79.5 | 67.0 | 49.9 | 44.9
F3 anchor (Peterson, 1951) | Intrinsic | 78.6 | 71.3 | 52.7 | 44.7
F1 anchor (Peterson, 1961) | Intrinsic | 79.4 | 72.0 | 49.7 | 40.9
Mean F* anchor (Sussman, 1986) | Intrinsic | 80.1 | 72.3 | 49.5 | 41.3
VTLN (Nordström & Lindblom, 1975) | Uniform | 82.5 | 72.7 | 49.8 | 41.9
VTLN (Lammert & Narayanan, 2015) | Uniform | 87.6 | 77.5 | 51.1 | 43.0
Mean F*, log difference (Nearey 2; Nearey, 1978) | Uniform | 88.0 | 77.8 | 51.7 | 42.8
Range normalization (Gerstman, 1968) | Nonuniform | 85.2 | 74.8 | 47.6 | 40.6
VTLN (ΔF) | Uniform | 88.2 | 78.1 | 50.9 | 42.9
Mean F, log difference (Nearey 1; Nearey, 1978) | Nonuniform | 90.9 | 80.1 | 51.6 | 42.4
Mean F, ratio (Watt & Fabricius, 2002) | Nonuniform | 90.8 | 80.7 | 50.8 | 41.4
Z‐score normalization (Lobanov, 1971) | Nonuniform | 92.6 | 84.4 | 49.3 | 39.8

Raw F1, F2, F3 | None | 86.5 | 76.9 | 83.4 | 69.9
Mel F1, F2, F3 | None | 86.4 | 76.8 | 84.3 | 70.3
Mean λ (Patterson & Irino, 2013) | Intrinsic | 82.0 | 72.7 | 67.7 | 55.5
Formant ratios (Miller, 1989) | Extrinsic | 86.0 | 78.1 | 59.2 | 52.8
Bark differences (Syrdal & Gopal, 1986) | Intrinsic | 83.9 | 77.1 | 58.3 | 44.0
No normalization (F0, F1, F2, F3) | None | 90.0 | 81.3 | 94.8 | 75.5

They found that a discriminant function with F0, F1, F2, and F3, taken from three temporal locations in each vowel, plus the vowel duration could correctly classify 95 percent of the tokens. This indicates that each vowel contains within itself information that, when appropriately utilized, can correctly identify the vowel without any vowel‐extrinsic information. Table 6.1 also shows that, when several raw dimensions are used to represent vowels, the representation supports both talker and vowel classification. So, we have these two observations about the information conveyed by vowels. (1) Extrinsic information can be used to place the main cues for vowel identity into a VTL frame of reference that facilitates vowel classification on the basis of just two or three acoustic attributes. (2) At the same time, information that is intrinsic to the token, the F0, and higher formants may also, at least partially, provide such a frame of reference for classification. The broad strokes of a model of auditory vowel perception that emerges from this review is that, with intrinsic vowel information, we have a signal‐driven representation of vowels that supports both vowel and talker identification. Extrinsic to the vowel token itself we have contextual information that provides a frame of reference that can guide the interpretation of the vowel‐intrinsic information. We turn now to consider how listeners seem to make use of these complementary sources of information in speech perception.
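The comparisons summarized in Table 6.1 were made with support vector machine classifiers trained on normalized formant data, scored both for vowel identity and for talker type. The sketch below shows one way such an evaluation could be set up with scikit‐learn; the cross‐validation scheme, kernel, and variable names are assumptions for illustration, not a reconstruction of the procedure actually used for the table.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate(features, vowel_labels, talker_type_labels, folds=5):
    """Score a vowel representation two ways, in the spirit of Table 6.1:
    vowel classification should be high, while talker-type classification
    should approach chance if talker information has been removed."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    vowel_acc = cross_val_score(clf, features, vowel_labels, cv=folds).mean()
    talker_acc = cross_val_score(clf, features, talker_type_labels, cv=folds).mean()
    return vowel_acc, talker_acc

# Hypothetical usage: X_raw holds F1/F2 in Hz, X_norm the normalized values,
# y_vowel the vowel categories, and y_type man/woman/child codes.
# print(evaluate(X_raw, y_vowel, y_type))
# print(evaluate(X_norm, y_vowel, y_type))
```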

Intrinsic normalization

This section and the next review speech‐perception and neurophonetic studies that show how information about talkers is integrated with linguistic and phonetic perception. In a typical behavioral study, a continuum of synthetic phonetic tokens is presented to listeners together with manipulations of segment‐internal talker cues or of contextual talker cues, and the perceptual identification of the tokens along the continuum is measured. In this design, a talker effect is said to occur when the identification function of the continuum shifts as a function of the values of the talker cues. Talker cues may be the F0 of the vowel token (a segment‐internal talker cue), or the overall level of the formant frequencies of a carrier phrase in which the

token is spliced (a contextual talker cue). In a typical neurophonetic study, a similar mix of phonetic tokens differing in segment‐internal or contextual talker cues is presented, but the response variable is the pattern of neural firing, where we learn something from the location and timing of cortical talker effects. This section focuses on segment‐internal talker cues (intrinsic normalization), and provides a general overview of cortical organization for speech perception. The following section focuses on contextual talker cues (extrinsic normalization).
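To make the notion of a talker effect concrete, the sketch below fits a logistic identification function to responses along a continuum for two context conditions and compares their category boundaries (the 50 percent crossover points). The continuum steps and simulated responses are invented for illustration only.

```python
# Sketch: quantifying a talker effect as a category-boundary shift (illustrative data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def boundary(steps, responses):
    """Fit P(response = 1) along the continuum; return the 50% crossover point."""
    model = LogisticRegression().fit(np.asarray(steps, dtype=float).reshape(-1, 1), responses)
    return -model.intercept_[0] / model.coef_[0][0]

# Invented example: a 7-step vowel continuum (0 = one category, 1 = the other),
# identified after two different context conditions.
rng = np.random.default_rng(1)
steps = np.tile(np.arange(1, 8), 10).astype(float)
context_a = (steps + rng.normal(0, 1.0, steps.size) > 4).astype(int)
context_b = (steps + rng.normal(0, 1.0, steps.size) > 3).astype(int)
shift = boundary(steps, context_a) - boundary(steps, context_b)
print(f"Boundary shift between contexts: {shift:.2f} continuum steps")
```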

Vowels as formant patterns

Formant ratio theory was proposed by Lloyd (1890), who wrote:

There is no way in which single isolated resonances can be imagined strongly to differ except in absolute pitch. But when it has been shown that the principal vowels all probably possess *two* resonances we are at once delivered from the necessity of any such inference. It at once becomes conceivable that the fundamental cause of any given vowel quality is the relation in pitch between two resonances, irrespective of any narrow limit in absolute pitch. (Lloyd, 1890, p. 169)

The idea that vowels are best thought of as formant patterns has been echoed many times in subsequent studies. For example, Potter and Steinberg (1950) stated that in vowel perception “a certain spatial pattern of stimulation on the basilar membrane may be identified as a given sound regardless of position along the membrane”; they compared vowel perception to the perception of musical chords, saying, “the ear can identify a chord as a major triad, irrespective of its pitch position” (p. 812). This conception of vowels as formant patterns has been used by many speech‐perception researchers in various forms (Peterson, 1951, 1961; Sussman, 1986; Bladon, Henton, & Pickering, 1984); sometimes the vowel F0 is also considered to be a part of the vowel spectral pattern (Traunmüller, 1981, 1984; Syrdal & Gopal, 1986; Miller, 1989; Patterson & Irino, 2013), which makes sense given how prominent the harmonics of the fundamental are in vowel auditory spectra. In the vowel‐normalization literature, the most common way to represent the vowel formant spectral pattern is as formant ratios (Peterson, 1951, 1961; Sussman, 1986), or as formant ratios with F0 (Traunmüller, 1981, 1984; Miller, 1989; Syrdal & Gopal, 1986). In contrast, Bladon, Henton, and Pickering (1984) implemented this idea as a whole‐spectrum matching model of vowel perception, simply sliding auditory spectra up or down the frequency scale. This is similar to the procedure used for VTLN in automatic speech recognition (e.g. Garau, Renals, & Hain, 2005; Kinnunen & Li, 2010).
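A minimal illustration of the pattern idea: if a vowel token is re‐expressed as log formant intervals (ratios), a uniform scaling of all frequencies (a longer or shorter vocal tract, or a transposed chord) leaves the representation unchanged. The particular intervals chosen below are an assumption for illustration, loosely in the spirit of the formant‐ratio proposals cited above, not any one published model.

```python
# Sketch: a vowel as a pattern of log formant intervals rather than absolute frequencies.
import numpy as np

def log_intervals(f0, f1, f2, f3):
    """Represent a token by adjacent log-frequency intervals (ratios)."""
    return np.array([np.log(f1 / f0), np.log(f2 / f1), np.log(f3 / f2)])

# Invented /i/-like token, and the same token with every frequency scaled by 1.2
# (as from a proportionally shorter vocal tract).
token  = log_intervals(120.0, 280.0, 2250.0, 2890.0)
scaled = log_intervals(120.0 * 1.2, 280.0 * 1.2, 2250.0 * 1.2, 2890.0 * 1.2)
print(np.allclose(token, scaled))  # True: uniform scaling leaves the interval pattern unchanged
```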

F0 normalization

Evidence for the perceptual effects of vowel‐intrinsic information comes from studies of both F0 (the fundamental frequency of voicing) and the higher vowel formants.

The perceptual effect of vowel F0 was studied by Miller (1953), who found that the perceptual category boundary between vowels shifted when the F0 was doubled from 120 Hz to 240 Hz. Fujisaki and Kawashima (1968) studied this further and found F1 boundary shifts of 100 Hz to 200 Hz when F0 was shifted by 200 Hz. Slawson (1968) estimated that an octave change in F0 (doubling) produced a perceived change in F1 and F2 of about 10 to 12 percent. The direction of these effects mirrors the correlations found in speech production, namely that as F0 increases the perceived values of the formants also increase. Barreda and Nearey (2012) concluded that F0 impacts vowel perception indirectly through the “perceived identity of the speaker” (Johnson, 1990a, p. 643), rather than directly as envisioned by Syrdal and Gopal (1986) or Miller (1989), who included a term for the F1/F0 ratio in their representations of vowels. This indirect effect is mirrored in the neural representation of speech, as we shall see.

Higher formant normalization

It has also been reported that the boundaries between vowel categories are sensitive to the frequencies of a vowel’s higher formants (F3–F5). Fujisaki and Kawashima (1968) demonstrated an F3 effect with two different vowel continua: an F3 shift of 1500 Hz produced a vowel‐category boundary shift of 200 Hz in the F1–F2 space for an /u/–/e/ continuum, but a boundary shift of only 50 Hz in an /o/–/a/ continuum. Slawson (1968) found very small effects of shifting F3 in six different vowel continua. Nearey (1989) found a small shift in the midpoint of the /ʊ/ vowel region when the frequencies of F3–F5 were raised by 30 percent, but this effect occurred for only one of the two sets of stimuli tested. A possible explanation for some of the inconsistencies in this literature was offered by Johnson (1989), who also found an F3 boundary shift and attributed it to spectral integration of F2 and F3 (Chistovich, Sheikin, & Lublinskaja, 1979), because the F3 manipulation influenced only the perception of front vowels (in which F2 and F3 lie within 3 Bark of each other) and not back vowels, which have a larger frequency separation between F2 and F3.
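Since this account hinges on whether F2 and F3 fall within roughly 3 Bark of each other, a minimal sketch of that check is given below. It uses one common analytic Hz‐to‐Bark approximation; the formant values in the example calls are invented, and the 3‐Bark window is the criterion named in the text rather than a parameter of any particular model.

```python
# Sketch: checking the 3-Bark spectral-integration criterion for F2 and F3 (illustrative).
def hz_to_bark(f):
    """One common analytic Hz-to-Bark approximation."""
    return 26.81 * f / (1960.0 + f) - 0.53

def f2_f3_integrate(f2, f3, window=3.0):
    """True if F2 and F3 are close enough (in Bark) to be spectrally integrated."""
    return abs(hz_to_bark(f3) - hz_to_bark(f2)) < window

print(f2_f3_integrate(2100.0, 2800.0))  # front-vowel-like spacing: True (about 1.9 Bark apart)
print(f2_f3_integrate(900.0, 2600.0))   # back-vowel-like spacing: False (about 6.9 Bark apart)
```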

Neural correlates of intrinsic normalization

When it comes to the neural infrastructure that is involved in the processing and/or representation of speech sounds, or at least vowels, three questions are relevant: (1) What is the dominant representation? Abstracted phonemes, or rather their underlying acoustic‐phonetic properties? (2) How veridical are these representations? Are their acoustic properties preserved, or are they normalized? (3) If normalized representations exist, what properties of auditory cortex processing allow them to emerge? As will become clear, the first question can best be answered as follows: the dominant representations of speech sounds in the auditory cortex appear to be acoustic‐phonetic. The second question can be answered only partially: the brain does appear to give rise to warped, vowel‐identity‐related representations in a processing stream separate from the one for

speaker‐identity‐related representations, although it remains unclear whether these representations become completely or only partially separated. For the third question, only suggestive evidence can be offered. These findings will, however, be useful in guiding future research into this topic.

The basic auditory‐processing hierarchy  To understand how speech sounds are processed by the human brain, it seems useful first to briefly discuss some of the most relevant properties of early auditory processing, focusing especially on representations in the auditory cortex since that is where more complex and speech‐specific representations become dominant. Figure 6.2 provides a visualization of the anatomical cortical landmarks of the regions involved. The majority of auditory information reaches the cortex through ascending connections to the primary auditory cortex (PAC), which is mostly situated within the Sylvian fissure on the so‐called Heschl’s gyrus. A dominant property of PAC is that it partly inherits the tonotopic organization of the cochlea (a place coding of acoustic frequencies). That is, adjacent cortical areas on PAC are sensitive to slightly different frequencies in the auditory signal (Baumann, Petkov, & Griffiths, 2013; Bitterman et al., 2008; Humphries, Liebenthal, & Binder, 2010; Formisano et al., 2003; Moerel, De Martino, & Formisano, 2012; Saenz & Langers, 2014). Many of the acoustic cues that are critical for the perception of speech, such as formants, formant transitions, and amplitude modulations, are represented on PAC as spatial and spatiotemporal patterns of activation (e.g. Young, 2008). The most important cortical structure receiving direct information from PAC is the broader secondary auditory cortex. This includes the planum temporale and the planum polare, both situated within the Sylvian fissure, and the laterally exposed superior temporal gyrus (STG). Patches of tissue in the secondary auditory cortex are often described as behaving like filters that are sensitive to increasingly complex spectro‐temporal information (i.e. combinations of frequencies and/or frequency sweeps), like that observed in natural speech. It is generally observed that the processing of sound becomes increasingly speech specific for patches of tissue located further away from PAC (see Obleser & Eisner, 2009; Price, 2012; Liebenthal et al., 2014; Turkeltaub & Coslett, 2010; DeWitt & Rauschecker, 2012; Overath et al., 2015), especially in ventral and anterior directions on the STG. A dominant idea has been that beyond PAC processing there is a stream of activation that is especially involved in the transition from acoustic‐phonetic sound representations to the activation of lexico‐semantic representations, resulting in comprehension. This flow of information involves the spreading of activation from PAC and closely surrounding regions toward the anterior (anterior temporal lobe [ATL]) and ventral regions (middle temporal gyrus [MTG]) of the temporal lobe, which are often thought to be involved in lexical‐level processing (Davis & Johnsrude, 2003; DeWitt & Rauschecker, 2012; Hickok & Poeppel, 2007; Scott, Blank, Rosen, & Wise, 2000). This flow of information is typically contrasted with a second flow of information directed outside of the temporal lobe, which is thought to be involved in sensorimotor integration and phonological working memory (the so‐called dorsal stream). The functional properties of this second


Figure 6.2  Anatomical landmarks of the temporal lobe on and around the regions involved in early speech‐sound processing. Regions outside the temporal lobe are displayed as transparent, allowing for the visualization of Heschl’s gyrus and planum polare and planum temporale, which are all situated inside the Sylvian fissure. STG = superior temporal gyrus; STS = superior temporal sulcus; MTG = middle temporal gyrus; ATL = anterior temporal lobe.

flow of information will not be discussed in further detail here (for discussion, see Hickok & Poeppel, 2007; Rauschecker & Scott, 2009; Rauschecker & Tian, 2000; Scott & Johnsrude, 2003). As suggested earlier, one of the important questions concerns what the dominant representational form of speech sounds is. This could involve representations based on syllables, phonemes, gestures, or acoustic‐phonetic features, to name a few. In a recent investigation relying on electrocorticography (ECoG; the measurement of electrical activity directly from the brain surface), Mesgarani et al. (2014) observed that focal patterns of cortical activity displayed selectivity for phonetic features. For example, activity recorded from individual electrodes was found to display a reliable response to sets of phonemes (e.g. plosives /d/, /b/, /g/, /p/, /k/, and /t/; sibilants /ʃ/, /z/, and /s/; low back vowels /a/ and /aʊ/; or high front vowels and glides /i/ and /j/). Interestingly, none of the electrodes from which they recorded displayed preferences for individual phonemes. These observations suggest that at the level of the STG the human auditory cortex overwhelmingly represents speech sounds as organized by acoustic‐phonetic features (see also Arsenault & Buchsbaum, 2015; Steinschneider et al., 2011). These findings also align with earlier studies using recordings at the level of single auditory cortex neurons (Chan et al., 2014; Creutzfeldt, Ojemann, & Lettich, 1989).

The cortical separation of vowel types and voice properties  It turns out that vowels can be quite accurately classified based on recordings of these spatially distributed acoustic‐phonetic feature representations. Classification accuracy in those cases is, however, closely related to the acoustic‐phonetic (dis)similarity of the vowel tokens, which, as previous sections have demonstrated, provides enough information for

accurate classification. Indeed, functional magnetic resonance imaging (fMRI) research has demonstrated that classification techniques that are sensitive to spatially distributed patterns of activation (e.g. multivoxel pattern analysis) allow for an accurate separation of at least the corner vowels /i/, /a/, and /u/ in the bilateral auditory cortex (e.g. Formisano et al., 2008). Furthermore, in magnetoencephalography (MEG) experiments it has also been observed that those vowels that are acoustically most distinct also give rise to the most dissimilar responses (e.g. Obleser, Lahiri, & Eulitz, 2004; Shestakova et al., 2004). Mesgarani et al. (2014) further confirmed this pattern: they observed a high correlation (r = 0.88) between acoustic distances in the F1–F2 space of pairs of vowels and the differences in patterns of cortical activation. In addition to the observation that vowel types can be distinguished on the basis of their neural responses, it has also been demonstrated that the neural representation of vowel types and the representation of speaker‐specific information (presumably related to F0 and higher formants) involve partially nonoverlapping cortical patches (e.g. Formisano et al., 2008; Edmonds et al., 2010; Bonte et al., 2014). For example, Edmonds et al. (2010) presented listeners with steady vowel sounds (or vowel‐like noise) that transitioned into other vowel sounds and/or sounds cueing a speaker change (with no gap in between items), while recording an electroencephalogram (EEG). These authors modified formant frequencies to induce the percept of the vowels /a/, /e/, /i/, /o/, and /u/, while also modifying the overall ratio between the different formants. They reported larger event‐related potential (ERP) deflections, estimated to originate from planum temporale (PT) and planum polare (PP), when one vowel transitioned into another vowel (spoken by the same speaker) than when a specific vowel transitioned into the same vowel (spoken by another speaker). Similar observations have recently been reported based on MEG (Andermann et al., 2017). Those authors reported spatially separable sources that related to changes in pitch (estimated to originate from Heschl’s gyrus) and vowel type and mean formant frequency (both localized to PT). A related result has been reported in an MEG study by Monahan and Idsardi (2010). They relied on the observation that the response latency of the M100 component tends to decrease (i.e. the component tends to arise earlier) when the frequency of F1 is nearer 1000 Hz. They observed that the latency of this component was not only dependent on F1, but also sensitive to the frequency of F3 (for two out of three vowels tested). They concluded that these results provide evidence that the M100 is actually coding the F3 to F1 ratio – a parameter in Peterson’s (1951, 1961) vowel‐normalization method (see Table 6.1). These studies support the notion that different dimensions of the speech signal may become separately represented, and that those representations are partly “invariant” with respect to changes on the other dimensions. Other support for this notion comes from the observation that the robustness of these separate representations is modulated by task demands. fMRI research has shown that more accurate cortical vowel classification is observed in vowel‐identification tasks, and better cortical speaker‐identity classification in speaker‐identification tasks (Bonte et al., 2014; von Kriegstein et al., 2010).
Moreover, it has been shown that models trained on the BOLD data

from listening to one set of speakers can be used to accurately classify vowel identity from held‐out speakers (Formisano et al., 2008). It bears mentioning, however, that these studies do not seem to provide unequivocal evidence for normalized vowel representations. That is, while the separation of vowel‐type and speaker‐identifying information is promising, it is important to appreciate that these properties are also partly represented as different parts of the speech signal (F1–F2 for vowel type, and F0 and the higher formants for speaker information). Hence, it remains unclear whether the separability of these properties in cortical activation simply reflects the specific sensitivities of different patches of cortex, some responding preferentially to lower formant frequency ranges and others responding to pitch and/or higher formant frequencies. Moreover, a number of these findings appear somewhat inconclusive. For example, the F3:F1 ratio interpretation offered by Monahan and Idsardi (2010) required a significant theoretical hedge to explain why [o] and [ɛ] showed the effect while [ə] did not. In the study of Edmonds et al. (2010) the overall acoustic changes in F1 and F2 seem to have been larger for the most extreme vowel changes than for the most extreme speaker changes, which could explain the larger ERP deflections observed for vowel‐type changes. More generally, it is also clear that the representation of voice acoustics and vowel‐type information, while perhaps partly nonoverlapping, in fact seems to involve mostly the same general cortical regions (e.g. Chandrasekaran, Chan, & Wong, 2011). In Formisano et al. (2008, fig. 4) and Bonte et al. (2014, fig. 5), for example, it is clear that the vowel‐classification maps do still retain speaker‐gender information. Some of the uncertainty in interpreting the studies discussed above arises from the fact that those studies were partly restricted in either spatial or temporal resolution, which makes it difficult to track the representation of specific speech cues. A recent ECoG study has addressed intrinsic talker normalization in the perception of intonation contours (Tang, Hamilton, & Chang, 2017). Tang, Hamilton, and Chang presented listeners with sentences that were each synthesized to contain four different intonation contours. These sentences were synthesized with male and female voices, thereby creating large absolute differences in pitch frequencies. Tang et al. (2017) found that the auditory cortex contains cortical patches of tissue that follow speaker‐normalized pitch contours: these patches responded in the same way to linguistically identical pitch contours, irrespective of absolute F0 and phonetic content. Moreover, only very few electrodes demonstrated sensitivity to absolute instead of normalized pitch. This study thus demonstrates what the representational outcome of a successful normalization process for vowels could look like at a more fine‐grained level of representation. More importantly, it confirmed that the auditory cortex indeed generates normalized representations for at least some speech cues, buttressing the findings of vowel‐type normalization reported earlier.

How may frequency‐independent coding of formants emerge in the auditory cortex?
As discussed earlier, a number of normalization approaches have relied on a relative coding scheme, that is, a scheme where formants are interpreted not as absolute features but rather on the basis of their relation to each other and perhaps to the

fundamental frequency. Assuming that these normalized representations indeed exist, an important question is what properties of auditory cortex processing may give rise to them. In the rest of the chapter we will discuss findings that shed some light on this issue, focusing on properties of auditory cortex processing that could play an important role in achieving such a format. The speech signal contains both fluctuations in the overall amplitude envelope (the spacing and prominence of peaks in amplitude over time) and fluctuations in the spectral envelope (i.e. the spacing and prominence of peaks in the power spectrum). The ΔF measure of formant spacing that was discussed earlier as a computational vowel‐normalization factor (Table 6.1) is related to this notion. Research on auditory processing in animals has demonstrated that patches of tissue in the auditory cortex display tuning for specific combinations of spectral and temporal modulation frequencies (e.g. Depireux et al., 2001; Woolley et al., 2005; Nagel & Doupe, 2008), that is, they display tuning to specific changes in amplitude across time, and to changes across the frequency spectrum. More recently, it has been demonstrated that this is also a dominant property of human auditory processing (Hullett et al., 2016). The human STG broadly displays an anterior‐to‐posterior organization of different spectro‐temporal tuning profiles. Posterior STG sites display a preference for speech sounds that have relatively constant energy across the frequency range (low spectral modulation), but that are temporally changing at a fast rate. Anterior STG sites display preferences for speech‐sound sequences that show a high degree of spectral variation across the frequency range (high spectral modulation) but that are temporally changing at a slow rate (Hullett et al., 2016). In support of this finding, BOLD response patterns of the auditory cortex can be quite accurately predicted on the basis of models that consider a combination of spectral properties, spectral modulations, and temporal modulations in the acoustic signal (Santoro et al., 2014, 2017). This observation may be important to normalization because modulation frequencies are independent of specific frequencies and instead represent the pattern of peaks and troughs across a range of frequencies. Hence, this property of auditory cortex processing has the potential to play a role in the frequency‐independent representation of formant patterns.
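To make the link between formant spacing and vocal‐tract length concrete, the sketch below estimates an average formant spacing in the spirit of the ΔF measure listed in Table 6.1, under the idealization of a uniform tube closed at one end, whose i‐th resonance falls near (2i − 1)c/4L so that F_i/(i − 0.5) is roughly constant across formants. This is a toy illustration of the idea under that idealization, not the published algorithm; the formant values are invented.

```python
# Sketch: formant-spacing (Delta-F-style) normalization under a uniform-tube idealization.
import numpy as np

def delta_f(formants_hz):
    """Estimate average formant spacing: for a uniform tube closed at one end,
    F_i ~ (2i - 1) * c / 4L, so F_i / (i - 0.5) ~ c / 2L for every formant."""
    i = np.arange(1, len(formants_hz) + 1)
    return np.mean(np.asarray(formants_hz) / (i - 0.5))

def normalize(formants_hz):
    """Express formants in units of the estimated spacing (dimensionless pattern)."""
    return np.asarray(formants_hz) / delta_f(formants_hz)

# Invented tokens: the "same" vowel from a longer and a shorter vocal tract.
long_tract  = [500.0, 1500.0, 2500.0]
short_tract = [600.0, 1800.0, 3000.0]  # every formant scaled by 1.2
print(normalize(long_tract), normalize(short_tract))  # near-identical patterns
```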

Extrinsic normalization

Our discussion so far has focused on how the perceptual system can compensate for talker variability by integrating different simultaneous auditory cues in the speech signal. However, speech sounds rarely occur in isolation. Typically we hear speech sounds in the context of some preceding and following speech sequences, and we may also be looking at the person we are talking to. This is important because such contexts can provide constraints on the possible interpretations of speech cues. That is, “knowing” that one is listening to a tall male speaker may enhance one’s expectation of that speaker’s formant and pitch ranges (e.g. Joos, 1948). Indeed, a considerable literature has demonstrated that listeners use such acoustic and visual contextual information when interpreting speech sounds.


Extrinsic vowel normalization

The first demonstration of the fact that speech‐sound perception is highly dependent on the acoustic properties of preceding context came from a series of experiments by Ladefoged and Broadbent (Broadbent, Ladefoged, & Lawrence, 1956; Ladefoged & Broadbent, 1957). Their participants were asked to listen to synthesized versions of the words bit, bet, bat, and but. These target words were presented in isolation or were preceded by a precursor phrase in which the speaker’s voice properties had been altered (by shifting the overall formant range to higher or lower frequencies). Quite strikingly, it was observed that vowel perception was strongly dependent on the voice properties in a preceding sentence. A target vowel that had been predominantly identified as “bet” when presented in isolation was overwhelmingly identified as “bit” (which has a lower F1 than “bet”) when the preceding sentence had a high F1 range. Listeners thus seemed to have adjusted the expected dynamic formant range when interpreting the incoming target vowels. Such normalization to a particular speaker’s voice properties has since been replicated on various occasions, demonstrating that it generalizes across languages (Sjerps & Smiljanić, 2013) and that it applies to the perception of different spectral cues such as F1 and F2 (Nearey, 1989; Ladefoged, 1989; Sjerps & Smiljanić, 2013; Darwin, McKeown, & Kirby, 1989; Watkins & Makin, 1994; Mitterer, 2006; Reinisch & Sjerps, 2013), but also to F0 (in the context of lexical tone: Moore & Jongman, 1997; Cantonese: Wong & Diehl, 2003; Leather, 1983; Fox & Qi, 1990; Jongman & Moore, 2000; Peng et al., 2012; Zhang, Peng, & Wang, 2012, 2013; Francis et al., 2006; Lee, Tao, & Bond, 2009), spectral tilt (Kiefte & Kluender, 2008), and duration cues (see, e.g., Miller, Aibel, & Green, 1984; Miller, Green, & Schermer, 1984; Miller & Grosjean, 1981; Kidd, 1989; Reinisch, Jesse, & McQueen, 2011a, 2011b; Summerfield, 1981; Newman & Sawusch, 2009; Sawusch & Newman, 2000; see Miller & Liberman, 1979; Toscano & McMurray, 2015; Dilley & Pitt, 2010; Morrill et al., 2014; Pitt, Szostak, & Dilley, 2016). These results have been uncovered across a broad range of studies attempting to better understand what properties of perception give rise to these influences. Given the scope of this chapter, we will focus here only on contextual influences on vowel formants, assuming that the normalization of other cues involves different functional mechanisms. Within the domain of extrinsic normalization of formants, two types of influences have been established: (1) low‐level auditory processes that enhance perceptual contrast between acoustic context and a target sound; and (2) higher‐level influences that depend on acquired knowledge about the relation between talker properties (such as gender) and the acoustic realization of speech sounds.

Mechanisms of extrinsic normalization

A role for auditory contrast  In a classic study of how context acoustics affect subsequent perception, Watkins and Makin (1994) carried out an experiment very similar to that of Ladefoged and Broadbent (1957), except that, instead of shifting the formants of the context sentences, they filtered a context sentence such that its long‐term average spectrum (LTAS) matched that of either a low‐ or

high‐frequency average F1 carrier sentence, but without directly altering formant center frequencies themselves (see also Watkins, 1988). They observed qualitatively similar shifts in category boundaries to those observed by Ladefoged and Broadbent. Moreover, Watkins (1991) applied similar filters to a speech‐shaped noise signal and used those as preceding contexts; he observed similar effects as well (although numerically smaller). Watkins and Makin (1994) thus argued that it was not the range of the context’s F1 frequency that shifted target perception, but rather its LTAS, which suggested that the influence of context acoustics on vowel‐category perception could be better explained by an inverse‐filtering heuristic, whereby reliable spectral properties of a precursor, as reflected in its long‐term spectral characteristics, are filtered out of the target sound before relevant acoustic properties are extracted for further processing. Importantly, Watkins and Makin (1994) also argued that LTAS‐based normalization may in fact result from contrastive effects that originate from at least two separate auditory‐processing stages (also see Stilp, 2020). The noise‐carrier effects that Watkins (1991) observed were found only when the noise context and the subsequent target were presented to the same ear, and when there was only a small (≤ 160 ms) silent gap between context signal and the target sound. With speech precursors, the influence of contexts did apply to contralateral presentation and also over longer silent intervals. Furthermore, some effects of perceptual streaming also seemed to play a role: when speech contexts were presented with different interaural time delays than the targets (i.e. inducing a difference in perceived location), contrastive effects were reduced. Watkins thus suggested the existence of two stages in the auditory‐processing hierarchy that may induce LTAS‐based effects: a peripheral stage, perhaps similar to the type of “negative auditory afterimage” reported by Summerfield et al. (1984), which would explain the unilateral noise effects, and a more central contrastive mechanism, which would explain the contralateral effects obtained with speech. Only the central mechanism, then, is argued to operate over longer time scales and may be more specific to speech or speech‐like stimuli. But even those higher‐level (i.e. speech‐specific) normalization effects appear to operate on at least prelexical and potentially even general auditory levels of representation. Despite long silent intervals between context sentences and target sounds, normalization is independent of listeners’ familiarity with the context language (Sjerps & Smiljanić, 2013), and nonwords have also been found to induce normalization effects (Mitterer, 2006). Similarly, speech from one speaker can have a normalizing influence on speech from another speaker (Watkins, 1991). And, perhaps more strikingly, reversed speech sounds are as successful in inducing normalization as normal speech sounds (Watkins, 1991). Also, extrinsic normalization effects are stronger for the portion of the vowel space where the contexts differ (Mitterer, 2006), so that the influence of high vowels in context is restricted to high vowels in targets. It appears, then, that these effects cannot be the result of learned associations between talkers and their phonetic realization or of strategic shifts in category boundaries.
Indeed, extrinsic normalization effects have not only been observed in categorization designs, but also in discrimination

tasks that do not require listeners to make category judgments (Sjerps, McQueen, & Mitterer, 2013). Moreover, effects are independent of whether listeners attend to the contexts themselves (Sjerps, McQueen, & Mitterer, 2012; Bosker, Reinisch, & Sjerps, 2017). Generally, these auditorily driven normalization effects have been interpreted to support the notion that listeners are especially sensitive to acoustic change (Stilp et al., 2010; Kiefte & Kluender, 2008; Sjerps, Mitterer, & McQueen, 2011a). That is, the auditory system may calibrate to reliable properties of a listening environment in ways that enhance sensitivity to less predictable (more informative) aspects of sounds (Alexander & Kluender, 2010). Indeed, normalization effects are contrastive in nature; the typical pattern is that a high‐formant context sentence leads to an increase in the percept of a low‐formant target option, while a low‐formant context leads to more high‐formant target percepts.

Tuning in to talkers  Importantly, however, effects based on acoustic contrast cannot be the sole explanation of normalization effects. It has long been known that listeners use higher‐level information when making judgments about speech‐sound categories (e.g., Evans & Iverson, 2004). Evans and Iverson asked participants to rate category goodness for vowels that were presented in the context of sentences spoken in two different regional accents (a northern or a southern English accent). They demonstrated that participants’ perceived quality ratings depended on what they expected a speaker with a certain accent to produce. This effect cannot be explained by LTAS‐based processes, for example, because it interacted with listeners’ own dialectal background (Evans & Iverson, 2004). Moreover, listeners also adjust category boundaries as a result of nonauditory information. Johnson, Strand, and D’Imperio (1999) presented listeners with sounds from a synthetic hood–hud continuum (spoken by an androgynous voice). Participants who were told that they were listening to the speech of a female speaker had a vowel‐category boundary closer to the boundary typical of female speech than listeners who were told that they were listening to the speech of a male speaker. Similarly, when these sounds were presented in combination with a male or a female picture, listeners responded with more talker‐appropriate category boundaries. These findings suggest that perception is mediated by a representation of perceived talker identity (see also Walker, Bruce, & O’Malley, 1995; Schwippert & Benoit, 1997). In addition to categorization‐based approaches to normalization, another method involves the presentation of word lists that are either spoken by the same speaker across a block or spoken by different speakers on subsequent trials. The typical finding is that switching speakers results in lower identification accuracy (e.g., Verbrugge et al., 1976; Barreda & Nearey, 2012; Magnuson & Nusbaum, 2007; Nusbaum & Morin, 1992). Moreover, lists that involve switching talkers result in overall longer reaction times (e.g., Choi, Hu, & Perrachione, 2018) and larger talker‐normalization boundary shifts (Johnson, 1990b). These results have led to the suggestion that talker‐identity‐based normalization involves a cognitively demanding process (see Barreda & Nearey, 2012 for a review).
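Before turning to the neural data, it may help to see the LTAS‐based inverse‐filtering heuristic of Watkins and Makin (1994), discussed above, in computational form. The sketch below estimates a precursor's long‐term average spectrum and divides it out of a target's magnitude spectrum. This is only one way such contrastive compensation could be modeled; the function, signal variables, and analysis settings are assumptions for illustration, not an implementation of any published model.

```python
# Sketch: LTAS-based "inverse filtering" of a target by its precursor (illustrative).
import numpy as np
from scipy.signal import welch

def ltas_compensate(target, precursor, fs, nperseg=512):
    """Divide the target's magnitude spectrum by the precursor's long-term average spectrum."""
    freqs, ltas = welch(precursor, fs=fs, nperseg=nperseg)   # precursor power spectrum (LTAS)
    spec = np.fft.rfft(target, n=2 * (len(ltas) - 1))        # target spectrum on the same grid
    compensated = np.abs(spec) / np.sqrt(ltas + 1e-12)       # boost regions the context lacked
    return freqs, compensated

# Usage (with signals loaded as numpy arrays sampled at fs): a precursor carrying
# extra energy in the high-F1 region leaves a compensated target whose low-F1
# region is relatively enhanced, mirroring the contrastive shifts reported above.
```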


Neural extrinsic talker normalization

An important difference between intrinsic normalization and extrinsic normalization is that the latter requires the system to achieve and maintain a stable representation of the talker and their acoustic voice properties so as to provide a frame of reference for further interpretation. As outlined in the previous section, extrinsic normalization effects may arise as the result of at least two types of influences: auditory‐driven contrastive processes and higher‐level influences of expected speech‐sound realizations based on known speaker properties. In the following we will discuss what is known about the cortical processing of these properties, and whether they may affect the cortical representation of speech sounds in a manner consistent with the normalization effects that have been so abundantly observed in behavioral experimentation.

The representation of talker identities in cortical processing  As suggested earlier, the acoustically driven normalization effects have often been interpreted in the framework of contrast enhancement. One way in which such contrast enhancement may be implemented is through neural processing properties such as adaptive gain control (Rabinowitz et al., 2011) or stimulus‐specific adaptation (Ulanovsky et al., 2004; Pérez‐González & Malmierca, 2014). These mechanisms have typically been investigated in the context of adaptation to differences in loudness or the presence of background noise. However, they may play a role in extrinsic normalization. Specific neural populations that display tuning to the frequency of one context sentence (say, a high‐formant sentence in Ladefoged & Broadbent, 1957) may adapt when a listener hears that sentence. Such adaptation could then affect the responsiveness of these populations during subsequent target processing, which could bias cortical representations of subsequent target sounds away from the context‐specific F1 range. Hence, this may provide a more mechanistic implementation of the inverse‐filtering heuristic suggested by Watkins and Makin (1994). Importantly, the effects of stimulus‐specific adaptation and forward suppression become more dominant, and are longer lasting, at cortical levels of processing (Philips, Schreiner, & Hasenstaub, 2017; Fitzpatrick et al., 1999). This could, in principle, explain why auditory‐contrast‐based effects are typically stronger for speech sounds than for nonspeech sounds. Speech sounds induce considerably stronger cortical activation than nonspeech sounds (for general review see DeWitt & Rauschecker, 2012; Scott & Johnsrude, 2003; Price, 2012), which may increase the amount of shared neural infrastructure between target and precursor and hence also the amount of adaptation. Apart from acoustically driven influences, a considerable amount of work has been devoted to investigating the cortical representation of talker‐specific acoustic voice properties and talker identities. Indeed, patterns of activation in the temporal lobe allow for accurate dissociation between speakers. A number of studies, however, have suggested that there exists a separation between, on the one hand, regions that are involved in the immediate processing of the acoustics of a particular voice and, on the other hand, regions that are involved in the representation of

talker identities (e.g. Andics et al., 2010; von Kriegstein et al., 2003; Myers & Theodore, 2017). The representation of token‐specific voice acoustics has been found to involve bilateral temporal regions (e.g., STG and superior temporal sulcus [STS]). These regions are, thus, in terms of topography, largely overlapping with those regions involved in the representation of speech sounds more generally (although slightly more right‐dominant; e.g. Bonte, Valente, & Formisano, 2009; Bonte et al., 2014; Formisano et al., 2008; von Kriegstein et al., 2003). However, von Kriegstein et al. (2006) also found that changes in vocal tract length are associated with changes in activity along the precortical auditory pathway in the medial geniculate body. The representation of talker identities (or access to known voices) has most often been associated with activation in the right anterior temporal lobe (ATL; Andics et al., 2010, 2013; Belin & Zatorre, 2003; Campanella & Belin, 2007), and processing in the inferior frontal gyrus (IFG; Andics et al., 2013; Latinus, Crabbe, & Belin, 2011; Pernet et al., 2015; Zäske, Hasan, & Belin, 2017). The ATL especially seems to play an important, and heteromodal, role as lesions to that region are associated with a reduction in the ability to name famous faces and famous voices (Abel et al., 2015; Drane et al., 2013; Damasio et al., 1996; Waldron, Manzel, & Tranel, 2014; see Blank, Wieland, & von Kriegstein, 2014, for review), indicating that auditory and visually based identity‐processing streams converge in the ATL. Intriguingly, Myers and Theodore (2017) found that phonetic atypicality (a rather more aspirated /k/ than is typical for English) provokes a heightened response in core phonetic‐processing areas (bilateral MTG and STG), while talker phonetic atypicality (a more unusual /k/ than one has come to expect for a particular talker) is associated with deactivation in the right posterior MTG. Moreover, talker typicality also modulated connectivity between this deactivated MTG region and the left motor cortex.

The extrinsic rescaling of vowels in cortical processing  The existence of contrast‐enhancing properties, along with the robust representation of talker identity information in the ATL and IFG, may thus allow for both auditory‐contrast‐based and talker‐identity‐based influences on vowel representations in auditory cortex. Indeed, it does appear that speech‐sound representations in the auditory cortex become normalized as a result of preceding context. In an EEG study, for example, Sjerps, Mitterer, and McQueen (2011b) presented listeners with sequences of short (three‐syllable) context‐target pairs in an oddball EEG design. Target vowels involved “standard” sounds that were perceptually ambiguous between [ε] and [ɪ]: [εɪ]. The (infrequent) “deviants” consisted of clear instances of [ε] and [ɪ] (an F1 distinction). Context bisyllables (/papu/) were synthesized to contain either generally heightened or lowered F1 distributions. It was observed that, after high F1 contexts, a shift from an ambiguous standard [εɪ] to the clear deviant [ε] resulted in a larger neural mismatch signal than a shift to clear [ɪ]. The reverse pattern was observed when the context had a low F1 distribution. This pattern of results suggests that the context syllables induced a perceptual shift of the ambiguous standard, leading to smaller or larger

mismatch signals (and lower or higher oddball detection scores). This normalizing effect was observed as early as the N1 time window, which is more strongly related to the physical properties of the stimulus than to participants’ perceptual decisions (e.g., Roberts, Flagg, & Gage, 2004; Toscano et al., 2010; Näätänen & Winkler, 1999). In a more recent study, Sjerps et al. (2019), relying on intracranial recordings, demonstrated more directly that the vowel and formant tuning of individual patches of cortex is dependent on formant ranges in preceding speech sounds. That is, the tuning preference of a patch of cortex that generally preferred a given F1 frequency was shifted – now preferring slightly higher formant frequencies – when the carrier vowel was preceded by a sequence of speech sounds with a generally low F1 range. These findings thus demonstrate that vowel processing in the parabelt auditory cortex already displays changes in tuning profiles that are fully consistent with speaker normalization.

Extrinsic normalization

In this section we reviewed behavioral evidence that ongoing perception of talker identity influences speech perception. These data highlight a role for auditory contrast as a perceptual mechanism that may be involved in talker normalization. We also reviewed studies on the role of talker identity in speech perception, which suggest that auditory context effects may operate at a somewhat abstract level, affecting the listener’s perceptual expectations. Finally, the section reviewed neuroscience data on the implementation of acoustically driven contrast effects and on the cortical representation of talker voices.

Conclusions

Vowel normalization is a practical problem in language description, and practical solutions to it point to some of the acoustic‐phonetic information that listeners may use in perceiving speech produced by different talkers. Our review of computational vowel‐normalization methods suggested that the most successful methods make use of the whole spectrum (all available vowel formants) to provide an acoustic‐phonetic frame of reference for the evaluation of vowel quality. We reviewed speech‐perception studies showing that intrinsic F0 and extrinsic vowel formant range are the most important cues that listeners use in perceptual vowel normalization. The picture that emerges from our review of neural studies is that the process of talker normalization in speech perception is distributed across several neurocognitive mechanisms. On the one hand, basic properties of how the auditory system codes sound in context may produce normalization effects. Another low‐level phenomenon may involve the perception of the length of the talker’s vocal tract, perhaps in terms of spectral modulation, or the number of spectral peaks in the bottom two thirds of the auditory spectrum, which then is available to warp the auditory representation even precortically. The evolutionary need to

code the size of conspecific individuals probably drove the emergence of vocal‐tract size perception long before the emergence of language. And yet the warping and filtering in primary auditory processing is not the whole story. Behaviorally, we know that speech perception is also influenced by talker expectations running in parallel with the auditory stream (or, even paradoxically, in a somewhat separable stream within audition, as appears to be the case for the role of voice pitch). The field is relatively wide open for neural‐processing studies that explore the interaction of higher‐level talker expectations and speech processing, and studies like Myers and Theodore’s (2017) on how memory for specific talkers may interact, perhaps in multiple stages of processing, with the extraction of linguistic/phonetic information will reveal much about the neurocognitive mechanisms that support speech perception in a world filled with talker variation.

REFERENCES

Abel, T. J., Rhone, A. E., Nourski, K. V., et al. (2015). Direct physiologic evidence of a heteromodal convergence region for proper naming in human left anterior temporal lobe. Journal of Neuroscience, 35(4), 1513–1520. Adank, P., Smits, R., & van Hout, R. (2004). A comparison of vowel normalization procedures for language variation research. Journal of the Acoustical Society of America, 116, 3099–3107. Alexander, J. M., & Kluender, K. R. (2010). Temporal properties of perceptual calibration to local and broad spectral characteristics of a listening context. Journal of the Acoustical Society of America, 128, 3597–3613. Andermann, M., Patterson, R. D., Vogt, C., et al. (2017). Neuromagnetic correlates of voice pitch, vowel type, and speaker size in auditory cortex. NeuroImage, 158, 79–89. Andics, A., McQueen, J. M., Petersson, K. M., et al. (2010). Neural mechanisms for voice recognition. NeuroImage, 52, 1528–1540. Arsenault, J. S., & Buchsbaum, B. R. (2015). Distributed neural representations of phonological features during speech

perception. Journal of Neuroscience, 35(2), 634–642. Barreda, S., & Nearey, T. M. (2012). Direct and indirect roles of fundamental frequency in vowel perception. Journal of the Acoustical Society of America, 131(1), 466–477. Baumann, S., Petkov, C. I., & Griffiths, T. D. (2013). A unified framework for the organization of the primate auditory cortex. Frontiers in Systems Neuroscience, 7, 11. Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport, 14(16), 2105–2109. Bigham, D. (2008). Dialect contact and accommodation among emerging adults in a university setting. Unpublished doctoral thesis, University of Texas at Austin. Bitterman, Y., Mukamel, R., Malach, R., et al. (2008). Ultra‐fine frequency tuning revealed in single neurons of human auditory cortex. Nature, 451(7175), 197–201. Bladon, R. A., Henton, C. G., & Pickering, J. B. (1984). Towards an auditory theory of speaker normalization. Language & Communication, 4, 59–69.

Blank, H., Wieland, N., & von Kriegstein, K. (2014). Person recognition and the brain: Merging evidence from patients and healthy individuals. Neuroscience & Biobehavioral Reviews, 47, 717–734. Bonte, M., Hausfeld, L., Scharke, W., et al. (2014). Task‐dependent decoding of speaker and vowel identity from auditory cortical response patterns. Journal of Neuroscience, 34, 4548–4557. Bonte, M., Valente, G., & Formisano, E. (2009). Dynamic and task‐dependent encoding of speech and voice by phase reorganization of cortical oscillations. Journal of Neuroscience, 29(6), 1699–1706. Bosker, H. R., Reinisch, E., & Sjerps, M. J. (2017). Cognitive load makes speech sound fast, but does not modulate acoustic context effects. Journal of Memory and Language, 94, 166–176. Broadbent, D. E., Ladefoged, P., & Lawrence, W. (1956). Vowel sounds and perceptual constancy. Nature, 178, 815–816. Campanella, S., & Belin, P. (2007). Integrating face and voice in person perception. Trends in Cognitive Sciences, 11(12), 535–543. Chan, A. M., Dykstra, A. R., Jayaram, V., et al. (2014). Speech‐specific tuning of neurons in human superior temporal gyrus. Cerebral Cortex, 24(10), 2679–2693. Chandrasekaran, B., Chan, A. H., & Wong, P. C. (2011). Neural processing of what and who information in speech. Journal of Cognitive Neuroscience, 23(10), 2690–2700. Chistovich, L. A., Sheikin, R. L., & Lublinskaja, V. V. (1979). “Centres of gravity” and spectral peaks as the determinants of vowel quality. In B. Lindblom & S. Öhman (Eds.), Frontiers of speech communication research (pp. 143–157). New York: Academic Press. Choi, J. Y., Hu, E. R., & Perrachione, T. K. (2018). Varying acoustic‐phonemic ambiguity reveals that talker normalization is obligatory in speech processing. Attention, Perception, & Psychophysics, 80(3), 784–797.

Creutzfeldt, O., Ojemann, G., & Lettich, E. (1989). Neuronal activity in the human lateral temporal lobe: I. Responses to speech. Experimental Brain Research, 77, 451–475. Dabbs, J. M., & Mallinger, A. (1999). High testosterone levels predict low voice pitch among men. Personality and Individual Differences, 27(4), 801–804. Damasio, H., Grabowski, T. J., Tranel, D., et al. (1996). A neural basis for lexical retrieval. Nature, 380(6574), 499. Darwin, C. J., McKeown, J. D., & Kirby, D. (1989). Perceptual compensation for transmission channel and speaker effects on vowel quality. Speech Communication, 8(3), 221–234. Davis, M. H., & Johnsrude, I. S. (2003). Hierarchical processing in spoken language comprehension. Journal of Neuroscience, 23(8), 3423–3431. Depireux, D. A., Simon, J. Z., Klein, D. J., & Shamma, S. A. (2001). Spectro‐temporal response field characterization with dynamic ripples in ferret primary auditory cortex. Journal of Neurophysiology, 85(3), 1220–1234. DeWitt, I., & Rauschecker, J. P. (2012). Phoneme and word recognition in the auditory ventral stream. Proceedings of the National Academy of Sciences of the United States of America, 109(8), E505–514. Dilley, L. C., & Pitt, M. A. (2010). Altering context speech rate can cause words to appear or disappear. Psychological Science, 21(11), 1664–1670. Disner, S. (1980). Evaluation of vowel normalization procedures. Journal of the Acoustical Society of America, 67, 253–261. Drane, D. L., Ojemann, J. G., Phatak, V., et al. (2013). Famous face identification in temporal lobe epilepsy: Support for a multimodal integration model of semantic memory. Cortex, 49(6), 1648–1667. Edmonds, B. A., James, R. E., Utev, A., et al. (2010). Evidence for early specialized

processing of speech formant information in anterior and posterior human auditory cortex. European Journal of Neuroscience, 32, 684–692. Evans, B., & Iverson, P. (2004). Vowel normalization for accent: An investigation of best exemplar locations in northern and southern British English sentences. Journal of the Acoustical Society of America, 115, 352–361. Fabricius, A., Watt, D., & Johnson, D. E. (2009). A comparison of three speaker‐intrinsic vowel formant frequency normalization algorithms for sociophonetics. Language Variation and Change, 21, 413–435. Fant, G. (1975). Non‐uniform vowel normalization. STL‐QPSR, 16(2–3), 1–19. Retrieved July 28, 2020, from http://www.speech.kth.se/prod/publications/files/qpsr/1975/1975_16_2‐3_001‐019.pdf. Ferrand, C. T. (2002). Harmonics‐to‐noise ratio: An index of vocal aging. Journal of Voice, 16, 480–487. Fitch, W. T. (1997). Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. Journal of the Acoustical Society of America, 102, 1213–1222. Fitch, W. T., & Giedd, J. (1999). Morphology and development of the human vocal tract: A study using MRI. Journal of the Acoustical Society of America, 106, 1511–1522. Fitzpatrick, D. C., Kuwada, S., Kim, D. O., et al. (1999). Responses of neurons to click‐pairs as simulated echoes: Auditory nerve to auditory cortex. Journal of the Acoustical Society of America, 106, 3460–3472. Formisano, E., De Martino, F., Bonte, M., & Goebel, R. (2008). “Who” is saying “what”? Brain‐based decoding of human voice and speech. Science, 322, 970–973. Formisano, E., Kim, D. S., Di Salle, F., et al. (2003). Mirror‐symmetric tonotopic maps in human primary auditory cortex. Neuron, 40(4), 859–869.

Fox, R. A., & Qi, Y.‐Y. (1990). Context effects in the perception of lexical tone. Journal of Chinese Linguistics, 18, 261–284. Francis, A. L., Ciocca, V., Wong, N. K. Y., et al. (2006). Extrinsic context affects perceptual normalization of lexical tone. Journal of the Acoustical Society of America, 119(3), 1712–1726. Fujisaki, H., & Kawashima, T. (1968). The roles of pitch and higher formants in the perception of vowels. IEEE Transactions on Audio and Electroacoustics, AU‐16, 73–77. Garau, G., Renals, S., & Hain, T. (2005). Applying vocal tract length normalization to meeting recordings. In Interspeech 2005 – Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4–8, 2005 (pp. 265–268). Retrieved September 19, 2020. Gerstman, L. (1968). Classification of self‐normalized vowels. IEEE Transactions on Audio and Electroacoustics, 16, 78–80. Guilherme, J. M., Garcia, E. W., Tewksbury, B. A., et al. (2009). Interindividual variability in nasal filtration as a function of nasal cavity geometry. Journal of Aerosol Medicine and Pulmonary Drug Delivery, 22, 139–155. Glaser, R., York, A., & Dimitrakakis, C. (2016). Effect of testosterone therapy on the female voice. Climacteric, 19(2), 198–203. Greenspan, S. L., Nusbaum, H. C., & Pisoni, D. B. (1988). Perceptual learning of synthetic speech produced by rule. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14(3), 421–433. Harnsberger, J. D., Brown, W. S., Jr., Shrivastav, R., & Rothman, H. (2010). Noise and tremor in the perception of vocal aging in males. Journal of Voice, 24(5), 523–530. Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8, 393–402.

Hillenbrand, J., Getty, L. A., Clark, M. J., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97(5), 3099–3111. Hindle, D. (1978). Approaches to formant normalization in the study of natural speech. In D. Sankoff (Ed.), Linguistic variation, models and methods (pp. 161–171). New York: Academic Press. Hollien, H. (2001). Forensic voice identification. New York: Academic Press. Hullett, P. W., Hamilton, L. S., Mesgarani, N., et al. (2016). Human superior temporal gyrus organization of spectrotemporal modulation tuning derived from speech stimuli. Journal of Neuroscience, 36, 2014–2026. Humphries, C., Liebenthal, E., & Binder, J. R. (2010). Tonotopic organization of human auditory cortex. Neuroimage, 50(3), 1202–1211. Johnson, K. (1989). Higher formant normalization results from auditory integration of F2 and F3. Perception & Psychophysics, 46, 174–180. Johnson, K. (1990a). The role of perceived speaker identity in F0 normalization of vowels. Journal of the Acoustical Society of America, 88, 642–654. Johnson, K. (1990b). Contrast and normalization in vowel perception. Journal of Phonetics, 18, 229–254. Johnson, K. (1997). Speech perception without speaker normalization: An exemplar model. In K. Johnson & J. W. Mullennix (Eds.), Talker variability in speech processing (pp. 145–166). San Diego: Academic Press. Johnson, K. (2020). The ΔF method of vocal tract length normalization for vowels. Laboratory Phonology: Journal of the Association for Laboratory Phonology, 11(1), 10. Johnson, K., Ladefoged, P., & Lindau, M. (1993). Individual differences in vowel production. Journal of the Acoustical Society of America, 94, 701–714. Johnson, K., Strand, E. A., & D’Imperio, M. (1999). Auditory‐visual integration of

talker gender in vowel perception. Journal of Phonetics, 27, 359–384. Jongman, A., & Moore, C. (2000). The role of language experience in speaker and rate normalization processes. In Proceedings: Sixth International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China, October 16–20, 2000 (Vol. 1, pp. 62–65). Retrieved September 19, 2020, from http://www.isca‐speech.org/archive/icslp_2000. Joos, M. A. (1948). Acoustic phonetics. Language, 24(Suppl. 2), 1–136. Kidd, G. R. (1989). Articulatory‐rate context effects in phoneme identification. Journal of Experimental Psychology: Human Perception and Performance, 15(4), 736–748. Kiefte, M., & Kluender, K. R. (2008). Absorption of reliable spectral characteristics in auditory perception. Journal of the Acoustical Society of America, 123(1), 366–376. Kinnunen, T., & Li, H. Z. (2010). An overview of text‐independent speaker recognition: From features to supervectors. Speech Communication, 52(1), 12–40. Labov, W., Ash, S., & Boberg, C. (2006). The atlas of North American English: Phonology, phonetics, and sound change. A multimedia reference tool. Berlin: Mouton de Gruyter. Ladefoged, P. (1989). A note on “information conveyed by vowels.” Journal of the Acoustical Society of America, 85, 2223–2224. Ladefoged, P., & Broadbent, D. E. (1957). Information conveyed by vowels. Journal of the Acoustical Society of America, 39, 98–104. Lammert, A. C., & Narayanan, S. S. (2015). On short‐time estimation of vocal tract length from formant frequencies. PLOS One, 10(7), e0132193. Latinus, M., Crabbe, F., & Belin, P. (2011). Learning‐induced changes in the cerebral processing of voice identity. Cerebral Cortex, 21, 2820–2828. Leather, J. (1983). Speaker normalization in perception of lexical tone. Journal of Phonetics, 11, 373–382.

Speaker Normalization in Speech Perception  171 Lee, C. Y., Tao, L., & Bond, Z. S. (2009). Speaker variability and context in the identification of fragmented Mandarin tones by native and non‐native listeners. Journal of Phonetics, 37(1), 1–15. Liberman, M. Y., & Church, K. W. (1992). Text analysis and word pronunciation in text‐to‐speech synthesis. In S. Furui & M. M. Sondhi (Eds), Advances in speech technology (pp. 791–832). New York: Marcel Dekker. Liebenthal, E., Desai, R., Humphries, C., et al. (2014). The functional organization of the left STS: A large scale meta‐ analysis of PET and fMRI studies of healthy adults. Frontiers in Neuroscience, 8, 289. Lloyd, R. J. (1890). Some researches into the nature of vowel‐sound. Liverpool: Turner, & Dunnett. Lobanov, B. M. (1971). Classification of Russian vowels spoken by different speakers. Journal of the Acoustical Society of America, 49, 606–608. Maddux, S. D., Butaric, L. N., Yokley, T. R., & Franciscus, R. G. (2017). Ecogeographic variation across morphofunctional units of the human nose. American Journal of Physical Anthropology, 162, 103–119. Magnuson, J. S., & Nusbaum, H. C. (2007). Acoustic differences, listener expectations, and the perceptual accommodation of talker variability. Journal of Experimental Psychology: Human Perception and Performance, 33(2), 391–409 Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. New York: Freeman. Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343, 1006–1010. Miller, J. D. (1989). Auditory‐perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2134. Miller, J. L., Aibel, L L., & Green, K. (1984). On the nature of rate‐dependent

processing during phonetic perception. Perception & Psychophysics, 35, 5–15. Miller, J. L., Green, K., & Schermer, T. M. (1984). A distinction between the effects of sentential speaking rate and semantic congruity on word identification. Perception & Psychophysics 36(4), 329–337. Miller, J. L., & Grosjean, F. (1981). How the components of speaking rate influence perception of phonetic segments. Journal of Experimental Psychology: Human Perception and Performance, 7(1), 208–215. Miller, J. L., & Liberman, A. M. (1979). Some effects of later‐occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 25(6), 457–465. Miller, R. L. (1953). Auditory tests with synthetic vowels. Journal of the Acoustical Society of America, 25, 114–121. Mitterer, H. (2006). Is vowel normalization independent of lexical processing? Phonetica, 63(4), 209–229. Moerel, M., De Martino, F., & Formisano, E. (2012). Processing of natural sounds in human auditory cortex: Tonotopy, spectral tuning, and relation to voice sensitivity. Journal of Neuroscience, 32, 14205–14216. Monahan, P. J., & Idsardi, W. J. (2010). Auditory sensitivity to formant ratios: Toward an account of vowel normalization. Language and Cognitive Processes, 25, 808–839. Moore, C. B., & Jongman, A. (1997). Speaker normalization in the perception of Mandarin Chinese tones. Journal of the Acoustical Society of America, 102, 1864–1877. Morrill, T. H., Dilley, L. C., McAuley, J. D., & Pitt, M. A. (2014). Distal rhythm influences whether or not listeners hear a word in continuous speech: Support for a perceptual grouping hypothesis. Cognition, 131(1), 69–74. Myers, E. B., & Theodore, R. M. (2017). Voice‐sensitive brain networks encode talker‐specific phonetic detail. Brain and Language, 165, 33–44.

172  Perception of Linguistic Properties Näätänen, R., & Winkler, I. (1999). The concept of auditory stimulus representation in cognitive neuroscience. Psychological Bulletin, 125(6), 826–859. Nagel, K. I., & Doupe, A. J. (2008). Organizing principles of spectro‐ temporal encoding in the avian primary auditory area field L. Neuron, 58(6), 938–955. Nearey, T. M. (1978). Phonetic feature systems for vowels. Bloomington: Indiana University Linguistics Club. Nearey, T. M. (1989). Static, dynamic, and relational properties in vowel perception. Journal of the Acoustical Society of America, 85(5), 2088–2113. Newman, R. S., & Sawusch, J. R. (2009). Perceptual normalization for speaking rate III: Effects of the rate of one voice on perception of another. Journal of Phonetics, 37(1), 46–65. Nordström, P. E., & Lindblom, B. (1975). A normalization procedure for vowel formant data. Paper presented at the International Congress of Phonetic Sciences, Leeds, UK. Nott, G. (2018). The ATO now holds the voiceprints of one in seven Australians. Computerworld Retrieved July 28, 2020, from www.computerworld.com/ article/3474235/the‐ato‐now‐holds‐the‐ voiceprints‐of‐one‐in‐seven‐australians. html. Nusbaum, H. C., & Morin, T. M. (1992). Paying attention to differences among talkers. In Y. Tohkura, E. Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure, (pp. 66–94). Tokyo: IOS Press. Obleser, J., & Eisner, F. (2009). Pre‐lexical abstraction of speech in the auditory cortex. Trends in Cognitive Sciences, 13(1), 14–19. Obleser, J., Elbert, T., Lahiri, A., & Eulitz, C. (2003). Cortical representation of vowels reflects acoustic dissimilarity determined by formant frequencies. Brain Research, 15(3), 207–213.

Obleser, J., Lahiri, A., & Eulitz, C. (2004). Magnetic brain response mirrors extraction of phonological features from spoken vowels. Journal of Cognitive Neuroscience, 16(1), 31–33. Overath, T., McDermott, J. H., Zarate, J. M., & Poeppel, D. (2015). The cortical analysis of speech‐specific temporal structure revealed by responses to sound quilts. Nature Neuroscience, 18(6), 903–911. Patterson, R. D., & Irino, T. (2013). The role of size normalization in vowel recognation and speaker identification. Proceedings of Meetings on Acoustics, 19, 060038. Peng, G., Zhang, C., Zheng, H. Y., et al. (2012). The effect of intertalker variations on acoustic–perceptual mapping in cantonese and mandarin tone Systems. Journal of Speech, Language, and Hearing Research, 55(2), 579–595. Pérez‐González, D., & Malmierca, M. S. (2014). Adaptation in the auditory system: An overview. Frontiers in Integrative Neuroscience, 8, 1–10 Pernet, C. R., McAleer, P., Latinus, M., et al. (2015). The human voice areas: Spatial organization and inter‐individual variability in temporal and extra‐ temporal cortices. NeuroImage, 119, 164–174. Peterson, G. E. (1951). The phonetic value of vowels. Language, 27, 541–553. Peterson, G. E. (1961). Parameters of vowel quality. Journal of Speech and Hearing Research, 4, 10–29. Peterson, G. E., & Barney, H. L. (1952). Control methods used in the study of vowels. Journal of the Acoustical Society of America, 24, 175–184. Phillips, E. A. K., Schreiner, C. E., & Hasenstaub, A. R. (2017). Cortical interneurons differentially regulate the effects of acoustic context. Cell Reports, 20, 771–778. Pisanski, K., Fraccaro, P. J., Tigue, C.C., et al. (2014). Vocal indicators of body size in men and women: A meta‐analysis. Animal Behavior, 95, 89–99.

Speaker Normalization in Speech Perception  173 Pitt, M. A., Szostak, C., & Dilley, L. C. (2016). Rate dependent speech processing can be speech specific: Evidence from the perceptual disappearance of words under changes in context speech rate. Attention, Perception, & Psychophysics, 78(1), 334–345. Potter, R., & Steinberg, J. (1950). Toward the specification of speech. Journal of the Acoustical Society of America, 22, 807–820. Price, C. J. (2012). A review and synthesis of the first 20 years of PET and fMRI studies of heard speech, spoken language and reading. NeuroImage, 62(2), 816–847. Rabinowitz, N. C., Willmore, B. D. B., Schnupp, J. W. H., & King, A. J. (2011). Contrast gain control in auditory cortex. Neuron, 70, 1178–1191. Rauschecker, J. P., & Scott, S. K. (2009). Maps and streams in the auditory cortex: Nonhuman primates illuminate human speech processing. Nature Neuroscience, 12, 718–724. Rauschecker, J. P., & Tian, B. (2000). Mechanisms and streams for processing of “what” and “where” in auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 97, 11800–11806. Reby, D., & McComb, K. (2003). Anatomical constraints generate honesty: Acoustic cues to age and weight in the roars of red deer stags. Animal Behavior, 65, 519–530 Reinisch, E., Jesse, A., & McQueen, J. M. (2011a). Speaking rate affects the perception of duration as a suprasegmental lexical‐stress cue. Language and Speech, 54(2), 147–165. Reinisch, E., Jesse, A., & McQueen, J. M. (2011b). Speaking rate from proximal and distal contexts is used during word segmentation. Journal of Experimental Psychology: Human Perception and Performance, 3, 978–996. Reinisch, E., & Sjerps, M. J. (2013). The uptake of spectral and temporal cues in vowel perception is rapidly influenced by context. Journal of Phonetics, 41(2), 101–116.

Roberts, T. P. L., Flagg, E. J., & Gage, N. M. (2004). Vowel categorization induces departure of M100 latency from acoustic prediction. NeuroReport, 15(10), 1679–1682. Saenz, M., & Langers, D. R. M. (2014). Tonotopic mapping of human auditory cortex. Hearing Research, 307, 42–52. Santoro, R., Moerel, M., De Martino, F., et al. (2014). Encoding of natural sounds at multiple spectral and temporal resolutions in the human auditory cortex. PLOS Computational Biology, 10(1), e1003412. Santoro, R., Moerel, M., De Martino, F., et al. (2017). Reconstructing the spectrotemporal modulations of real‐life sounds from fMRI response patterns. Proceedings of the National Academy of Sciences of the United States of America, 114(18), 4799–4804. Sawusch, J., & Newman, R. (2000). Perceptual normalization for speaking rate II: Effects of signal discontinuities. Perception & Psychophysics, 62(6), 1521–1528. Schwippert, C., & Benoit, C. (1997). Audiovisual intelligibility of an androgynous speaker. In C. Benoit & R. Campbell (Eds), Proceedings of the ESCA Workshop on Audiovisual Speech Processing (AVSP ’97): Cognitive and computational approaches (pp. 81–84). Retrieved September 19, 2020, from http://www.isca‐speech.org/archive_ open/avsp97. Scott, S. K., Blank, C. C., Rosen, S., & Wise, R. J. (2000). Identification of a pathway for intelligible speech in the left temporal lobe. Brain, 123(12), 2400–2406. Scott, S. K., & Johnsrude, I. S. (2003). The neuroanatomical and functional organization of speech perception. Trends in Neurosciences, 26(2), 100–107. Shestakova, A., Brattico, E., Soloviev, A., et al. (2004). Orderly cortical representation of vowel categories presented by multiple exemplars. Cognitive Brain Research, 21(3), 342–350.

174  Perception of Linguistic Properties Sjerps, M. J., Fox, N. P., Johnson, K., & Chang, E. F. (2019). Speaker‐normalized sound representations in the human auditory cortex. Nature Communications, 10(1), 1–9. Sjerps, M., McQueen, J. M., & Mitterer, H. (2012). Extrinsic normalization for vocal tracts depends on the signal, not on attention. In Interspeech 2012, 13th Annual Conference of the International Speech Communication Association, Portland, OR, USA, September 9–13, 2012 (pp. 394–397). Retrieved September 19, 2020, from http://www.isca‐speech.org/archive/ interspeech_2012. Sjerps, M. J., McQueen, J. M., & Mitterer, H. (2013). Evidence for precategorical extrinsic vowel normalization. Attention, Perception, & Psychophysics, 75(3), 576–587. Sjerps, M. J., Mitterer, H., & McQueen, J. M. (2011a). Constraints on the processes responsible for the extrinsic normalization of vowels. Attention, Perception, & Psychophysics, 73(4), 1195–1215. Sjerps, M. J., Mitterer, H., & McQueen, J. M. (2011b). Listening to different speakers: On the time‐course of perceptual compensation for vocal‐tract characteristics. Neuropsychologia, 49(14), 3831–3846. Sjerps, M. J., & Smiljanić, R. (2013). Compensation for vocal tract characteristics across native and non‐ native languages. Journal of Phonetics, 41(3–4), 145–155. Slawson, A. W. (1968). Vowel quality and musical timbre as functions of spectrum envelope and fundamental frequency. Journal of the Acoustical Society of America, 43, 87–101. Smith, D. R. R., & Patterson, R. D. (2005). The interaction of glottal‐pulse rate and vocal‐tract length in judgements of speaker size, sex, and age. Journal of the Acoustical Society of America, 118(5), 3177–3186. Steinschneider, M., Nourski, K. V., Kawasaki, H., et al. (2011). Intracranial

study of speech‐elicited activity on the human posterolateral superior temporal gyrus. Cerebral Cortex, 21(10), 2332–2347. Stilp, C. E. (2020). Evaluating peripheral versus central contributions to spectral context effects in speech perception. Hearing Research, 392, 107983. Stilp, C. E., Alexander, J. M., Kiefte, M. J., & Kluender, K. R. (2010). Auditory color constancy: Calibration to reliable spectral properties across nonspeech context and targets. Attention, Perception, & Psychophysics, 72, 470–480. Subramaniam, R. P., Richardson, R. B., Morgan, K. T., et al. (1998). Computational fluid dynamic simulations of inspiratory airflow in the human nose and nasopharynx. Inhalation Toxicology, 10, 91–120. Summerfield, Q. (1981). Articulatory rate and perceptual constancy in phonetic perception. Journal of Experimental Psychology: Human Perception and Performance, 7(5), 1074–1095. Summerfield, Q., Haggard, M., Foster, J., & Gray, S. (1984). Perceiving vowels from uniform spectra: Phonetic exploration of an auditory after effect. Perception & Psychophysics, 35(3), 203–213. Sussman, H. M. (1986). A neuronal model of vowel normalization and representation. Brain Language, 28, 12–23. Syrdal, A. K., & Gopal, H. S. (1986). A perceptual model of vowel recognition based on the auditory representation of American English vowels. Journal of the Acoustical Society of America, 79, 1086–1100. Tang, C., Hamilton, L. S., & Chang, E. F. (2017). Intonational speech prosody encoding in the human auditory cortex. Science, 357(6353), 797–801. Thomas, E. R. (2011). Sociophonetics: An introduction. New York: Palgrave Macmillan. Toscano, J. C., & McMurray, B. (2015). The time‐course of speaking rate compensation: Effects of sentential rate

Speaker Normalization in Speech Perception  175 and vowel length on voicing judgments. Language, Cognition and Neuroscience, 30(5), 529–543. Toscano, J. C., McMurray, B., Dennhardt, J., & Luck, S. J. (2010). Continuous perception and graded categorization electrophysiological evidence for a linear relationship between the acoustic signal and perceptual encoding of speech. Psychological Science, 21(10), 1532–1540. Traunmüller, H. (1981). Perceptual dimension of openness in vowels. Journal of the Acoustical Society of America, 69, 1465–1475. Traunmüller, H. (1984). Articulatory and perceptual factors controlling the age‐ and sex‐conditioned variability in formant frequencies of vowels. Speech Communication, 3, 49–61. Turkeltaub, P. E., & Coslett, B. H. (2010). Localization of sublexical speech perception components. Brain and Language, 114(1), 1–15. Ulanovsky, N., Las, L., Farkas, D., & Nelken, I. (2004). Multiple time scales of adaptation in auditory cortex neurons. Journal of Neuroscience, 24, 10440–10453. Verbrugge, R. R., Strange, W., Shankweiler, D. P., & Edman, T. R. (1976). What information enables a listener to map a talker’s vowel space? Journal of the Acoustical Society of America, 60, 198–212. von Kriegstein, K., Eger, E., Kleinschmidt, A., & Giraud, A. L. (2003). Modulation of neural responses to speech by directing attention to voices or verbal content. Cognitive Brain Research, 17(1), 48–55. von Kriegstein, K., Smith, D. R., Patterson, R. D., et al. (2010). How the human brain recognizes speech in the context of changing speakers. Journal of Neuroscience, 30(2), 629–638. von Kriegstein, K., Warren, J. D., Ives, D. T., et al. (2006). Processing the acoustic effect of size in speech sounds. NeuroImage, 32, 368–375. Wakita, H. (1977). Normalization of vowels by vocal‐tract length and its application to vowel identification. In IEEE

Transactions on Acoustics, Speech, and Signal Processing (pp. 183–192). Waldron, E. J., Manzel, K., & Tranel, D. (2014). The left temporal pole is a heteromodal hub for retrieving proper names. Frontiers in Bioscience (Scholar edition), 6, 50–57. Walker, S., Bruce, V., & O’Malley, C. (1995). Facial identity and facial speech processing: Familiar faces and voices in the McGurk effect. Perception & Psychophysics, 57(8), 1124–1133. Watkins, A. J. (1988). Spectral transitions and perceptual compensation for effects of transmission channels. In W. Ainsworth & J. Holmes (Eds.), Proceedings of the 7th Symposium of the Federation of Acoustical Societies of Europe: Speech ’88 (pp. 711–718). Edinburgh: Institute of Acoustics. Watkins, A. J. (1991). Central, auditory mechanisms of perceptual compensation for spectral‐envelope distortion. Journal of the Acoustical Society of America, 90, 2942–2955. Watkins, A. J., & Makin, S. J. (1994). Perceptual compensation for speaker differences and for spectral‐envelope distortion. Journal of the Acoustical Society of America, 96, 1263–1282. Watt, D., & Fabricius, A. (2002). Evaluation of a technique for improving the mapping of multiple speakers’ vowel spaces in the F1–F2 plane. Leeds Working Papers in Linguistics and Phonetics, 9, 159–173. Retrieved September 19, 2020, from https://www.researchgate.net/ publication/239566480_Evaluation_of_a_ technique_for_improving_the_mapping_ of_multiple_speakers’_vowel_spaces_in_ the_F1‐F2_plane. Wong, P. C., & Diehl, R. L. (2003). Perceptual normalization for inter‐and intratalker variation in Cantonese level tones. Journal of Speech, Language, and Hearing Research, 46(2), 413–421. Woolley, S. M., Fremouw, T. E., Hsu, A., & Theunissen, F. E. (2005). Tuning for spectro‐temporal modulations as a

176  Perception of Linguistic Properties mechanism for auditory discrimination of natural sounds. Nature Neuroscience, 8(10), 1371–1379. Young, E. D. (2008). Neural representation of spectral and temporal information in speech. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 363(1493), 923–945. Zäske, R., Hasan, B. A. S., & Belin, P. (2017). It doesn’t matter what you say: fMRI correlates of voice learning and recognition independent of speech content. Cortex, 94, 100–112. Zhang, C., Peng, G., & Wang, W. S. Y. (2012). Unequal effects of speech and nonspeech contexts on the perceptual normalization of Cantonese level tones.

Journal of the Acoustical Society of America, 132(2), 1088–1099. Zhang, C., Peng, G., & Wang, W. S. Y. (2013). Achieving constancy in spoken word identification: Time course of talker normalization. Brain and Language, 126(2), 193–202. Zhang, C. C., & Chen, S. (2016). Toward an integrative model of talker normalization. Journal of Experimental Psychology: Human Perception and Performance, 42(8), 1252–1268. Zimman, Lal (2017).Gender as stylistic bricolage: Transmasculine voices and the relationship between fundamental frequency and /s/. Language in Society, 46(3), 339–370.

7 Clear Speech Perception: Linguistic and Cognitive Benefits

RAJKA SMILJANIC
The University of Texas at Austin, United States

A quick internet search for tips on how to talk to listeners with perceptual difficulties reveals some common but also contradictory strategies. The instructions include "speak clearly with more volume," "use natural volume, intonation, and gestures," "speak clearly, slowly, distinctly, but naturally, without shouting or exaggerating mouth movements," "limit background noise," "build in pauses to facilitate comprehension and allow the patient to ask questions," and "face the hearing-impaired person directly" (US Department of Health and Human Services, n.d.; Lubinski, 2010; UCSF Health, n.d.). While these tips reflect our intuitions about how to make ourselves better understood in everyday interactions, a thorough review of the "clear speech" literature reveals large gaps in our understanding of which acoustic-articulatory features actually contribute to the intelligibility improvement, which cognitive-perceptual processes are aided, and which listener groups benefit from these modifications.

In the broadest sense, clear speaking styles are goal-oriented modifications in which talkers adapt their output in response to communication challenges, such as noise-adapted speech (NAS; Lombard speech), infant-directed speech, foreigner-directed speech, and speech produced in response to vocoded speech (Cristia, 2013; Johnson et al., 2013; Cooke & Lu, 2010; Junqua, 1993; Lombard, 1911; Summers et al., 1988; Uther, Knoll, & Burnham, 2007). This chapter will focus on the characteristics and effectiveness of clear speech (CS) aimed at enhancing intelligibility for adult interlocutors with perceptual difficulties arising from hearing loss, low proficiency, or environmental noise. In this narrower sense, CS modifications involve different acoustic-phonetic goals from, for instance, modifications aimed at increasing affective prosody to capture children's attention. The review considers results from studies using read laboratory speech as well as a growing body of work using interactive tasks and different communication barriers for a more comprehensive account of the CS intelligibility benefit. The current discussion builds on previous reviews of CS production and perception (Smiljanic & Bradlow, 2009; Uchanski, 2005) by focusing on more recent work and by calling attention to the mechanisms that may underlie CS-enhanced intelligibility. It also supplements two reviews: one comparing the effect of instruction and communication environment on speech intelligibility (Pichora-Fuller, Goy, & van Lieshout, 2010), the other examining algorithmic and human context-induced speech modifications and their effect on speech processing (Cooke et al., 2013a). The review starts by looking at the acoustic-articulatory features of conversational-to-clear-speech modifications and continues by examining evidence of their effect on linguistic and cognitive processing. The role of talker and listener characteristics as sources of variation in CS production and perception is considered in turn as well. Finally, work on signal processing as a way of assessing the link between acoustic-phonetic cue enhancements and improved intelligibility is discussed briefly.

To better understand the talker–listener attunement and the dynamic nature of felicitous speech communication, it is important to examine CS production and perception in tandem. Considering CS from these varied perspectives will deepen our understanding of the interaction between signal-dependent sensory factors and relatively signal-independent cognitive factors that underlie improved speech processing. The research discussed here contributes toward a more complete account of the compensatory and cognitive mechanisms that allow listeners to understand and remember speech under a range of communicative situations, including when it is masked by environmental noise or when communicating in a nonnative language. At the same time, the findings have implications for improving the quality of speech communication and clinical practices for populations in which intelligibility is compromised, including hearing-aid users, individuals with dysarthria, and second-language learners, as well as for provider–patient communication and health-care outcomes, aviation training, and GPS and customer-service systems (Krause & Braida, 2009; Zeng & Liu, 2006; Godoy, Koutsogiannaki, & Stylianou, 2014; Tjaden, Lam, & Wilding, 2013; Tjaden, Kain, & Lam, 2014; Tjaden, Sussman, & Wilding, 2014; Park et al., 2016; Huttunen et al., 2011; Kornak, 2018).

Characteristics of clear speech production and their effect on linguistic and cognitive processes

To overcome the various communication barriers, talkers spontaneously modify their output from hypo- to hyper-articulated speech (H&H theory; Lindblom, 1990; Perkell et al., 2002), reflecting the needs of their interlocutor and the listening environment. When the communicated information is transmitted without distortions and is easy to understand, talkers produce conversational, fast, reduced forms of speech. When access to the speech signal is impeded or the signal is degraded in some way, talkers produce hyper-articulated clear speech forms (Wassink, Wright, & Franklin, 2007; Scarborough & Zellou, 2013). A number of conversational-to-clear-speech modifications that could give rise to the CS intelligibility benefit are well documented. Relative to conversational speech, CS is typically characterized by a decrease in speaking rate (longer segments as well as longer and more frequent pauses), an increase in vocal levels, a wider F0 range and higher F0 mean, more salient stop releases, vowel and consonant contrast enhancement, greater obstruent rms energy, and increased energy at higher frequencies (Picheny, Durlach, & Braida, 1986; Smiljanic & Bradlow, 2005; Hazan & Baker, 2011; Pichora-Fuller, Goy, & van Lieshout, 2010; Bradlow, Kraus, & Hayes, 2003; Ferguson & Kewley-Port, 2002, 2007; Krause & Braida, 2004; Granlund, Hazan, & Baker, 2012; Maniwa, Jongman, & Wade, 2009; Van Engen, Chandrasekaran, & Smiljanic, 2012; Gilbert, Chandrasekaran, & Smiljanic, 2014; Liu et al., 2004). These acoustic-phonetic adjustments aid the listener in a number of ways, including by augmenting the speech signal, by enhancing language-specific phoneme distinctions, and by facilitating linguistic processing and cognitive functioning associated with speech perception (Cooke et al., 2013a; Lansford et al., 2011; Bradlow & Bent, 2002; Ferguson, 2012; Krause & Braida, 2002; Payton, Uchanski, & Braida, 1994; Picheny, Durlach, & Braida, 1985; Schum, 1996). This section examines the evidence linking speech modifications with signal-dependent and relatively signal-independent processing benefits.1 It is important to keep in mind that many of the modifications likely work in tandem to facilitate processing at multiple levels.
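To make the global measures listed above concrete, the sketch below computes a few of them (duration, overall RMS level, the fraction of spectral power between 1 and 3 kHz, and a crude pause proportion) from a mono WAV file using only NumPy and SciPy. The file names, frame length, and silence threshold are illustrative assumptions rather than values taken from any of the studies cited, and published work typically relies on more careful segmentation and level calibration.

```python
# Minimal sketch: global acoustic measures often reported in clear-speech studies.
# File names and thresholds are hypothetical; this is not the procedure of any cited study.
import numpy as np
from scipy.io import wavfile
from scipy.signal import welch

def global_measures(path, pause_db_floor=-40.0, frame_ms=25):
    sr, x = wavfile.read(path)                 # sample rate, samples
    x = x.astype(np.float64)
    if x.ndim > 1:                             # collapse stereo to mono if needed
        x = x.mean(axis=1)
    x /= (np.max(np.abs(x)) + 1e-12)           # normalize to +/-1 for comparable dB values
    dur = len(x) / sr                          # total duration in seconds

    # Overall level (dB re. full scale) from RMS energy.
    rms_db = 20 * np.log10(np.sqrt(np.mean(x ** 2)) + 1e-12)

    # Fraction of spectral power between 1 and 3 kHz (Welch PSD estimate).
    f, psd = welch(x, fs=sr, nperseg=2048)
    band = (f >= 1000) & (f <= 3000)
    band_fraction = psd[band].sum() / (psd.sum() + 1e-12)

    # Crude pause proportion: 25 ms frames whose RMS falls below a fixed dB floor.
    n = int(sr * frame_ms / 1000)
    frames = x[: len(x) // n * n].reshape(-1, n)
    frame_db = 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)) + 1e-12)
    pause_prop = float(np.mean(frame_db < pause_db_floor))

    return {"duration_s": dur, "rms_db": rms_db,
            "energy_1_3kHz_fraction": band_fraction, "pause_proportion": pause_prop}

# Hypothetical usage: compare a conversational and a clear rendition of the same sentence.
# for style in ("conversational.wav", "clear.wav"):
#     print(style, global_measures(style))
```

In a clear rendition of the same sentence, one would typically expect a longer duration, a larger pause proportion, and relatively more energy in the 1–3 kHz region than in the conversational rendition, in line with the modifications described above.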

Signal enhancement

Some CS modifications augment the global salience of the speech signal. This is essential, as access to the signal may be impeded by environmental noise, competing speech, or listener-related perceptual difficulty. Increased intensity, slower speaking rate, longer vowels, and increased F0 mean, range, and energy in medium and high frequencies all promote the audibility of the signal, facilitating its transmission and making it more robust to noise masking (Junqua, 1993; Godoy, Koutsogiannaki, & Stylianou, 2014; Cooke et al., 2013a). In order to better understand how different communicative goals determine talker adaptations, Smiljanic and Gilbert (2017a) directly compared the acoustic characteristics of CS and noise-adapted speech. To produce CS, talkers were instructed to read sentences as if they were talking to someone with low proficiency in English who has difficulty following them conversationally. NAS was produced in response to hearing speech-shaped noise through headphones. Relative to conversational speech and speech produced in quiet, both styles involved similar changes in pitch, energy, and duration, contributing to the global augmentation of the speech signal. The two adaptations also differed, with the frequency and duration of pauses increased in CS only. This suggests that, in addition to augmenting the signal as NAS does, CS is a more intentional adaptation, providing listeners with more time to process the signal and highlighting the prosodic structure (CS and NAS benefits beyond signal enhancement are discussed in the next section). The two speaking adaptations thus reflect the specific communicative challenges the talkers try to overcome (see also Hazan & Baker, 2011; Cooke & Lu, 2010; Lu & Cooke, 2008; Maniwa, Jongman, & Wade, 2009; Smiljanic & Bradlow, 2008a, 2008b; Scarborough & Zellou, 2013; Lam, Tjaden, & Wilding, 2012; Hazan et al., 2018a).
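Several of the studies above elicit or test speech in noise at controlled signal-to-noise ratios (SNRs), for example by mixing sentences with speech-shaped noise or multitalker babble. As a minimal illustration of the standard RMS-based scaling involved, the sketch below mixes a target signal with a masker at a chosen SNR; the array names and the example SNR values are hypothetical and are not drawn from any of the studies cited.

```python
# Minimal sketch: mix a speech signal with a masker at a target SNR (in dB).
# `speech` and `masker` are assumed to be equal-length mono NumPy arrays.
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    rms = lambda v: np.sqrt(np.mean(v ** 2) + 1e-12)
    # Scale the masker so that 20*log10(rms(speech) / rms(scaled masker)) equals snr_db.
    scale = rms(speech) / (rms(masker) * 10 ** (snr_db / 20))
    mix = speech + scale * masker
    return mix / (np.max(np.abs(mix)) + 1e-12)   # renormalize to avoid clipping on playback

# Illustrative use: a harder (-5 dB) and an easier (+5 dB) listening condition.
# hard = mix_at_snr(speech, babble, -5)
# easy = mix_at_snr(speech, babble, +5)
```

Note that this only fixes the relative level of target and masker; perception studies additionally calibrate the absolute presentation level at the listener's ear.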

Enhancement of linguistic structure

CS acoustic-articulatory modifications also enhance language-specific linguistic structure through increased distinctiveness of phonemic vowel and consonant contrasts and of prosodic information. Expansion of the vowel space area (VSA) has been reliably found in CS, with corroborating evidence in the acoustic (Picheny, Durlach, & Braida, 1986; Moon & Lindblom, 1994; Ferguson & Kewley-Port, 2002, 2007; Bradlow, Kraus, & Hayes, 2003; Krause & Braida, 2004; Granlund, Hazan, & Baker, 2012) and articulatory domains (Song, 2017; Kim, Sironic, & Davis, 2011; Kim & Davis, 2014; Tang et al., 2015; Tasko & Greilick, 2010). The increased distinctiveness of vowel categories in CS has been found even in languages with few vowel categories, which are presumably already perceptually less confusable, suggesting that this may be a global feature of the hyper-articulated speaking style (Smiljanic & Bradlow, 2005; Bradlow, 2002). While VSA expansion has been documented in other listener-oriented speaking styles (Cristia, 2013; Kuhl et al., 1997), it is not typically present in NAS, where a shift toward higher first formant frequencies is found, in line with the increased vocal effort (Godoy, Koutsogiannaki, & Stylianou, 2014; Summers et al., 1988; Lu & Cooke, 2008; Davis & Kim, 2012). Smiljanic and Gilbert (2017a), however, found a similar VSA increase in CS and NAS, suggesting that talkers can increase articulatory precision when speaking in response to noise, presumably to facilitate language-specific sound category identification. This specific feature associated with clarity thus may not be incompatible with the increased vocal effort induced by environmental noise. This is important because hyper-articulated vowel categories, while not necessary (see Krause & Braida, 2004), have been linked to enhanced intelligibility (Picheny, Durlach, & Braida, 1986; Hazan & Markham, 2004; Ferguson & Kewley-Port, 2002).

Similar to the spectral enhancement, vowel duration contrasts are enlarged in CS through greater lengthening of tense compared to lax vowels (Ferguson & Kewley-Port, 2002; Picheny, Durlach, & Braida, 1986; Uchanski et al., 1996) and of long compared to short vowels (Bradlow, 2002; Smiljanic & Bradlow, 2008a; Granlund, Hazan, & Baker, 2012). However, Leung et al. (2016) found that English tense vowels were modified more in the temporal domain, while lax vowels showed greater spectral modifications, revealing a trade-off in the use of temporal and spectral cues in enhancing vowel contrasts. For consonant distinctions, Maniwa, Jongman, and Wade (2009) showed that the English fricative phonemic contrast was enlarged through duration, spectral peak frequency, and spectral moments modifications. Similar enhancement was found for the three-way manner contrast in Korean stops (Kang & Guion, 2008). Examining more closely how the phonemic contrasts are enhanced in CS, Tuomainen, Hazan, and Romeo (2016) found that the cross-category distance for word-initial /s/ and /ʃ/ was increased, but there was no change in within-category dispersion and discriminability. That is, talkers did not produce less variable CS categories, which would indicate more precise and consistent articulation.2 Some talkers, however, produced fewer sounds with centroids in the ambiguous region for the /s/–/ʃ/ distinction, which reduced the number of potentially confusable tokens.3

Not all conversational-to-clear-speech modifications, however, are geared toward segmental and structural enhancements. Baker and Bradlow (2009) showed that second-mention words, those that had previously occurred in the discourse, were overall shorter than first-mention words. The listener-oriented hyper-articulation strategies did not override probabilistic effects on word duration, as the second-mention reduction was maintained in CS. Smiljanic and Bradlow (2008b) similarly found that temporal/rhythmic organization, measured through the variation of consonantal and vocalic interval durations, remained the same in CS and conversational speech. Finally, Scarborough and Zellou (2013) demonstrated that CS produced in the presence of a real communicative partner was characterized by greater coarticulatory nasality than CS produced in the absence of a communicative partner, that is, spoken "as if to someone hard-of-hearing." This difference was found even though talkers in both conditions produced greater hyper-articulation compared to the baseline conversational speech. Importantly, lexical decisions were faster for words from the real listener-directed speech, suggesting that speech produced with hyper-articulation and increased coarticulation was perceptually most beneficial.

The work reviewed so far demonstrates that, in addition to enhancing the salience of the signal itself, CS enhances linguistic structure while maintaining global temporal properties and coarticulation patterns. The results highlight the importance of examining in greater detail the trade-off between the maintenance of production norms and cue enhancements in the CS intelligibility benefit. Future work should also investigate CS enhancements across multiple acoustic domains and across languages to fully understand how signal-related and language-specific structural factors interact in increasing phonemic distinctions. A more fine-grained understanding of what constitutes phonemic contrast enhancement will shed light on how phonetic cue distributions relate to improved intelligibility and learning. This understanding will have important implications for implementing signal processing algorithms in a more naturalistic manner.
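As a concrete illustration of the VSA measure discussed above, the sketch below computes the area of the corner-vowel polygon in the F1 × F2 plane with the shoelace formula. The formant values are invented for illustration only and are not taken from any study cited in this chapter; studies differ in which corner vowels, frequency scales (Hz, Bark, ERB), and talkers enter the calculation.

```python
# Minimal sketch: vowel space area (VSA) as the area of the corner-vowel polygon
# in the F1 x F2 plane, computed with the shoelace formula. The formant values
# below are invented placeholders, not data from any cited study.
import numpy as np

def vowel_space_area(formants):
    """formants: list of (F1, F2) means in Hz, ordered around the polygon."""
    f1 = np.array([p[0] for p in formants], dtype=float)
    f2 = np.array([p[1] for p in formants], dtype=float)
    # Shoelace formula over the closed polygon /i/ -> /ae/ -> /a/ -> /u/ -> /i/.
    return 0.5 * abs(np.dot(f1, np.roll(f2, -1)) - np.dot(f2, np.roll(f1, -1)))

# Hypothetical corner-vowel means (F1, F2) for one talker, in Hz: /i, ae, a, u/.
conversational = [(350, 2200), (650, 1900), (700, 1100), (380, 950)]
clear          = [(310, 2400), (720, 2000), (780, 1050), (340, 870)]
print(vowel_space_area(conversational), vowel_space_area(clear))  # clear > conversational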

Linguistic processing and cognitive functioning

While increased intelligibility is a well-documented CS benefit, little is known about how conversational-to-clear-speech modifications interact with and facilitate different linguistic processes and cognitive functioning (Cooke et al., 2013a; Lansford et al., 2011). This section first details findings about the CS effect on semantic processing, auditory stream segregation, lexical access, and lexical competition in adverse listening conditions, as well as how these various processes interact. It then examines evidence of the CS effect on downstream processing, namely auditory memory, and the benefit for different listener populations. The section concludes with a consideration of processing models that could account for and unify these findings.

To gain insight into the interplay between various signal-dependent and relatively signal-independent processes, Van Engen et al. (2014) assessed word recognition for conversational and clear speech, semantically meaningful and anomalous sentences, and auditory-only (AO) and audiovisual (AV) modalities in maskers that varied in the degree of energetic and informational masking (two-talker, four-talker, and eight-talker babble, and speech-shaped noise). Focusing on the interactions involving speaking style, the results showed that the CS intelligibility benefit was greater for anomalous than for meaningful sentences. That is, the CS acoustic-phonetic enhancements were more useful in a more difficult listening situation, when sentence context cues were absent. In contrast, the CS benefit was found to be larger in the AV condition than in the AO condition (see also Gagné et al., 1994, 1995; Gagné, Rochette, & Charest, 2002), thus running counter to the principle of inverse effectiveness, which would predict greater benefit from visual cues for the more challenging conversational speech (Stein & Meredith, 1993; see also Helfer, 1997). Here, exaggerated CS articulatory gestures enhanced the signal-complementary visual cues that aid listeners in recovering auditory information lost due to energetic masking. These enhanced cues may also help listeners take advantage of temporal information in the visual signal, allowing them to attend to the correct auditory stream in the presence of multiple talkers (to cope with informational masking). Once stream segregation is achieved, listeners can use the CS acoustic enhancements to guide recognition of the target speech.

In addition to being masked by noise, portions of speech signals may be inaudible or missing due to interruptions in transmission. Smiljanic et al. (2013) investigated listeners' ability to integrate information across missing and retained speech portions of different durations. They also examined how signal-dependent acoustic and signal-independent semantic enhancements aided listeners in filling in the missing information. The interruption rate manipulation determines the duration of the retained and removed portions of speech within each interruption cycle, so that these intervals vary from an approximately subphonemic duration (fast interruption rate) to word length or greater (slow interruption rate). The results showed that both semantic context and enhanced acoustic-phonetic CS cues aided listeners in integrating temporally distributed audible speech fragments into coherent percepts. Importantly, CS "shifted" the contextual benefit to lower gating rates, enabling listeners to use signal-independent semantic cues at slower/more difficult interruption rates, where the silent intervals were on average longer than word durations. These results demonstrate that different sources of intelligibility-enhancing information interact with one another and with the listening conditions in complex ways that need to be further delineated.

There is also evidence that CS reduces lexical competition and enhances lexical access. Scarborough and Zellou (2013) showed that words with many phonological neighbors are produced with greater hyper-articulation in CS compared to words with few phonological neighbors. This asymmetrical CS enhancement facilitated lexical decisions for high-neighborhood words. Van Engen (2017) similarly found that CS was helpful in reducing lexical competition for high-neighborhood-density words in noise for older adults. Using a visual word recognition paradigm, van der Feest, Blanco, and Smiljanic (2019) examined how the signal enhancements of two speaking-style adaptations, adult-oriented CS and infant-directed speech (IDS), and semantic context interact to determine the time course of word recognition in quiet and in noise. Relative to low-predictability conversational sentences, both CS and IDS increased the speed of word recognition for high-predictability sentences in quiet and in noise for young adult listeners. In the quiet condition, lexical access was eventually facilitated by semantic cues even in conversational speech. In noise, however, listeners reliably focused on the target only when a combination of exaggerated acoustic-phonetic and semantic cues was available.

Only a handful of studies to date have examined the effect of speech clarity on downstream processing, specifically looking at memory performance. Van Engen, Chandrasekaran, and Smiljanic (2012) found that CS led to better performance on a sentence recognition memory task compared to conversational speech. That is, listeners were better able to recognize sentences as previously heard when they were spoken in a clear style. Gilbert, Chandrasekaran, and Smiljanic (2014) extended these findings to show that both CS and NAS enhanced sentence recognition memory even when listeners were initially exposed to sentences mixed with noise. The observed differences in recognition memory could not be attributed to differences in whether the sentences were recognized correctly, because all sentences were presented either in quiet (Van Engen, Chandrasekaran, & Smiljanic, 2012) or with low-level noise (Gilbert, Chandrasekaran, & Smiljanic, 2014), making them fully intelligible to listeners. The findings instead suggest that the exaggerated acoustic-phonetic cues in CS and NAS enhanced memory traces for sentences produced in that style. Keerstock and Smiljanic (2018) further showed that listener-oriented CS improved sentence recognition memory regardless of whether the acoustic signal was present during the test phase (within-modal task, in which both exposure and test include the same audio stimuli of the clear and conversational sentences) or absent (cross-modal task, in which the listeners hear the audio stimuli during exposure but only see written sentences in the test phase). The CS benefits observed in a recognition memory task were likewise shown in recall, a more complex memory task (Keerstock & Smiljanic, 2019). The intelligibility-enhancing CS cues boosted verbatim recall of individual words and entire sentences, and reduced the rate of errors and omitted answers. This suggests that speech encoding in the exposure phase is not at the auditory signal level only; rather, CS facilitates deeper linguistic processing abstracted from the input speech (at phonological, lexical-semantic, morphosyntactic, and syntactic levels), allowing recall of larger units of connected meaning.

Speaking clearly also facilitated memory for spoken information in populations with auditory-peripheral and central-cognitive deficits. Using ecologically relevant, complex medical prescription stimuli, DiDonato and Surprenant (2015) showed that older adults' immediate and delayed recall was improved when they heard the instructions in CS compared to conversational speech. The better memory performance for CS instructions was maintained even in the presence of competing speech, a communication challenge that is particularly difficult for older adults. Keerstock and Smiljanic (2018, 2019) demonstrated that the CS memory benefit extended to nonnative listeners despite the increased cognitive demands of L2 speech perception. However, the results also showed that the CS benefit on sentence recognition memory was smaller for nonnative listeners than native listeners in the absence of the auditory triggers (cross-modal condition). This suggests that L2 auditory memory may rely more on signal-dependent information and that listeners need the exaggerated acoustic-phonetic cues to be present (within-modality condition) to activate memory traces.

The research discussed here, although sparse, shows that CS facilitates speech recognition through various mechanisms, including reduced lexical competition, faster lexical access, and improved use of signal-independent information. CS also provides advantages in downstream processes such as the encoding of speech information in memory. All of these findings are in line with processing models that invoke increased cognitive load and listening effort4 when speech comprehension is challenging due to signal degradation or listener characteristics (McCoy et al., 2005; Rabbitt, 1968, 1990; Rönnberg et al., 2008, 2013; Pichora-Fuller et al., 2016; Zekveld, Kramer, & Festen, 2010, 2011; Peelle, 2018; Schneider et al., 2019; Van Engen & Peelle, 2014; Francis & Love, 2019; Winn, Edwards, & Litovsky, 2015). Within these approaches, cognitive resources are limited and perceptual processes interact with cognitive processes for speech comprehension. When acoustic information is well specified and easily accessed, its mapping onto phonological and lexical representations and meaning is rather automatic and implicit. When the speech signal is degraded or access to it is impeded, extracting meaning becomes cognitively more demanding and listening effort is increased. The increased processing cost results in fewer resources available for information integration, learning, and memory. Because CS provides more robust and salient cues through signal enhancement and the enhancement of linguistic structure, it decreases task demands and likely reduces listening effort, so that more resources remain available for comprehension and memory of spoken information. CS can also relieve some of the sensory and cognitive deficits associated with aging and with L2 listening. Regardless of the exact locus of the increased difficulty, the results reviewed here suggest that CS may allow for improved comprehension and memory of spoken information in part by reducing task demands and listening effort. Figure 7.1 provides a schematic illustration of cognitive and linguistic resource allocation during perception of speech of varying intelligibility.

While cognitive load and listening effort are increasingly recognized as factors in speech processing, especially for nonnative speakers, older adults, or when listening to speech in noise, it is not yet clear what systems and responses are involved and in what way (Francis & Love, 2019; Pichora-Fuller et al., 2016). Our understanding of the links between intelligibility-enhancing listener-oriented speaking styles, cognitive load, listening effort, and linguistic processing in particular is lacking. One way to increase this understanding is by determining which CS features related to signal enhancement, as opposed to the enhancement of linguistic structure, underlie the observed improvements in linguistic and cognitive processing. How do specific enhanced acoustic-phonetic cues facilitate the integration of semantic information, lexical access, or the encoding of information in long-term memory?

[Figure 7.1 appears here: a two-panel schematic contrasting degraded and enhanced speech intelligibility, with panel labels for attention, motivation, working memory, encoding, recall, lexical access, lexical competition, semantic prediction, phoneme identification, and processing speed.]

in long‐term memory? Considering the ways in which cognitive resources, working memory capacity, and attention, for instance, are engaged during CS perception can help elucidate the behavioral and physiological correlates of listening effort. A promising direction toward that goal is to use physiological measures, such as pupillometry, electrodermal activity, and brain imaging, in addition to behavioral methods, to quantify CS listening effort (see Borghini & Hazan, 2018; Zekveld et al., 2010; Zekveld, Kramer, & Festen, 2011; Winn, Edwards, & Litovsky, 2015; Francis & Love, 2019).

Variability in CS production As detailed, talkers habitually vary their spoken output from a hypo‐ to hyper‐ articulated manner in response to the needs of a particular communicative situation (H&H theory; Lindblom, 1990; Perkell et al., 2002). This intricate listener‐ oriented talker adaptation is a skilled aspect of speech production reflecting the dynamic nature of speech communication. This section examines how different tasks, communicative barriers, and talker‐related characteristics affect speaking style modifications aimed at enhancing speech understanding.

Tasks and communicative barriers Lam, Tjaden, and Wilding (2012) compared talkers’ responses to the instructions to speak clearly, to speak as if they were talking to a listener with hearing impairment,

186  Perception of Linguistic Properties and to overenunciate. They found that each condition was associated with different magnitudes of acoustic modification. Compared to habitual baseline speech, for instance, the overenunciate condition yielded the largest change in the VSA, speaking rate, and segment and pause durations, followed by the instructions to talk to someone with a hearing impairment. This last elicitation condition led to the greatest increase in vocal intensity, but only for half of the talkers. A number of studies furthermore demonstrated that speech characteristics vary in response to more naturalistic communicative challenges, such as different types of noise (e.g., Cooke & Lu, 2010; Lu & Cooke, 2008, 2009), CS in combination with noise (Smiljanic & Gilbert, 2017a; Godoy, Koutsogiannaki, & Stylianou, 2014; Goy, Pichora‐Fuller, & van Lieshout, 2013; Gilbert, Chandrasekaran, & Smiljanic, 2014; also see Pichora‐ Fuller, Goy, & van Lieshout, 2010 for a review of CS and NAS), or by means of simulated recognition errors in feedback received from an interactive computer program (Maniwa, Jongman, & Wade, 2009; Stent, Huffman, & Brennan, 2008). Several studies examined closely the talker–listener dynamic adaptation via more spontaneous responses to communication challenges. Hazan and Baker (2011), for example, used an interactive spot‐the‐difference picture task with varying difficulty in the listening conditions: no communicative barrier baseline, with one talker hearing the other via a vocoder or with one talker hearing the simultaneous multitalker babble. Under this scenario, comprehension was reduced for one talker, so the communication partner had to clarify their speech to complete the task even though they did not experience communicative difficulty themselves. The results showed that talkers modified their speech in a selective manner. They increased mean energy and vowel F1 more in the babble condition than in the vocoded condition. They did not increase F0 median and range in the vocoded condition relative to the no‐barrier condition, where these modifications were unlikely to improve intelligibility. They also found that read CS, in which talkers read out loud sentences as if they were talking to someone with a hearing difficulty, showed more extreme changes in median F0, F0 range, and speaking rate relative to the more spontaneous changes in response to the actual communicative interference via vocoded speech or babble noise. Cooke and Lu (2010) similarly found differences in NAS produced when solving a sudoku puzzle with compared to without a partner. Ultimately, the differences between the fine‐tuned, goal‐oriented modifications are meaningful only to the extent that they aid the listener in particular listening situations. When they were conducted, perception tests revealed that indeed the intelligibility benefit increased with the magnitude of acoustic changes (Lam & Tjaden, 2013). Furthermore, the benefit was highest for the listening conditions in which the signal masker matched the one used during the elicitation (e.g. elicited with speech babble and tested mixed with speech babble) (Smiljanic & Gilbert, 2017b; Lu & Cooke, 2009; Cooke et al., 2013b; Hazan, Grynpas, & Baker, 2012). Hazan, Grynpas, and Baker (2012) showed that the matched‐masker tokens were also identified more quickly compared to the mismatched‐masker tokens (e.g. 
elicited in response to vocoded speech and tested mixed with speech babble) even though all productions were rated as similarly clear. These results strengthen

Clear Speech  187 evidence that spoken output is tailored to most efficiently overcome properties of specific communicative barriers and maskers. They also show that locus of the CS intelligibility benefit is not only evident in the increased recognition accuracy but also in the speed of recognition. Importantly, this work expands our understanding of how the acoustic‐phonetic cues are modified “online,” in response to the challenging conditions with a communicative partner.

CS across the life span Although there is ample evidence that talkers adapt their speech output to facilitate communication, most studies have primarily involved young healthy adults. Less work examined the effect of aging and development on talkers’ ability to produce intelligibility‐enhancing speaking style modifications. A few studies compared how older healthy adults and individuals with Parkinson’s disease (PD), amyotrophic lateral sclerosis (ALS), or multiple sclerosis (MS) change their output when instructed to speak clearly, slowly, or in response to different levels of background noise (Huber & Darling, 2011; Darling & Huber, 2011; Turner Tjaden, & Weismer, 1995; Tjaden, Lam, & Wilding, 2013; Adams et  al.,  2006, Adams, Winnell, & Jog, 2010; Sadagopan & Huber, 2007). The results showed that healthy older adults were more successful in implementing some of the typical adjustments (increased vocal intensity, VSA, abdominal effort, and decreased speaking rate) than the individuals with PD, ALS, or MS. In a direct comparison of older adult and young adult talkers, Schum (1996) and Smiljanic (2013) found that intelligibility was enhanced through conversational to clear speech modifications for both groups. The findings were inconsistent, though, in that the first study found the similar CS intelligibility benefit for older and young adults, while the second study found the smaller CS benefit for older adults. Looking at both production and perception, Smiljanic and Gilbert (2017a, 2017b) compared the intelligibility of conversational, CS, and NAS sentences produced by 10 children, 10 young adults, and 10 older adults. Young adult listeners benefited from the CS and NAS acoustic‐phonetic changes produced by all three talker groups, though the intelligibility benefit was marginally smaller for the older adults compared to the younger adults, and significantly smaller for the children compared to the adult talkers. These differences in intelligibility were mirrored by the differences in the CS and NAS acoustic‐phonetic modifications implemented by the three talker groups. For instance, older adults produced the slowest speaking rate, longest pauses, and smallest increase in F0 mean, 1–3 kHz energy, sound pressure level (SPL) and VSA when speaking clearly, while children slowed down less and increased the VSA least in CS. Despite this variability, young adult listeners found high‐ and low‐intelligibility talkers in all three age groups, revealing a need for a closer examination of the role that sensory, linguistic, and cognitive factors and age‐related changes play in talker intelligibility variation. Hazan et al. (2018a, 2018b) similarly found that aging and age‐related hearing loss (presbycusis) affected spontaneous responses to communication challenges using the spot‐the‐difference task described earlier. Older and young adults’

188  Perception of Linguistic Properties productions differed in speaking rate, mid‐frequency energy, and F0 characteristics even when the talkers heard each other without a communication barrier, revealing age‐related physiological changes in speech production. While older adults without hearing loss and younger adults made similar modifications when responding to the partner with a simulated hearing loss or to babble noise, older adults with hearing loss most notably differed from both groups in that they significantly increased vocal effort in response to both communicative challenges. Older adults with hearing loss also took significantly longer than younger adults to complete the spot‐the‐difference task. The correlation analyses indicated that the energy present in the 1–3 kHz frequency range, precisely where older adults failed to make significant adjustments, was the best predictor of intelligibility and task efficiency. These findings strengthen evidence that age‐related hearing loss and even mild presbycusis affect the dynamic adaptations needed for effective communication. At the other end of the age spectrum, children as young as three to five years old were found to produce perceptibly different speaking styles, though the exact modifications varied: children increased F0 range and produced larger VSA and longer vowels in Syrett and Kawahara (2013), while no differences in vowel formants were found in Redford and Gildersleeve‐Neumann (2009). The differences from the adult‐like CS modifications were found still in older children (9–14 years of age) using read clear speech (Smiljanic & Gilbert, 2017a) and in an interactive problem‐solving task (Hazan, Tuomainen, & Pettinato, 2016; Pettinato et al., 2016). Granlund, Hazan, and Mahon (2018) found that 9‐ to 14‐year‐olds increased their F0 range, the mid‐frequency intensity, and the VSA, but did not decrease the speaking rate when communicating with peers with hearing loss relative to peers with normal hearing. Children also simplified their utterances by decreasing the number of words per phrase but did not use more frequent or more varied lexical items during the interactions with peers with hearing loss. Though sparse, these findings suggest that some mechanisms for listener‐oriented hyper‐articulated speech adaptations emerge early, but that many features of adult‐like adaptations continue to develop into adolescence. Furthermore, the results suggest that children, even by 14 years of age, may not be as adept at aligning their output to the specific needs of the listener or communicative challenge. The findings on children and older adults reveal that they may encounter difficulties when communicating in challenging conditions beyond well‐recognized perceptual problems (Gordon‐ Salant & Fitzgibbons,  1997; Schneider, Daneman, & Pichora‐Fuller, 2002; Johnson,  2000). Their own speech may not be understood well in the environments, such as classrooms and hospitals, in which noise is ubiquitous and which require speaking clearly.

CS by nonnative and clinical populations

Nonnative speakers encounter qualitatively different difficulties from older adults with hearing loss in that their access to the speech signal and cognitive resources are intact but they are less experienced in processing the linguistic code of the target language.

Examining CS vowel intelligibility for Spanish learners of English, Rogers, DeMasi, and Krause (2010) found that early learners provided a CS benefit similar to that of native English talkers, while late learners produced the smallest CS benefit for native English listeners. The late learners even exhibited a decrease in CS intelligibility for the vowel /ɪ/ in "bid." Looking at a variety of CS acoustic‐phonetic enhancements, Granlund, Hazan, and Baker (2012) found that Finnish–English late bilinguals used similar signal‐augmenting CS strategies (increased F0 mean and energy between 1 and 3 kHz) in both languages, which were also similar to those made by native monolingual English talkers. At the segmental level, though, the bilinguals showed mixed patterns: they modified the stop voicing contrasts in a manner consistent with the language spoken, but the vowels in both languages were enhanced similarly, reflecting a lack of familiarity with the language‐specific hyper‐articulation targets. Smiljanic and Bradlow (2011) examined sentence intelligibility for high‐proficiency L2 learners and found that they were successful in increasing intelligibility significantly through CS modifications for both native and nonnative listeners. The degree of CS intelligibility benefit for some of the nonnative talkers was in fact similar to the benefit provided by the native talkers (Smiljanic & Bradlow, 2005). The study also found that, while the conversational‐to‐clear‐speech modifications implemented by nonnative talkers increased intelligibility for native listeners, these modifications did not contribute to lower subjective accentedness ratings for CS compared to conversational sentences, revealing a dissociation between the two perceptual dimensions. Together, these studies show that CS‐related signal enhancement is language independent and is thus readily available to talkers in their L1 and L2. Enhancing phonemic contrasts in a target‐language‐appropriate way, however, requires substantial experience and practice. A detailed understanding of communication strategies that are effective at making speech more intelligible not only is relevant for nonnative speakers but also has implications for clinical populations. Instructions to speak clearly, loudly, slowly, and in response to noise have all been used as behavioral treatment techniques for talkers exhibiting dysarthria associated with PD, ALS, or MS. Dysarthric speech is often characterized by variable speaking rate, reduced vowel space, reduced pitch variation and loudness, and imprecise consonant production, which can lead to speech intelligibility deficits and diminished quality of life (Appleman, Stavitsky, & Cronin‐Golomb, 2011). The goal of the interventions is to reduce or compensate for the underlying deficits and to maximize speech intelligibility and naturalness (Beukelman et al., 2002; Hustad & Weismer, 2007; Sadagopan & Huber, 2007; Park et al., 2016). When instructed to speak clearly, talkers with dysarthria increased VSA, vocal intensity, and mean F0 and decreased speaking rate (Goberman & Elmer, 2005; Tjaden, Lam, & Wilding, 2013; Tjaden, Kain, & Lam, 2014; Tjaden, Sussman, & Wilding, 2014).
Similarly to healthy adults, talkers with PD also produced different degrees of modification in response to the varied CS instructions (Lam & Tjaden, 2016) and increased vocal effort in response to masking noise (Stathopoulos, Huber, & Sussman, 2011), although the responses differed between the control participants and individuals with PD (Darling & Huber, 2011; Adams et al., 2006).

Even talkers without diagnosed speech deficits can have compromised intelligibility. Yi, Smiljanic, and Chandrasekaran (2019) showed that, compared to talkers with low depressive symptoms, talkers classified as having high depressive symptoms (based on a self‐report scale: Radloff, 1977) produced smaller conversational‐to‐clear‐speech modifications for speaking rate, energy in the 1 to 3 kHz range, F0 mean, and F0 range. These modifications led to a smaller CS intelligibility benefit. The findings on variability in CS production discussed in this section strengthen evidence that not all listener‐oriented intelligibility‐enhancing CS modifications are the same, but rather that precise acoustic‐articulatory adjustments are made in response to different communicative challenges. Determining to what extent some of these fine‐grained adjustments are under talker control will have important implications for interventions aimed at increasing intelligibility for different talker groups. More work is needed to examine how these modifications expand or diminish over the course of an interaction and how other features of naturally occurring interactions, such as shared knowledge, familiarity with the interlocutor, or use of signal‐complementary visual information, affect this dynamic process. A largely unexplored direction involves looking more holistically at the multiple levels of adaptation, beyond acoustic‐phonetic modifications, that are available to talkers when producing speech in response to various real‐world communication challenges. The reviewed work also shows a range of talker abilities to address communication challenges. The pressing goal is to better understand the cognitive, linguistic, and physiological factors that facilitate speaking style adaptations for a variety of talker groups. To that end, more talkers across different age groups and with varied language proficiencies should be examined. From a practical standpoint, more work is needed to determine which training and treatment protocols can lead to enhanced intelligibility for clinical and nonclinical populations. Evidence is emerging, for example, that intensive treatment focusing on global, speech‐oriented behavioral techniques (e.g. speak clearly), rather than targeting individual speech parameters (e.g. speaking slowly), may be a more effective and feasible clinical intervention for addressing speech difficulties in patients whose speech production is disrupted (Beukelman et al., 2002; Tjaden, Sussman, & Wilding, 2014; Park et al., 2016; Stipancic, Tjaden, & Wilding, 2016). However, more controlled studies are needed to establish a rigorous evidence base for aiding clinical decision making. It is also critical to establish whether improvements in speech production following L2 training or clinical interventions transfer to real‐life communication situations and whether these novel speech patterns are stable over time. Finally, future work should explore how the specific acoustic‐phonetic modifications work to enhance intelligibility in particular listening situations. This is an essential research direction, as evidence based on overcoming communicative challenges in more realistic communication situations can inform speech modification algorithms for improving intelligibility (Cooke et al., 2013a).


Variability in CS perception

The studies discussed have focused on talker and task characteristics in determining speaking style adaptations and their effect on intelligibility. A large body of work has also investigated how listener characteristics shape intelligibility gains. There is abundant evidence that CS enhances intelligibility for a number of listener groups, including adults with sloping high‐frequency sensorineural hearing loss (e.g. Schum, 1996; Ferguson, 2012), adult cochlear implant users (Liu et al., 2004; Ferguson & Lee, 2006), children with learning disabilities (Bradlow, Kraus, & Hayes, 2003), and normal‐hearing listeners identifying the stimuli in noise and/or reverberation (e.g. Payton, Uchanski, & Braida, 1994; Ferguson, 2004; Smiljanic & Bradlow, 2005). The magnitude of the intelligibility benefit, however, varies within and across listener populations. Considering how listener sensory and cognitive characteristics affect signal‐dependent and relatively signal‐independent processing is another crucial component in our understanding of the dynamic talker–listener alignment in successful speech communication. A brief discussion of speech signal modifications aimed at improving intelligibility is included at the end.

Listeners with hearing impairment

Listeners with sensorineural hearing impairment experience a relatively constant reduction in signal audibility and clarity, leading to difficulties in understanding speech. Because of the sensory deficits, many listeners find it challenging to communicate in common listening environments, such as noisy restaurants. Due to the limited access to the spectro‐temporal information, these listeners may not benefit from all CS modifications. While some conversational to clear speech modifications improve overall intelligibility for listeners with hearing loss by promoting signal audibility (Picheny, Durlach, & Braida, 1985; Payton, Uchanski, & Braida, 1994), findings on the effect of phonemic enhancements on vowel and consonant recognition are mixed. Ferguson and Kewley‐Port (2002), for instance, found that older adults with hearing loss, unlike normal‐hearing listeners, were not able to use CS spectral vowel enhancements produced by one talker in the study, presumably due to the F2 increases into the high frequency regions. In contrast, Ferguson (2012) found a substantial CS intelligibility benefit for vowels produced by 41 talkers for older adults with sloping sensorineural hearing loss. Importantly, even though modifications in vowel duration, F1 and F2, and dynamic formant movement significantly contributed to the CS benefit, the young and older adult listeners weighed vowel duration and steady‐state formant information differently (Ferguson & Quené, 2014). With regard to consonants, Maniwa, Jongman, and Wade (2009) found that CS modifications enhanced fricative intelligibility for listeners with simulated sloping hearing loss, but the CS benefit for nonsibilant fricatives was reduced compared to normal‐hearing listeners.

Moreover, a shift of energy concentration toward higher frequency regions and greater source strength, which contributed to the CS effect for normal‐hearing listeners, did not help the impaired listeners and even decreased intelligibility for some sounds. These results demonstrate that hearing loss affects the way listeners are able to use and integrate hyper‐articulated acoustic cues to identify vowels, consonants, and words in noise. Cochlear implant (CI) users are also faced with a degraded spectro‐temporal signal due to the limitations of electric hearing and the electrode–nerve interface (see Hunter & Pisoni, Chapter 20, for a more in‐depth consideration of signal‐dependent and signal‐independent issues in speech intelligibility and language comprehension among individuals with hearing loss). Smiljanic and Sladen (2013) showed that CS and sentential context improved word recognition in noise for children with CIs and children with normal hearing. Children with normal hearing, however, benefited more from each source of enhancement (CS acoustic‐phonetic cues and semantic context), separately and in combination, compared to children with CIs, who needed enhanced signal clarity to draw on signal‐independent semantic information.

Language learners

As children's speech‐perception acuity develops into adolescence, their ability to use intelligibility‐enhancing modifications may not be fully adult‐like either. Similar to CS, modifications of speech directed toward infants (IDS) include a slower speaking rate, increased pitch variation, and an enlarged VSA, and have been shown to aid social‐emotional and affective development in young children (e.g. Cristia, 2013), as well as to facilitate the creation of perceptual sound categories, sound discrimination, segmentation, and word learning in young children (Cristia & Seidl, 2014; Kuhl, 2007; Liu, Kuhl, & Tsao, 2003; Schreiner & Mani, 2017; Graf Estes & Hurley, 2013; Song, Demuth, & Morgan, 2010). However, IDS may not be an appropriate intelligibility‐enhancing adjustment as children get older. A few studies that examined school‐aged children's perception have found an intelligibility benefit for adult‐directed CS (Bradlow, Kraus, & Hayes, 2003; Riley & McGregor, 2012; Leone & Levy, 2015). Riley and McGregor (2012), for example, showed improved word recognition in noise for narratives that were produced clearly compared to conversationally, providing evidence of a CS benefit for children's real‐word comprehension. Leone and Levy (2015) found a CS benefit in noise even in the absence of lexical influences, namely for vowel identification in nonsense words. They also found that CS accuracy was higher for front (/ɛ, æ/) than for back vowels (/ɑ, ʌ/). This selective benefit was not observed in the adult data in the authors' preliminary study or in Ferguson and Kewley‐Port's (2002) study, suggesting different use of hyper‐articulated CS cues by children. Finally, Bradlow, Kraus, and Hayes (2003) showed that CS improved sentence recognition in noise for school‐aged children with learning disabilities (LDs), even though the benefit was smaller compared to children with no LDs. The CS intelligibility benefit furthermore increased as the signal‐to‐noise ratio decreased, revealing an important role for this speaking style adjustment in adverse listening conditions for both groups of children. There have also been efforts to assess whether CS can improve intelligibility for another group of language learners, namely nonnative listeners.

Their difficulty is evident at all levels of L2 processing, from perceptual discrimination of sound contrasts and different weighting of acoustic cues to phonotactics and prosody (Best & Tyler, 2007; Cutler, Garcia Lecumberri, & Cooke, 2008; Flege, 1995; Francis, Kaganovich, & Driscoll‐Huber, 2008; Iverson et al., 2003; Kondaurova & Francis, 2008). Reflecting these perceptual difficulties, Bradlow and Bent (2002) found a substantially smaller CS benefit for low‐proficiency nonnative compared to native listeners. The smaller CS benefit may in part be rooted in nonnative listeners' inability to use signal‐independent information, that is, semantic context, in the absence of signal clarity (Bradlow & Alexander, 2007). As nonnative listeners gain expertise in L2 processing, they increasingly manage to attend to and use CS enhancements implemented by native talkers, including signal‐augmenting and language‐specific phonemic modifications (Smiljanic & Bradlow, 2011; Alcorn & Smiljanic, 2018). As discussed earlier, Keerstock and Smiljanic (2018, 2019) showed that nonnative listeners benefited from CS in memory tasks, both recognition and recall of spoken information. The CS benefit for recognition memory was smaller in the cross‐modal condition for nonnative than native listeners, showing the cost of information integration in L2. The evidence shows that CS enhances intelligibility for language learners, though this benefit may not be fully native‐ and adult‐like. The differences in the use of the acoustic‐phonetic enhancements can arise for a number of reasons, including incomplete auditory development in children (Werner, 2007), an incomplete language model and L1 interference in nonnative listeners (Best & Tyler, 2007; Flege, 1995; Costa, Caramazza, & Sebastian‐Galles, 2000), and disproportionate difficulty with speech perception in noise for both groups (Nishi et al., 2010; van Wijngaarden, 2001; van Wijngaarden, Steeneken, & Houtgast, 2002; Bradley & Sato, 2008; Stuart et al., 2006). The findings on variability in CS perception involving language learners and listeners with hearing loss are in line with the previously discussed literature on listening effort, in that the increased difficulty in processing signal‐related auditory information or in mapping onto imperfect representations diminishes the resources that may otherwise be available, for instance, for building up the meaning over the course of the sentences or for memory encoding (McCoy et al., 2005; Rabbitt, 1968, 1990; Rönnberg et al., 2008, 2013; Pichora‐Fuller, Schneider, & Daneman, 1995; Pichora‐Fuller et al., 2016; Zekveld, Kramer, & Festen, 2010, 2011; Peelle, 2018; Van Engen, 2017; DiDonato & Surprenant, 2015). Research examining how perception of CS enhancements interacts with signal‐independent linguistic and cognitive processing in different listener groups is still largely lacking. With a better understanding of how CS facilitates speech processing, more specific information on how to use this speaking style effectively can be incorporated into training, clinical, and educational settings (see Peng & Wang, 2016).

Intelligibility effects of signal modifications

The insights from the production and perception studies have implications for speech signal modifications aimed at improved speech intelligibility, for example, in GPS and customer service systems, hearing aids, and assisted speech devices (Picheny, Durlach, & Braida, 1989; Uchanski et al., 1996; Krause & Braida, 2009; Zeng & Liu, 2006; Godoy, Koutsogiannaki, & Stylianou, 2014; Cooke et al., 2013a, 2013b).

These modifications include features designed to improve signal audibility in noise and to enhance processing of the linguistic information. While listeners use the spectral and temporal adaptations that talkers implement when speaking in response to a communicative challenge, a direct link between any one acoustic‐articulatory modification and increased intelligibility remains poorly understood (Krause, 2001; Krause & Braida, 2002; Picheny, Durlach, & Braida, 1989; Uchanski et al., 1996; Liu & Zeng, 2006; Godoy, Koutsogiannaki, & Stylianou, 2014; Tjaden, Kain, & Lam, 2014). Lu and Cooke (2009), for instance, found that flattening of spectral tilt, but not an increase in F0, enhanced speech intelligibility in speech‐shaped noise. Godoy, Koutsogiannaki, and Stylianou (2014) implemented a NAS‐inspired fixed spectral gain filter to boost spectral energy and loudness in higher frequencies and CS‐inspired frequency warping for vowel space expansion. They found that augmenting loudness significantly enhanced intelligibility,5 even more so than for the naturally produced CS and NAS (see also Cooke et al., 2013b). In contrast, vowel space expansion through frequency warping did not enhance intelligibility, even though an intelligibility benefit was found for the enlarged VSA spontaneously produced in CS. Using hybrid speech stimuli in which acoustic features of one conversational sentence were replaced with the enhanced features of its clearly produced counterpart, Kain, Amano‐Kusumoto, and Hosom (2008) found that intelligibility was improved for a combination of CS short‐term spectra, phoneme sequence, and phoneme duration cues, but not for F0, energy, or pauses. Research in this area has also examined whether modifications to specific acoustic‐phonetic cues related to segmental precision versus respiratory‐phonatory features can enhance intelligibility for talkers with such production deficits (e.g. in dysarthria). Using a speech analysis‐resynthesis paradigm, Tjaden, Sussman, and Wilding (2014) investigated the contribution of segmental and suprasegmental features to intelligibility variation in conversational speech and CS produced by two talkers with Parkinson's disease. Segment durations, short‐term spectrum, energy characteristics, and F0 were extracted from CS sentences and applied to their conversational counterparts. The results for one talker revealed that the intelligibility benefit was linked to adjustments in the short‐term spectrum and duration, while energy characteristics were linked to the CS benefit for the other talker. Segmental modifications through VSA adjustments did not increase intelligibility for either talker. The evidence from this line of inquiry shows that it is possible to create a signal with greater intelligibility than the original conversational speech. However, a striking disparity in the findings shows that no single signal modification adequately approximates the intelligibility gains found in naturalistic speech. This suggests that it is likely a combination of different cue modifications, employed jointly by talkers, that gives rise to the intelligibility benefit. Furthermore, as discussed, the predictive power of acoustic‐phonetic features differs for different masker types, clear speech elicitation techniques, talkers, and listeners.

Future work thus needs to strengthen our understanding of the relationship between acoustic‐articulatory strategies and intelligibility gains for various talker–listener pairs and communicative situations. More research is needed to establish objective measures of speech intelligibility in automatic speech enhancement systems.
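To make the kind of fixed spectral‐gain manipulation evaluated in this literature concrete, the sketch below boosts energy in the 1–3 kHz band and then renormalizes overall level. It is a minimal illustration written in Python with NumPy and SciPy; it is not the algorithm of Godoy, Koutsogiannaki, and Stylianou (2014) or of any other system cited here, and the filter order and gain value are arbitrary choices.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def boost_midband(x, fs, gain_db=6.0, lo=1000.0, hi=3000.0):
    """Apply a fixed gain to the 1-3 kHz band of signal x (sampled at fs Hz),
    then rescale so the output has the same RMS as the input, keeping the
    comparison at equal overall level (cf. note 5 below)."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)                       # zero-phase 1-3 kHz component
    gain = 10.0 ** (gain_db / 20.0)
    y = x + (gain - 1.0) * band                      # boost only the band component
    y *= np.sqrt(np.mean(x ** 2) / np.mean(y ** 2))  # equalize RMS with the input
    return y
```

Whether such a manipulation actually improves keyword recognition in a given noise condition can only be established with the kinds of listening tests reviewed above; the sketch is meant only to make the nature of the manipulation explicit.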

Conclusion

The research reviewed here shows a robust CS intelligibility benefit for a variety of talkers, listeners, and communication challenges. Our understanding of how talker‐, listener‐, and signal‐related factors affect intelligibility variation, including in more naturalistic settings, has increased significantly since the turn of the twenty‐first century. However, a number of questions regarding the links between the signal‐dependent and relatively signal‐independent processing benefits of CS remain open and will likely shape the research agenda for the years to come. The review noted the paucity of research on how CS contributes to the segmentation of the speech signal, lexical access, semantic prediction, and improved memory retention. Similarly, more work is needed to better understand cognitive effort and how cognitive resources such as selective attention, working memory, and motivation are allocated during CS perception. Combining insights from speech intelligibility research with psycholinguistic approaches, physiological measures, and work on signal processing could prove a fruitful strategy for a more comprehensive understanding of the mechanisms underlying CS‐enhanced intelligibility. This multifaceted approach could elucidate how CS facilitates speech and linguistic processing at multiple levels, resulting in improved communicative effectiveness (see Figure 7.1). Such discoveries would guide the development of better, more targeted training paradigms and treatment protocols for individuals for whom intelligibility is compromised and for whom communication in adverse situations is particularly challenging. Finally, the findings could help improve signal‐processing algorithms for nonlinear hearing aids and for a variety of contexts in which humans interact with machines.

NOTES

1 Note that this review focuses on conversational to clear speech modifications in the acoustic‐phonetic domain only. When attempting to overcome a communication barrier, talkers will have at their disposal a number of different strategies, such as simplifying sentence structure, using shorter and more frequent words, and repeating information. More work is needed to determine how these different strategies contribute to ease of processing and improved speech intelligibility.
2 McMurray and colleagues (2013) also found substantially more variation in vowel formants in IDS compared to adult‐directed speech, which suggests that, as with CS, the IDS adaptations may not enhance phonetic discrimination via their distributional properties.

3 See also Buz, Tanenhaus, and Jaeger (2016) for voice onset time (VOT) hyper‐articulation in response to interlocutor feedback.
4 Here, cognitive load and listening effort are used to refer to “the extent to which the demands imposed by the task at a given moment consume the resources available to maintain successful task execution” (Pichora‐Fuller et al., 2016, p. 12S) and “the allocation of cognitive resources to overcome obstacles or challenges to achieving listening‐oriented goals” (Francis & Love, 2019, p. 3), respectively. A distinction between listening/cognitive demand and listening effort is also discussed in Peelle (2018).
5 Note that most intelligibility studies equate the loudness of conversational and clearly spoken stimuli, so the resulting intelligibility gain is due to the concomitant acoustic‐phonetic modifications. In everyday communication, however, listeners may benefit from the increased CS loudness naturally produced by talkers.

REFERENCES Adams, S., Moon, B.‐H., Page, A., & Jog, M. (2006). Effects of multitalker noise on conversational speech intensity in Parkinson’s disease. Journal of Medical Speech‐Language Pathology, 14(4), 221–228. Adams, S., Winnell, J., & Jog, M. (2010). Effects of interlocutor distance, multi‐ talker background noise, and a concurrent manual task on speech intensity in Parkinson’s disease. Journal of Medical Speech‐Language Pathology, 18(4), 1–8. Alcorn, S., & Smiljanic, R. (2018). Effect of proficiency on perception of English nasal stops by native speakers of Brazilian Portuguese. Journal of the Acoustical Society of America, 144(3), 1726. Appleman, E., Stavitsky, K., & Cronin‐ Golomb, A. (2011). Relation of subjective quality of life to motor symptom profile in Parkinson’s disease. Parkinson’s Disease, 2011, 472830. Baker, R. E., & Bradlow, A. R. (2009). Variability in word duration as a function of probability, speech style, and prosody. Language and Speech, 52(4), 391–413. Best, C. T., & Tyler, M. D. (2007). Nonnative and second‐language speech perception: Commonalities and complementarities. In O.‐S. Bohn & M. J. Munro (Eds),

Language learning and language teaching (Vol. 17, pp. 13–34). Amsterdam: John Benjamins. Beukelman, D. R., Fager, S., Ullman, C., et al. (2002). The impact of speech supplementation and clear speech on the intelligibility and speaking rate of people with traumatic brain injury. Journal of Medical Speech‐Language Pathology, 10, 237–242. Borghini, G., & Hazan,V. (2018). Listening effort during sentence processing is increased for non‐native listeners: A pupillometry study. Frontiers in Neuroscience, 12, 152. Bradley, J. S., & Sato, H. (2008). The intelligibility of speech in elementary school classrooms. Journal of the Acoustical Society of America, 123, 2078–2086. Bradlow, A. R. (2002). Confluent talker‐ and listener‐related forces in clear speech production. In C. Gussenhoven & N. Warner (Eds), Laboratory phonology (Vol. 7, pp. 241–273). Berlin: Mouton de Gruyter. Bradlow, A. R., & Alexander, J. (2007). Semantic‐contextual and acoustic‐ phonetic enhancements for English sentence‐in‐noise recognition by native

Clear Speech  197 and non‐native listeners. Journal of the Acoustical Society of America, 121(4), 2339–2349. Bradlow, A. R., & Bent, T. (2002). The clear speech effect for non‐native listeners. Journal of the Acoustical Society of America, 112, 272–284. Bradlow, A. R., Kraus, N., & Hayes, E. (2003). Speaking clearly for children with learning disabilities: Sentence perception in noise. Journal of Speech, Language, and Hearing Research, 46, 80–97. Buz, E., Tanenhaus, M. K., & Jaeger, T. F. (2016). Dynamically adapted context‐ specific hyper‐articulation: Feedback from interlocutors affects speakers’ subsequent pronunciations. Journal of Memory and Language, 89, 68–86. Cooke, M., King, S., Garnier, M., & Aubanel, V. (2013a). The listening talker: A review of human and algorithmic context‐induced modifications of speech. Computer Speech and Language, 28(2), 543–571. Cooke, M., & Lu, Y. (2010). Spectral and temporal changes to speech produced in the presence of energetic and informational maskers. Journal of the Acoustical Society of America, 128(4), 2059–2069. Cooke, M., Mayo, C., Valentini‐Botinhao, C., et al. (2013b). Evaluating the intelligibility benefit of speech modifications in known noise conditions. Speech Communication, 55, 572–585. Costa, A., Caramazza, A., & Sebastian‐ Galles, N. (2000). The cognate facilitation effect: Implication for models of lexical access. Journal of Experimental Psychology: Learning, Memory, and Cognition, 26, 1283–1296. Cristia, A. (2013). Input to language: The phonetics and perception of infant‐ directed speech. Language and Linguistics Compass, 7(3), 157–170. Cristia, A., & Seidl, A. (2014). The hyperarticulation hypothesis of infant‐ directed speech. Journal of Child Language, 41(4), 913–934.

Cutler, A., Garcia Lecumberri, M. L., & Cooke, M. (2008). Consonant identification in noise by native and non‐native listeners: Effects of local context. Journal of the Acoustical Society of America, 124, 1264–1268. Darling, M., & Huber, J. E. (2011). Changes to articulatory kinematics in response to loudness cues in individuals with Parkinson’s disease. Journal of Speech, Language, and Hearing Research, 54, 1247–1259. Davis, C., & Kim, J., (2012). Is speech produced in noise more distinct and/or consistent? Speech Science Technology, 46–49. DiDonato, R. M., & Surprenant, A. M. (2015). Relatively effortless listening promotes understanding and recall of medical instructions in older adults. Frontiers in Psychology, 6, art. 778. Ferguson, S. H. (2004). Talker differences in clear and conversational speech: Vowel intelligibility for normal‐hearing listeners. Journal of the Acoustical Society of America, 116, 2365–2373. Ferguson, S. H. (2012). Talker differences in clear and conversational speech: Vowel intelligibility for older adults with hearing loss. Journal of Speech, Language, and Hearing Research, 55, 779–790. Ferguson, S. H., & Kewley‐Port, D. (2002). Vowel intelligibility in clear and conversational speech for normal‐ hearing and hearing‐impaired listeners. Journal of the Acoustical Society of America, 112, 259–271. Ferguson, S. H., & Kewley‐Port, D. (2007). Talker differences in clear and conversational speech: Acoustic characteristics of vowels. Journal of Speech, Language, and Hearing Research, 50(5), 1241–1255. Ferguson, S. H., & Lee, J. (2006). Vowel intelligibility in clear and conversational speech for cochlear implant users: A preliminary study. Journal of the Academy of Rehabilitative Audiology, 39, 1–16.

198  Perception of Linguistic Properties Ferguson, S. H., & Quené, H. (2014). Acoustic correlates of vowel intelligibility in clear and conversational speech for young normal‐hearing and elderly hearing‐impaired listeners. Journal of the Acoustical Society of America, 135(6), 3570–3584. Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross‐ language research (pp. 233–277). York, UK: York Press. Francis, A. L., Kaganovich, N., & Driscoll‐ Huber, C. (2008). Cue‐specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English. Journal of the Acoustical Society of America, 124(2), 1234–1251. Francis, A. L., & Love, J. (2019). Listening effort: Are we measuring cognition, affect, or both? WIREs Cognitive Science, e1514. Gagné, J.‐P., Masterson, V., Munhall, K. G., et al. (1994). Across talker variability in auditory, visual, and audiovisual speech intelligibility for conversational and clear speech. Journal of the Academy of Rehabilitative Audiology, 27, 135–158. Gagné, J.‐P., Querengesser, C., Folkeard, P., et al. (1995). Auditory, visual, and audiovisual speech intelligibility for sentence‐length stimuli: An investigation of clear and conversational speech. Volta Review, 97, 33–51. Gagné, J.‐P., Rochette, A.‐J., & Charest, M. (2002). Auditory, visual and audiovisual clear speech. Speech Communication, 37, 213–230. Gilbert, R., Chandrasekaran, B., & Smiljanic, R. (2014). Recognition memory in noise for speech of varying intelligibility. Journal of the Acoustical Society of America, 135(1), 389–399. Goberman, A. M., & Elmer, L. W. (2005). Acoustic analysis of clear versus conversational speech in individuals

with Parkinson disease. Journal of Communication Disorders, 38, 215–230. Godoy, E., Koutsogiannaki, M., & Stylianou, Y. (2014). Approaching speech intelligibility enhancement with inspiration from Lombard and Clear speaking styles. Computer Speech and Language, 28(2), 629–647. Gordon‐Salant, S., & Fitzgibbons, P. J. (1997). Selected cognitive factors and speech recognition performance among young and elderly listeners. Journal of Speech, Language, and Hearing Research, 40(2), 423–431. Goy, H., Pichora‐Fuller, M. K., & van Lieshout, P. (2013). Effects of intra‐talker differences on speech understanding in noise by younger and older adults. Canadian Acoustics, 41(2), 23–30. Graf Estes, K., & Hurley, K. (2013). Infant‐ directed prosody helps infants map sounds to meanings. Infancy, 18(5), 797–824. Granlund, S., Hazan, V., & Baker, R. (2012). An acoustic‐phonetic comparison of the clear speaking styles of late Finnish– English bilinguals. Journal of Phonetics, 40, 509–520. Granlund, S., Hazan, V., & Mahon, M. (2018). Children’s acoustic and linguistic adaptations to peers with hearing impairment. Journal of Speech, Language, and Hearing Research, 61(5), 1055–1069. Hazan, V., & Baker, R. (2011). Acoustic‐ phonetic characteristics of speech produced with communicative intent to counter adverse listening conditions. Journal of the Acoustical Society of America, 130(4), 2139–2152. Hazan, V., Grynpas, J., & Baker, R. (2012). Is clear speech tailored to counter the effect of specific adverse listening conditions? Journal of the Acoustical Society of America, 132(5), EL371–377. Hazan, V., & Markham, D. (2004). Acoustic‐ phonetic correlates of talker intelligibility for adults and children. Journal of the

Clear Speech  199 Acoustical Society of America, 116, 3108–3118. Hazan, V., Tuomainen, O., Kim, J., et al. (2018a). Clear speech adaptations in spontaneous speech produced by young and older adults. Journal of the Acoustical Society of America, 144, 1331–1346. Hazan, V., Tuomainen, O., & Pettinato, M. (2016). Suprasegmental characteristics of spontaneous speech produced in good and challenging communicative conditions by talkers aged 9 to 14 years old. Journal of Speech, Language, and Hearing Research, 59, S1596–1607. Hazan, V., Tuomainen, O., Tu, L., et al. (2018b). How do aging and age‐related hearing loss affect the ability to communicate effectively in challenging communicative conditions? Hearing Research, 369, 33–41. Helfer, K. S. (1997). Auditory and auditory‐ visual perception of clear and conversational speech. Journal of Speech, Language, and Hearing Research, 40, 432–443. Huber, J. E. & Darling, M. (2011). Effect of Parkinson’s disease on the production of structured and unstructured speaking tasks: Respiratory physiologic and linguistic considerations. Journal of Speech, Language and Hearing Research, 54(1), 33–46. Hustad, K. C., & Weismer, G. (2007). Interventions to improve intelligibility and communicative success for speakers with dysarthria. In G. Weismer (Ed.), Motor speech disorders (pp. 217–228). San Diego, CA: Plural. Huttunen, K., Keränen, H., Väyrynen, E., et al. (2011). Effect of cognitive load on speech prosody in aviation: Evidence from military simulator flights. Applied Ergonomics, 42, 348–357. Iverson, P., Kuhl, P. K., Akahane‐Yamada, R., et al. (2003). A perceptual interference account of acquisition difficulties for

non‐native phonemes. Cognition, 87(1), B47–57. Johnson, C. E. (2000). Children’s phoneme identification in reverberation and noise. Journal of Speech Language and Hearing Research, 43, 144–157. Johnson, E. K., Lahey, M., Ernestus, M., & Cutler, A. (2013). A multimodal corpus of speech to infant and adult listeners. Journal of the Acoustical Society of America, 134(6), EL534–540. Junqua, J. C. (1993). The Lombard reflex and it role on human listener and automatic speech recognizers. Journal of the Acoustical Society of America, 93(1), 510–524. Kain, A., Amano‐Kusumoto, A., & Hosom, J.‐P. (2008). Hybridizing conversational and clear speech to determine the degree of contribution of acoustic features to intelligibility. Journal of the Acoustical Society of America, 124, 2308–2319. Kang, K.‐H., & Guion, S. G. (2008). Clear speech production of Korean stops: Changing phonetic targets and enhancement strategies. Journal of the Acoustical Society of America, 124(6), 3909–3917. Keerstock, S., & Smiljanic, R. (2018). Effects of intelligibility on within‐ and cross‐ modal sentence recognition memory for native and non‐native listeners. Journal of the Acoustical Society of America, 144(5), 2871–2881. Keerstock, S., & Smiljanic, R. (2019). Speaking clearly improves listeners’ recall. Journal of the Acoustical Society of America, 146(6), 4604–4610. Kim, J., & Davis, C. (2014). Comparing the consistency and distinctiveness of speech produced in quiet and in noise. Computer Speech and Language, 28(2), 598–606. Kim, J., Sironic, A., & Davis, C. (2011). Hearing speech in noise: Seeing a loud talker is better. Perception, 40(7), 853–862. Kondaurova, M. V., & Francis, A. L. (2008). The relationship between native allophonic experience with vowel

200  Perception of Linguistic Properties duration and perception of the English tense/lax vowel contrast by Spanish and Russian listeners. Journal of the Acoustical Society of America, 124(6), 3959–3971. Kornak, J. (2018). Are your patients really hearing you? ASHA Leader. Krause, J. C. (2001). Properties of naturally produced clear speech at normal rates and implications for intelligibility enhancement. Unpublished doctoral dissertion, MIT, Cambridge, MA. Krause, J. C., & Braida, L. D. (2002). Investigating alternative forms of clear speech: The effects of speaking rate and speaking mode on intelligibility. Journal of the Acoustical Society of America, 112, 2165–2172. Krause, J. C., & Braida, L. D. (2004). Acoustic properties of naturally produced clear speech at normal speaking rates. Journal of the Acoustical Society of America, 115, 362–378. Krause, J. C., & Braida, L. D. (2009). Evaluating the role of spectral and envelope characteristics in the intelligibility advantage of clear speech. Journal of the Acoustical Society of America, 125(5), 3346–3357. Kuhl, P. K. (2007). Is speech learning “gated” by the social brain? Developmental Science, 10(1), 110–120. Kuhl, P. K., Andruski, J. E., Chistovich, I. A., et al. (1997). Cross‐language analysis of phonetic units in language addressed to infants. Science, 277(5326), 684–686. Lam, J., & Tjaden, K. (2013). Intelligibility of clear speech: Effect of instruction. Journal of Speech, Language, and Hearing Research, 56, 1429–1440. Lam, J., & Tjaden, K. (2016). Clear speech variants: An acoustic study in Parkinson’s disease. Journal of Speech, Language, and Hearing Research, 59, 631–646. Lam, J., Tjaden, K., & Wilding, G. (2012). Acoustics of clear speech: Effect of instruction. Journal of Speech, Language, and Hearing Research, 55, 1807–1821.

Lansford, K., Liss, J., Caviness, J., & Utianski, R. (2011). A cognitive‐ perceptual approach to conceptualizing speech intelligibility deficits and remediation practice in hypokinetic dysarthria. Parkinson’s Disease, art. 150962. Leone, D., & Levy, E. (2015). Children’s perception of conversational and clear American‐English vowels in noise. Journal of Speech, Language, and Hearing Research, 58, 213–226. Leung, K. K., Jongman, A., Wang, Y., & Sereno, J. A. (2016). Acoustic characteristics of clearly spoken English tense and lax vowels. Journal of the Acoustical Society of America, 140, 45–58. Lindblom, B. (1990). Explaining phonetic variation: A sketch of the H & H theory. In W. J. Hardcastle & A. Marchal (Eds), Speech production and speech modelling (pp. 403–439). Dordrecht: Kluwer Academic. Liu, H. M., Kuhl, P. K., & Tsao, F. M. (2003). An association between mothers’ speech clarity and infants’ speech discrimination skills. Developmental Science, 6(3), F1–10. Liu, S., Del Rio, E., Bradlow, A. R., & Zeng, F. G. (2004). Clear speech perception in acoustic and electric hearing. Journal of the Acoustical Society of America, 116, 2374–2383. Liu, S., & Zeng, F.‐G. (2006). Temporal properties in clear speech perception. Journal of the Acoustical Society of America, 120(1). 424–432. Lombard, E. (1911). Le signe de l’elevation de la voix [The sign of the rise in the voice]. Annales des Maladies de l’Oreille, du Larynx, du Nez et du Pharynx, 37, 101–119. Lu, Y., & Cooke, M. (2008). Speech production modifications produced by competing talkers, babble, and stationary noise. Journal of the Acoustical Society of America, 124(5), 3261–3275. Lu, Y., & Cooke, M. (2009). The contribution of changes in F0 and spectral tilt to increased intelligibility of speech

Clear Speech  201 produced in noise. Speech Communication, 51(12), 1253–1262. Lubinski, R. (2010). Communicating effectively with elders and their families. The ASHA Leader, 15, 12–15. Maniwa, K., Jongman, A., & Wade, T. (2009). Acoustic characteristics of clearly spoken English fricatives. Journal of the Acoustical Society of America, 125, 3962–3973. McCoy, S. L., Tun, P. A., Cod, L. C., et al. (2005). Hearing loss and perceptual effort: Downstream effects on older adults’ memory for speech. Quarterly Journal of Experimental Psychology, 58A, 22–33. McMurray, B., Kovack‐Lesh, K. A., Goodwin, D., & McEchron, W. (2013). Infant directed speech and the development of speech perception: Enhancing development or an unintended consequence? Cognition, 129(2), 362–378. Moon, S.‐J., & B. Lindblom. 1994. Interaction between duration, context, and speaking style in English stressed vowels. Journal of the Acoustical Society of America, 96, 40–55. Nishi, K., Lewis, D. E., Hoover, B. M., et al. (2010). Children’s recognition of American English consonants in noise. Journal of the Acoustical Society of America, 127, 3177–3188. Park, S., Theodoros, D., Finch, E., & Cardell, E. (2016). Be Clear: A new intensive speech treatment for adults with nonprogressive dysarthria. Journal of Speech, Language, and Hearing Research, 25, 97–110. Payton, K. L., Uchanski, R. M., & Braida, L. D. (1994). Intelligibility of conversational and clear speech in noise and reverberation for listeners with normal and impaired hearing. Journal of the Acoustical Society of America, 95, 1581–1592. Peelle, J. E. (2018). Listening effort: How the cognitive consequences of acoustic

challenge are reflected in brain and behavior. Ear and Hearing, 39(2), 204–214. Peng, Z., & Wang, L. M. (2016). Effects of noise, reverberation and foreign accent on native and non‐native listeners’ performance of English speech comprehension. Journal of the Acoustical Society of America, 139, 2772–2783. Perkell, J. S., Zandipour, M., Matthies, M. L., & Lane, H. (2002). Economy of effort in different speaking conditions. I. A prelimi‐ nary study of intersubject differences and modeling issues. Journal of the Acoustical Society of America, 112(4), 1627–1641. Pettinato, M., Tuomainen, O., Granlund, S., & Hazan, V. (2016). Vowel space area in later childhood and adolescence: Effects of age, sex and ease of communication. Journal of Phonetics, 54, 1–14. Picheny, M. A., Durlach, N. I., & Braida, L. D. (1985). Speaking clearly for the hard of hearing I: Intelligibility differences between clear and conversational speech. Journal of Speech and Hearing Research, 28, 96–103. Picheny, M. A., Durlach, N. I., & Braida, L. D. (1986). Speaking clearly for the hard of hearing II. Acoustic characteristics of clear and conversational speech. Journal of Speech, Language, and Hearing Research, 29, 434–446. Picheny, M. A., Durlach, N. I., & Braida, L. D. (1989). Speaking clearly for the hard of hearing III: An attempt to determine the contribution of speaking rate to difference in intelligibility between clear and conversational speech. Journal of Speech, Language, and Hearing Research, 32, 600–603. Pichora‐Fuller, M. K., Goy, H., & van Lieshout, P. (2010). Effect on speech intelligibility of changes in speech production influenced by instructions and communication environments. Seminars in Hearing, 31, 77–94. Pichora‐Fuller, M. K., Kramer, S. E., Eckert, M. A., et al. (2016). Hearing impairment

202  Perception of Linguistic Properties and cognitive energy: The framework for understanding effortful listening (FUEL). Ear and Hearing, 37, 5S–27S. Pichora‐Fuller, M. K., Schneider, B. A., & Daneman, M. (1995). How young and old adults listen to and remember speech in noise. Journal of the Acoustical Society of America, 97, 593–608. Rabbitt, P. M. A. (1968). Channel capacity, intelligibility and immediate memory. Quarterly Journal of Experimental Psychology, 20, 241–248. Rabbitt, P. M. A. (1990). Mild hearing loss can cause apparent memory failures which increase with age and reduce with IQ. Acta Oto‐Laryngologica, 111(S), 167–176. Radloff, L. S. (1977). The CES‐D scale: A self‐report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. Redford, M. A., & Gildersleeve‐Neumann, C. E. (2009). The development of distinct speaking styles in preschool children. Journal of Speech, Language, and Hearing Research, 52(6), 1434–1448. Riley, K. G., & McGregor, K. K. (2012). Noise hampers children’s expressive word learning. Language, Speech, and Hearing in Schools, 43, 325–337. Rogers, C. L., DeMasi, T. M., & Krause, J. C. (2010). Conversational and clear speech intelligibility of ∕bVd∕ syllables produced by native and non‐native English speakers. Journal of the Acoustical Society of America, 128(1), 410–423. Rönnberg, J., Lunner, T., Zekveld, A., et al. (2013). The ease of language understanding (ELU) model: Theoretical, empirical, and clinical advances. Frontiers in Systems Neuroscience, 7, 31. Rönnberg, J., Rudner, M., Foo, C., & Lunner, T. (2008). Cognition counts: A working memory system for ease of language understanding (ELU). International Journal of Audiology, 47(S2), S99–105. Sadagopan, N., & Huber, J. (2007). Effects of loudness cues on respiration in

individuals with Parkinson’s disease. Movement Disorders, 22, 651–659. Scarborough, R., & Zellou, C. G. C. (2013). Clarity in communication: “Clear” speech authenticity and lexical neighborhood density effects in speech production and perception. Journal of the Acoustical Society of America, 134, 3793–3807. Schneider, B. A., Daneman, M., & Pichora‐ Fuller, M. K. (2002). Listening in aging adults: From discourse comprehension to psychoacoustics. Canadian Journal of Experimental Psychology, 56, 139–152. Schneider, E. N., Bernarding, C., Francis, A. L., et al. (2019). A quantitative model of listening related fatigue. In 2019 9th international IEEE/EMBS conference on neural engineering (NER) (pp. 619–622). New York: IEEE. Schreiner, M. S., & Mani, N. (2017). Listen up! Developmental differences in the impact of IDS on speech segmentation. Cognition, 160, 98–102. Schum, D. J. (1996). Intelligibility of clear and conversational speech of young and elderly talkers. Journal of the American Academy of Audiology, 7, 212–218. Smiljanic, R. (2013). Can older adults enhance the intelligibility of their speech? Journal of the Acoustical Society of America, 133(2), EL129–135. Smiljanic, R., & Bradlow, A. R. (2005). Production and perception of clear speech in Croatian and English. Journal of the Acoustical Society of America, 118, 1677–1688. Smiljanic, R., & Bradlow, A. (2008a). Stability of temporal contrasts across speaking styles in English and Croatian. Journal of Phonetics, 36(1), 91–113. Smiljanic, R., & Bradlow, A. (2008b). Temporal organization of English clear and plain speech. Journal of the Acoustical Society of America, 124(5), 3171–3182. Smiljanic, R., & Bradlow, A. R. (2009). Speaking and hearing clearly: Talker and listener factors in speaking style changes.

Clear Speech  203 Language and Linguistics Compass, 3, 236–264. Smiljanic, R., & Bradlow, A. R. (2011). Bidirectional clear speech perception benefit for native and high‐proficiency non‐native talker‐ listeners: Intelligibility and accentedness. Journal of the Acoustical Society of America, 130(6), 4020–4031. Smiljanic, R., & Gilbert, R. (2017a). Acoustics of clear and noise‐adapted speech in children, young, and older adults. Journal of Speech, Language, and Hearing Research, 60(11), 3081–3096. Smiljanic, R., & Gilbert, R. (2017b). Intelligibility of noise‐adapted and clear speech in child, young adult, and older adult talkers. Journal of Speech, Language, and Hearing Research, 60(11), 3069–3080. Smiljanic, R., Shaft, S., Chandrasekaran, B., & Shafiro, V. (2013). Effect of speech clarity on perception of interrupted meaningful and anomalous sentences. Journal of the Acoustical Society of America, 133(5), 3388–3392. Smiljanic, R., & Sladen, D. (2013). Acoustic and semantic enhancements for children with cochlear implants. Journal of Speech, Language, and Hearing Research, 56, 1085–1096. Song, J. Y. (2017). The use of ultrasound in the study of articulatory properties of vowels in clear speech. Clinical Linguistics & Phonetics, 31(5), 351–374. Song, J. Y., Demuth, K., & Morgan, J. (2010). Effects of the acoustic properties of infant‐directed speech on infant word recognition. Journal of the Acoustical Society of America, 128(1), 389–400. Stathopoulos, E., Huber, J., & Sussman, J. (2011). Changes in acoustic characteristics of the voice across the life span: Measures from individuals 4–93 years of age. Journal of Speech, Language, and Hearing Research, 54, 1011–1021. Stein, B., & Meredith, M. (1993). The merging of the senses. Cambridge, MA: MIT Press. Stent, A., Huffman, M., & Brennan, S. (2008). Adapting speaking after evidence

of misrecognition: Local and global hyperarticulation. Speech Communication, 50, 163–178. Stipancic, K., Tjaden, K., & Wilding, G. (2016). Comparison of intelligibility measures for adults with Parkinson’s disease, adults with multiple sclerosis, and healthy controls. Journal of Speech, Language, and Hearing Research, 59, 230–238. Stuart, A., Givens, G. D., Walker, L. J., & Elangovan, S. (2006). Auditory temporal resolution in normal‐hearing preschool children revealed by word recognition in continuous and interrupted noise. Journal of the Acoustical Society of America, 119, 1946–1949. Summers, W. V., Pisoni, D. B., Bernacki, R. H., et al. (1988). Effects of noise on speech production: acoustic and perceptual analyses. Journal of the Acoustical Society of America, 84(3), 917–928. Syrett, K., & Kawahara, S. (2013). Production and perception of listener‐ oriented clear speech in child language. Journal of Child Language, 41(6), 1373–1389. Tang, L. Y., Hannah, B., Jongman, A., et al. (2015). Examining visible articulatory features in clear and plain speech. Speech Commununication, 75, 1–13. Tasko, S. M., & Greilick, K. (2010). Acoustic and articulatory features of diphthong production: A speech clarity study. Journal of Speech, Language, and Hearing Research, 53(1), 84–99. Tjaden, K., Kain, A., & Lam, J. (2014). Hybridizing conversational and clear speech to investigate the source of increased intelligibility in Parkinson’s disease. Journal of Speech, Language, and Hearing Research, 57, 1191–1205. Tjaden, K., Lam, J., & Wilding, G. (2013). Vowel acoustics in Parkinson’s disease and multiple sclerosis: Comparison of clear, loud, and slow speaking conditions. Journal of Speech, Language, and Hearing Research, 56, 1485–1502.

204  Perception of Linguistic Properties Tjaden, K., Sussman, J. E., & Wilding, G. E. (2014). Impact of clear, loud, and slow speech on scaled intelligibility and speech severity in Parkinson’s disease and multiple sclerosis. Journal of Speech, Language, and Hearing Research, 57, 779–792. Tuomainen, O., Hazan, V., & Romeo, R. (2016). Do talkers produce less dispersed phoneme categories in a clear speaking style? Journal of the Acoustical Society of America, 140(4), EL320–326. Turner, G. S., Tjaden, K., & Weismer, G. (1995). The influence of speaking rate on vowel space and speech intelligibility for individuals with amyotrophic lateral sclerosis. Journal of Speech and Hearing Research, 38, 1001–1013. Uchanski, R. M. (2005). Clear speech. In D. B. Pisoni & R. Remez (Eds), The handbook of speech perception (pp. 207–235). Oxford: Blackwell. Uchanski, R. M., Choi, S. S., Braida, L. D., et al. (1996). Speaking clearly for the hard of hearing IV: Further studies of the role of speaking rate. Journal of Speech and Hearing Research, 39, 494–509. US Department of Health and Human Services. (n.d.). Quick guide to health literacy, from www.health.gov/ communication/literacy/olderadults/ literacy.htm. UCSF Health. (n.d.). Communicating with people with hearing loss, from www. ucsfhealth.org/education/ communicating_with_people_with_ hearing_loss. Uther, M., Knoll, M. A., & Burnham, D. (2007). Do you speak E‐N‐G‐L‐ I‐SH? A comparison of foreigner‐ and infant‐ directed speech. Speech Communication, 49, 2–7. van der Feest, S., Blanco, C., & Smiljanic, R. (2019). Influence of speaking style adaptations and semantic context on time course of word recognition in quiet and in noise. Journal of Phonetics, 73, 158–177.

Van Engen, K. J. (2017). Clear speech and lexical competition in younger and older adult listeners. Journal of the Acoustical Society of America, 142(2), 1067–1077. Van Engen, K., Chandrasekaran, B., & Smiljanic, R. (2012). Effects of speech clarity on recognition memory for spoken sentences. PLOS One, 7(9), e43753. Van Engen, K. J., & Peelle, J. E. (2014). Listening effort and accented speech. Frontiers in Human Neuroscience, 8, 577. Van Engen, K., Phelps, J., Smiljanic, R., & Chandrasekaran, B. (2014). Sentence intelligibility in N‐talker babble: Effects of context, modality, and speaking style. Journal of Speech, Language, and Hearing Research, 57(5), 1908–1918. van Wijngaarden, S. J. (2001). Intelligibility of native and non‐native Dutch speech. Speech Communication, 35, 103–113. van Wijngaarden, S. J., Steeneken, H. J. M., & Houtgast, T. (2002). Quantifying the intelligibility of speech in noise for non‐native listeners. Journal of the Acoustical Society of America, 111, 1906–1916. Wassink, A. B., Wright, R., & Franklin, A. (2007). Intraspeaker variability in vowel production: An investigation of motherese, hyperspeech, and Lombard speech in Jamaican speakers. Journal of Phonetics, 35(3), 363–379. Werner, L.A. 2007. Issues in auditory development. Journal of Communication Disorders, 40, 275–283. Winn, M. B., Edwards, J. R., & Litovsky, R. Y. (2015). The impact of auditory spectral resolution on listening effort revealed by pupil dilation. Ear and Hearing, 36(4), e153–165. Yi, H., Smiljanic, R., & Chandrasekaran, B. (2019). The effect of talker and listener depressive symptoms on speech intelligibility. Journal of Speech, Language, and Hearing Research, 62(12), 4269–4281. Zekveld, A. A., Kramer, S. E., & Festen, J. M. (2010). Pupil response as an

Clear Speech  205 indication of effortful listening: The influence of sentence intelligibility. Ear and Hearing, 31, 480–490. Zekveld, A. A., Kramer, S. E., & Festen, J. M. (2011). Cognitive load during speech perception in noise: The influence of age,

hearing loss, and cognition on the pupil response. Ear and Hearing, 32, 498–510. Zeng, F.‐G., & S. Liu. 2006. Speech perception in individuals with auditory neuropathy. Journal of Speech, Language, and Hearing Research, 49, 367–380.

8  A Comprehensive Approach to Specificity Effects in Spoken‐Word Recognition

CONOR T. MCLENNAN1 AND SARA INCERA2

1 Cleveland State University, United States
2 Eastern Kentucky University, United States


In this chapter, we argue in support of a comprehensive approach to research on specificity effects in spoken‐word recognition. We focus on findings demonstrating the roles that the talker, the speech signal, the listener, and the context play in indexical specificity effects in spoken‐word recognition. We discuss how these aspects of the communicative environment influence the representation and processing of specificity in spoken words. A comprehensive approach to specificity effects is consistent with Mirman’s (2016) position that it is time to expand the definition of spoken‐word recognition. We discuss different theoretical frameworks and new research questions that emerge as researchers embrace a comprehensive approach to investigating spoken‐word recognition.

Comprehensive approach

A comprehensive approach to specificity effects in spoken‐word recognition includes the talker, speech signal, listener, and context, as illustrated in Figure 8.1. When we refer to specificity effects, we are primarily, or at least initially, interested in indexical specificity effects associated with talker variability. However, later in this chapter we will also discuss specificity effects associated with environmental background sounds. Indexical information refers to aspects of the speech signal that reflect who the talker is and how the talker says a particular word. Inter‐talker


Figure 8.1  The spoken‐word recognition environment consists of the talker, the speech signal, the listener, and the context.

variability (different talkers saying the word lion) or intra‐talker variability (the same talker saying the word lion in two different tones of voice) in indexical information does not change the word being spoken. Indexical specificity effects are observed when listeners are more efficient at recognizing words repeated with the same indexical information (the same talker or the same tone of voice) than at recognizing words repeated with different indexical information (a different talker or a different tone of voice). Two major goals for researchers working in the field of spoken‐word recognition are to understand the nature of the representation(s) for each word and how listeners process the incoming signal. Pisoni and Luce (1986) argued that word recognition is affected by activation and competition. After decades of research, Vitevitch and Luce (2016) concluded that “It is now widely accepted that spoken‐ word perception is characterized by two fundamental processes: (a)  multiple activation of similar‐sounding form‐based representations . . . and (b) subsequent competition for recognition among these activated representations” (p. 74). Clearly, recognizing a word involves the perception of linguistic properties. However, spoken‐word recognition is more complex  –  and more interesting  –  than these descriptions might suggest. A complete understanding of how listeners represent and process spoken words requires consideration of additional relationships, such as those between linguistic information and indexical information, and between spoken words, attention, and memory. Indeed, we argue that these relationships should be included in the definition of spoken‐word recognition. Throughout the chapter we identify some remaining empirical questions that emerge from looking at the issue of specificity through a comprehensive lens. These questions are previewed in Table 8.1. We are particularly interested in the roles that the talker, speech signal, listener, and context play in the issue of representational specificity. That is, we aim to understand the extent to which these aspects of the spoken‐word recognition environment influence whether listeners access abstract or specific representations during online processing. Choi, Hu, and Perrachione (2018) provided evidence that talker normalization is an obligatory process (however, see Ambridge, 2019, for a related discussion in language acquisition). Choi and colleagues (2018) also argued that there are important methodological differences between tasks employing phonological decisions and memory paradigms, or decisions based on

Table 8.1  Summary of empirical questions discussed in this chapter

Empirical questions: Are specificity effects . . . (the section where each question is discussed is given in parentheses)

1. greater with or without intervening consolidation with sleep? (Comprehensive approach)
2. greater with an unusual or a more common (stereotypical) voice? (Talker)
3. prevalent in everyday and applied settings? (Talker; Interdisciplinary, basic, and applied research)
4. greater when listeners are trained on a particular person, accent, or both? (Foreign accents and dialectical variations)
5. greater with changes in unfamiliar or familiar dialects? (Foreign accents and dialectical variations)
6. greater in casual or careful speech? (Casual speech)
7. greater when listening to reduced word forms or fully articulated speech? (Spontaneous speech)
8. greater in older or younger adults? (Age)
9. greater in listeners with hearing impairments? (Hearing impairments)
10. greater in abstract than in concrete words? (Meaning)
11. greater in neutral than in emotional words? (Meaning)
12. influenced by the meaning of passages? (Meaning)
13. influenced by the predictability of a word in a sentence? (Sentences)
14. greater in individual words or complete sentences when listening to foreign‐accented speech? (Sentences)
15. greater when the speaker is using an unexpected language? (Social)
16. influenced by the social weighting of particular talkers? (Social)
17. obtained in a long‐term ERP study? (Time‐course hypothesis)
18. influenced by attention at encoding only, or both at encoding and retrieval? (Attention)
19. originating from the lexicon or from a general memory system? (Alternative possibilities)
20. associated with nonspeech (e.g. environmental) sounds more likely to be observed in episodic memory tasks than in spoken‐word recognition tasks? (Alternative possibilities)
21. associated with words more likely to be observed in spoken‐word recognition tasks than in episodic memory tasks? (Alternative possibilities)
22. only observed when listeners are conscious of the repeated co‐occurrence of the linguistic (word) and surface (e.g. indexical) information? (New questions)
23. in speech‐extrinsic noises modulated by processing speed, attention, or both? (New questions)

semantic and word‐level stimulus features. We agree, and we further argue that it is important to distinguish between studies aimed at investigating processing and studies aimed at investigating representations. Processing studies provide important insights into the consequences associated with trial‐to‐trial variability (e.g. changes in talkers). It is well established that there are consequences associated with these changes, such as listeners making more errors or taking longer to correctly recognize words spoken by multiple talkers (e.g. Peters, 1955; Mullennix, Pisoni, & Martin, 1989). However, the interpretation of such consequences is equivocal. Less efficient processing due to trial‐to‐trial changes could indicate that unnecessary details are being discarded as part of a normalization process (Pisoni,  1992) en route to accessing abstract lexical representations. Alternatively, less efficient processing could reflect the time and resources necessary to store these details as part of more specific lexical representations. Processing studies cannot distinguish between these two possibilities, one of which (discarding details) is consistent with abstract representations, and one of which (storing details) is consistent with episodic representations. However, other types of studies are specifically designed to investigate how words are represented. In long‐term repetition priming studies, changes in talker–word co‐occurrences (i.e. a particular talker saying a particular word) take place across two blocks of trials that are separated by at least a couple minutes (thus, exceeding the 15‐ to 30‐second duration of short‐term memory, and thereby are considered as part of long‐term memory). Consequently, studies employing the long‐term repetition‐priming paradigm are ideally suited for investigating the nature of the long‐term representations underlying spoken‐word recognition above and beyond processing effects. Some researchers have tied the representational question of specificity to the processing issue of competition. Using a long‐lag priming experiment in which primes and targets were presented in two separate blocks of trials, Dufour and Nguyen (2017) found that a talker change can reduce competitor priming. In their

210  Perception of Linguistic Properties study, the target words and competitor primes heard by listeners shared all phonemes except the final phoneme. These researchers found that the target words were responded to more slowly following a competitor prime that was spoken by the same talker compared to a competitor prime that was spoken by a different talker. Importantly, the effect of a talker change on competitor priming was observed when the prime block was presented to the listeners five times, but not when the prime block was presented only once. Consequently, the amount of exposure listeners have with a talker appears to affect the role that talker‐specific representations play during spoken‐word recognition. Research on how words enter the mental lexicon could also help to shed light on the consolidation of specific representations. Perhaps talker‐specific representations become stronger and more fully integrated into the lexicon following sleep, just as new words are more fully integrated into the lexicon following sleep. Dumay and Gaskell (2007) observed that sleep is crucial for consolidating newly learned words into the mental lexicon. These researchers found that recently learned words (e.g. shadowks) did not immediately compete with existing words (shadow); instead, competition effects emerged after a 12‐hour interval that included sleep. Although consolidation with sleep may not be necessary for words to be integrated in the lexicon (see Kapnoula & McMurray, 2016), consolidation with sleep may strengthen the effect. Given that sleep can change the mental representations of spoken words, a possible extension of this research is to compare the effects of talker changes with and without intervening sleep. Will talker‐­ specificity effects be greater following intervening consolidation with sleep? (Table 8.1, question 1). Studying the interplay between representations and processes is likely to constrain models of spoken‐word recognition. There are some challenges associated with developing a comprehensive approach to specificity effects in spoken‐word recognition. When reviewing the literature, some important recurring themes emerge, including an increasing emphasis on more naturalistic communication and efforts to integrate previously isolated areas of research. Many spoken‐word recognition findings are based on laboratory studies in which monolingual participants listen to words spoken clearly in carefully controlled quiet experimental settings. A comprehensive approach to specificity effects needs to consider the complete communication environment, including variability at all levels (talker, speech signal, listener, and context). In this section, we discuss how each of the components of the communication environment depicted in Figure  8.1 connects to theoretical and empirical questions regarding specificity effects.
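To make the logic of the long‐term repetition priming paradigm described in this section concrete, the following sketch (in Python) shows how a talker‐specificity effect is typically quantified: priming is the facilitation for repeated words relative to unprimed words, and the specificity effect is the additional facilitation when the talker is also repeated. The reaction times, words, and condition labels below are invented for illustration; this is not the analysis pipeline of any study cited here.

```python
# Illustrative sketch: quantifying a talker-specificity effect in a
# long-term repetition priming design. All values are hypothetical;
# "rt" is a reaction time in milliseconds for one test-block trial.

trials = [
    {"word": "lion",  "condition": "same_talker",      "rt": 612},
    {"word": "house", "condition": "different_talker", "rt": 655},
    {"word": "chair", "condition": "unprimed",         "rt": 690},
    # ... a real experiment would include many trials per condition
]

def mean_rt(condition):
    # Average RT for all test-block trials in a given condition.
    rts = [t["rt"] for t in trials if t["condition"] == condition]
    return sum(rts) / len(rts)

# Repetition priming = facilitation relative to unprimed (new) words.
priming_same = mean_rt("unprimed") - mean_rt("same_talker")
priming_diff = mean_rt("unprimed") - mean_rt("different_talker")

# A talker-specificity effect is greater priming when the talker repeats.
specificity_effect = priming_same - priming_diff

print(f"Same-talker priming:       {priming_same:.0f} ms")
print(f"Different-talker priming:  {priming_diff:.0f} ms")
print(f"Talker-specificity effect: {specificity_effect:.0f} ms")
```

A positive specificity effect on this arithmetic corresponds to the pattern described above: words repeated by the same talker are recognized more efficiently than words repeated by a different talker.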

Talker

Indexical information refers to characteristics of the talker (e.g. gender, speaking rate, emotional tone of voice) that can be perceived from the speech signal. Talker identification can occur quite rapidly (e.g. Creel & Tumlin, 2011), and human listeners outperform machines in detecting talker changes (Sharma et al., 2019). However, unlike talker identification or the ability to detect a change in talkers, the

Specificity Effects in Spoken-Word Recognition  211 term talker‐specificity effects (or talker effects) refers to the implications that the co‐ occurrence of a particular talker saying a particular word has on spoken‐word recognition. More precisely, talker‐specificity effects are observed when listeners more efficiently recognize words repeated by the same talker (‘lion’ spoken by Sara two times) compared to repeated words spoken by different talkers (lion spoken first by Conor and then by Sara). According to many traditional models of spoken‐word recognition, such as the original version of TRACE (McClelland & Elman, 1986), talker characteristics are discarded in order to access abstract linguistic representations. Evidence has since accumulated supporting episodic models in which lexical representations include indexical information (e.g. Goldinger, 1996). One hybrid approach, according to which both abstract and episodic representations play a role in spoken‐word recognition, underlies the time‐course hypothesis. According to the time‐course hypothesis (Luce, McLennan, & Charles‐Luce, 2003; Luce & Lyons,  1998), indexical information influences word recognition relatively late, that is, after the influence of abstract information. Thus, indexical (e.g. talker) specificity effects are more likely to be observed when processing is relatively slow. Later in this chapter we will consider both evidence in support of  –  and evidence that challenges  –  the time‐course hypothesis. According to Johnson (2006), listeners process words faster when the talker has a stereotypical voice because listening to many voices results in the emergence of a representation that makes recognizing stereotypical voices easier (see also Sumner,  2015). A prediction based on the time‐course hypothesis that could be tested empirically is that greater talker‐specificity effects would be observed when listening to talkers with more unusual voices compared to listening to talkers with more common (stereotypical) voices (Table 8.1, question 2). As discussed by Bent and Holt (2017), listeners are sensitive to social‐indexical talker attributes such as sexual orientation, regional dialect, native language, race, and ethnicity. Despite these considerations, the role of indexical information in everyday conversations remains a matter of debate. For example, Hanique, Aalders, and Ernestus (2013) provided evidence for exemplar effects when listeners are exposed to a small number of trials and a large proportion of repeated words, but no exemplar effects emerged when listening to more trials, a smaller proportion of repeated words, and two (instead of one) types of pronunciation variation, a situation much more likely to occur in everyday settings. Thus, Hanique and colleagues questioned the role of exemplars in natural conversations. While knowing the boundaries of an effect is crucial to theory development, determining the prevalence of specificity effects in everyday settings (Table 8.1, question 3) is essential to understanding the generalizability of these findings. Foreign accents and dialectical variations  The influence of foreign accents on spoken‐ word recognition has been studied extensively. While foreign‐accented speech poses unique challenges, listeners can quickly adapt to accents (Clarke & Garrett, 2004). Furthermore, listeners can generalize to new talkers. In one study by Baese‐Berk, Bradlow, and Wright (2013), participants were exposed to two

212  Perception of Linguistic Properties training sessions over two consecutive days, followed by post‐tests that were given on the second day immediately after the second training session. During both the training and post‐tests, the participants were instructed to write down the sentences they heard. These researchers found that, after training on multiple talkers with five different foreign accents, listeners were able to generalize their learning to a new talker. Listeners’ ability to generalize may be due to their ability to adapt to systematic variability across talkers. Will talker effects be greater when listeners are trained on a particular person, accent, or both? (Table 8.1, question 4). However, the researchers offered an alternative explanation: Perhaps when the situation is not optimal – when perceiving the speech signal is difficult – listeners have a more relaxed criterion for matching the speech input to a lexical representation. When listening to different foreign‐accented speakers, listeners may consider a particular utterance a word even though that same utterance may not have met the standard for a word when listening to native speakers. When listeners have to adapt to many different accents, the criteria for matching speech input to a lexical representation may be relaxed and talker effects could be weaker. In a study by Maye, Aslin, and Tanenhaus (2008), participants listened to a story in a standard accent and then heard the story again by the same talker but after the talker’s pronunciation was altered to sound like a different accent. Participants then completed a lexical decision task in which some of the items would be considered real words if these items were heard in the accent to which participants had been exposed earlier in the experiment, but these items would be considered nonwords if the items were heard in a different accent. Participants were indeed more likely to classify such stimuli as real words, demonstrating that listeners can efficiently adjust to the accent of a particular talker and that lexical decisions can be influenced by a talker’s pronunciation. Importantly, foreign‐accented speech has been used to investigate talker‐­ specificity effects. McLennan and González (2012) conducted two long‐term repetition priming experiments, one with English stimuli and one with Spanish stimuli. Both sets of stimuli were spoken by native speakers of American English and by native Spanish speakers, thus resulting in native‐ and foreign‐accented stimuli in both languages. Not surprisingly, listeners were somewhat slower to recognize words spoken with a foreign accent. Interestingly, these researchers observed greater talker‐specificity effects when listeners recognized words spoken with a foreign accent than when listeners recognized words spoken with a native accent. These results support the time‐course hypothesis: talker‐specificity effects emerged when listeners were slower at processing words because of the foreign accents. Future research should investigate talker‐specificity effects when words are embedded in full sentences (see the subsection on “Sentences” in the “Context” section). Spoken‐word recognition is also influenced by regional variations in speech. According to Clopper and Bradlow (2008), listeners are able to adapt to dialect variation even at moderate noise levels (although listeners are more effective at adapting to dialects with less noise). 
Cai and colleagues (2017) provided evidence that the dialect of the speaker (British or American English) modulates access to

word meaning. These researchers argued for a speaker model of spoken‐word recognition in which listeners first identify the dialect of the talker, and then knowledge of the dialect affects their access to semantic information for all words spoken by that talker. These findings are in line with those of Maye and colleagues (2008), highlighting the relationship between indexical and lexical information. The influence of surface details associated with dialectical variations on the recognition of spoken words remains an area ripe for investigation. On the one hand, talker‐specificity effects could be larger when listening to unfamiliar regional dialects, as reported for foreign‐accented speech. A prediction based on the time‐course hypothesis is that greater talker‐specificity effects would be observed when listening to talkers with unfamiliar regional dialects compared to listening to talkers with a standard local dialect (Table 8.1, question 5). On the other hand, given that differences in event‐related potentials (ERPs) have been observed between processing of dialects (both local and unfamiliar regional dialects) and processing of foreign accents (Goslin, Duffy, & Floccia, 2012), a prediction for dialect specificity based on findings from accent specificity may not hold.

Dysarthria  Mattys and Liss (2008) used naturally occurring degraded speech to refer to unedited speech produced by individuals with dysarthria whose speech is more difficult to comprehend than that of healthy speakers. These researchers found that participants recalled more words when those words were spoken by the same talker than by different talkers. Interestingly, this talker‐specificity effect was greater when the words were spoken by dysarthric talkers than by healthy talkers. In line with the findings reported with foreign‐accented speech, the words spoken by the dysarthric talkers took longer to process. Research on talkers with dysarthria provides further evidence in support of the time‐course hypothesis and adds to the circumstances under which listeners may be more likely to access talker‐specific representations during spoken‐word recognition.

To summarize: In the past decade considerable efforts have been made to identify the conditions under which indexical (including talker) specificity effects emerge. Many findings have supported the argument that talker characteristics play an important role in spoken‐word recognition. After having considered the importance that the talker plays in the representation and processing of spoken words, we now turn our attention to the speech signal itself.

Speech signal

Referring back to Figure 8.1, we see that the speech signal is, of course, another component of the spoken‐word recognition environment. Studies designed to investigate the role that the speech signal plays in listeners' recognition of spoken words demonstrate that listeners are sensitive to the nature of the speech: careful, casual, or spontaneous. A comprehensive model of spoken‐word recognition should account for the representational and processing consequences of variability in the speech signal.

214  Perception of Linguistic Properties Careful speech  Careful speech is hyper‐articulated and planned  –  as is often the case when recording words for an experiment. An electrophysiological study (Viebahn, Ernestus, & McQueen, 2017) revealed that listeners are sensitive to ­syntactic violations when the talker is speaking in a careful speaking style, but not when they are speaking in a casual manner. This finding demonstrates that the talker’s speaking style – whether careful or casual – affects listeners’ processing of the acoustic‐phonetic information in the speech signal, which in turn could have consequences for indexical specificity effects. If listeners process careful speech more quickly, talker‐specificity effects could be larger in casual speech. Listeners are more likely to hear spontaneously produced casual speech than carefully planned speech in their day‐to‐day lives (Johnson, 2004). Therefore, it is important to determine whether careful speech makes talker‐specificity effects less likely to emerge. Casual speech  Casual speech is hypo‐articulated, but not necessarily spontaneous or in sentences. Listeners are more likely to hear reduced word forms in casual speech (Ernestus, Baayen, & Schreuder, 2002). With respect to specificity, perhaps casually produced speech impacts lexical representations such that the role of indexical information is overestimated or underestimated as a result of the overuse of careful speech in experiments. Compared to a casual production, a careful production is presumably a closer match to an abstract representation, so talker‐­ specificity effects may be larger in casual speech. However, the notion of an abstract representation referred to here is one composed of idealized discrete symbols, consistent with traditional phonological theories (Pisoni & Levi,  2007). An alternative notion of abstract representations is one involving representations that are generalized across, or averaged over, instances (e.g. prototypes). In this alternative notion, a casual production may be a closer match to an abstract representation, given that casual productions are more frequent. If so, then, according to the time‐course hypothesis, talker‐specificity effects may be greater in the less frequent careful speech, given that less frequent representations should take longer to access. Whether indexical specificity effects are more or less likely to emerge when listening to casual or careful speech is an empirical question that awaits investigation (Table 8.1, question 6). Spontaneous speech  Spontaneous speech is typically hypo‐articulated and occurs naturally. Words with many phonological neighbors tend to be phonetically reduced in connected spontaneous speech (Gahl, Yao, & Johnson, 2012). Moreover, reductions in spontaneous conversations can change the dynamics of spoken‐ word recognition. More specifically, in a series of eye‐tracking experiments, Brouwer, Mitterer, and Huettig (2012) demonstrated that listeners are more sensitive to mismatches between the acoustic signal and lexical representations when listening to fully articulated speech (e.g. computer) than when listening to reduced speech (e.g. puter). If talker identity is part of a word’s lexical representation, then this work might suggest that listeners would be more sensitive to a change in talkers when listening to fully articulated speech (Table 8.1, question 7).

Specificity Effects in Spoken-Word Recognition  215 Efforts have been made to better understand spontaneous speech, such as the development of the Buckeye Corpus of conversational speech (Pitt et  al.,  2005). Incorporating natural speech into mainstream research is an ongoing effort. As discussed earlier in this chapter, Hanique, Aalders, and Ernestus (2013) reported results from four long‐term priming experiments with reduced speech, and concluded that the role for specific representations during natural conversations may be smaller than assumed. Recall that specificity effects were not observed when listeners heard two (instead of one) types of pronunciation variant. Their study is particularly relevant to the current consideration of spontaneous speech because listeners are likely to encounter many types of pronunciations in the speech heard in everyday life. To summarize the section related to the speech signal, the talkers’ speaking styles (casual or careful) and whether the speech is spontaneous, connected (presented in full sentences, as opposed to isolated words), or includes reductions, can all affect how listeners process spoken words. Understanding the consequences of these differences in the speech signal for indexical specificity effects is a promising area for future investigations. One prediction for casual speech is that talker‐specificity effects would be larger since the signal may be further from an abstract representation. However, the results from Hanique and colleagues (2013) support the opposite view, that talker‐specificity effects are reduced during natural conversations. Researchers should investigate the boundaries of talker‐specificity effects in everyday environments. Next, we consider the listener’s role in specificity effects in spoken‐word recognition.

Listener

The listener is the receiver of the speech signal produced by the talker. Characteristics of the listener influence the way in which spoken words are represented and processed. For example, words may be recognized differently in the listener's first and second languages, by children, younger adults or older adults, or by listeners with and without hearing impairments.

Bilinguals  Listener characteristics, such as being bilingual, influence spoken‐word recognition. Change deafness – listeners' failure to notice a talker change – is more likely in a person's native language (Neuhoff et al., 2014). That is, listeners are less likely to detect a talker change when listening to speech in a language they process without difficulty. When listeners process speech in their second language, they are more likely to hear the change in talkers. The additional time it takes to process speech in a second language presumably allows more thorough processing of the talker‐specific indexical information, which in turn results in the listener being more likely to notice a talker change (see also Vitevitch & Donoso, 2011).

Age  Older adults are at a greater disadvantage than younger adults when processing words in multitalker conditions (Sommers, 1996). These results could be due to diminished cognitive resources, impaired inhibitory control, general

216  Perception of Linguistic Properties slowing, or some combination of these factors. Understanding spoken‐word recognition throughout the life span is crucial for enhancing communication abilities in older populations (see Sommers, Chapter 19). One prediction based on the time‐ course hypothesis is that talker‐specificity effects should be greater for older adults than younger adults because of somewhat slower processing by older adults (McLennan, 2006). According to this account, the additional time older adults spend on processing the incoming speech signal will allow the additional time needed for the indexical information to influence spoken‐word recognition (Table 8.1, question 8). See Krestar (2014) for evidence consistent with this prediction. There is also substantial related research with children, infants, and even fetuses. For example, fetuses have been shown to discriminate between voices (Kisilevsky et al., 2003). Infants as young as two months of age are able to recognize syllables spoken by different talkers (Jusczyk, Pisoni, & Mullennix, 1992). The ability to recognize a particular voice (talker identification) and the ability to distinguish between voices (talker discrimination) are distinct processes. Moreover, as discussed earlier, talker identification and talker discrimination are different from talker‐specificity effects. Interestingly, infants have been shown to be sensitive to whether words were repeated by a familiar or a novel talker (Houston & Jusczyk, 2003). In sum, talker effects have been obtained throughout the life span. Despite their ubiquitous nature, the timing of talker effects, and the mechanisms underlying the roles of talker‐specific representations, may differ across the life span (McLennan, 2006). Next, we discuss how hearing impairments can influence specificity effects. Hearing impairments  Cochlear implants represent a remarkable technological advancement1 (Peterson, Pisoni, & Miyamoto, 2010). Understanding how listeners with communication disorders – including listeners with cochlear implants – recognize spoken words will continue to inform theories and models of spoken‐word recognition (see Hunter & Pisoni, Chapter  20). For example, listeners with and without hearing impairments could be affected differently by variability, such that specificity effects may be more or less likely. A prediction based on the time‐course hypothesis is that greater talker‐specificity effects would be observed for listeners with hearing impairments, as a consequence of somewhat slower processing (McLennan, 2006). Alternatively, it is possible that smaller talker‐specificity effects would emerge in listeners with hearing impairments, since listeners with hearing impairments may develop a less detailed representation of the talker information as a consequence of receiving an impoverished signal (Table 8.1, question 9). Given that older adults are more likely than younger adults to have a hearing impairment, any observed age differences in talker‐specificity effects, as discussed in the previous section, must rule out a sensory deficit explanation. To summarize the listener section, taking into account how various populations of listeners process the speech signal in different ways is an important step in a comprehensive approach to specificity effects in spoken‐word recognition. Furthermore, these interactions take place in a specific context, which is the topic discussed in the next section.


Context It may be tempting to conclude that a talker, a speech signal, and a listener sufficiently account for the spoken‐word recognition environment. However, each of these three components may be influenced by the context in which the communication takes place. In Figure 8.1, we used a photo of a participant performing an experimental task to refer to the context, consistent with a traditional research environment. Researchers must be mindful of the roles that the task and the greater communication environment might play in their results. Consequently, part of what falls under context is a consideration of how naturalistic an environment is, including whether the conditions are adverse (perhaps realistic) or not (perhaps relatively less common in everyday settings). In addition, context refers to the experimental methodologies involved, as well as the linguistic, social, and cultural influences impacting communication. In this section we consider the influence of word meaning, sentence context, audiovisual information, and social representations. Each of these areas contributes to the context in which the communication takes place and in turn has consequences for specificity effects in spoken‐word recognition. Meaning  In many studies, participants are presented with a single word on a given trial. In such cases, language researchers are frequently interested in the recognition of form‐based lexical representations, without consideration of word meaning. However, in other studies, researchers have focused on the role that semantics plays in spoken‐word recognition (see Mirman & Magnuson,  2009). Goh and colleagues (2016) found faster responses to concrete and emotional words. Predictions based on the time‐course hypothesis are that greater talker‐ specificity effects would be observed with abstract (as opposed to concrete) and neutral (as opposed to emotional) words (Table  8.1, questions 10 and 11). In addition, researchers have found that participants falsely recognized unstudied semantically associated words when attending to the talker identity during an earlier encoding phase (Luthra, Fox, & Blumstein, 2018). These findings demonstrate the relationship between semantics and talker‐specific indexical information. Meaning can also influence listeners’ perception of talker information. Listeners’ attitudes toward standard‐accented talkers were found to differ on the basis of passage topic (Heaton & Nygaard, 2011). Standard‐accented talkers were rated as more sociable, likable, and cheerful when producing passages that were rated as more typical of the Southern United States than when producing passages that were rated as not typical of the South. In other words, the content of the message itself is part of the linguistic context that affects listeners’ attitudes toward the talker. Although we know the meaning of passages can influence listeners’ perception of the talker, whether the meaning of passages can also affect the likelihood of obtaining talker‐specificity effects remains an empirical question (Table  8.1, question 12). In one study, greater talker‐specificity effects were obtained with a mixed presentation of taboo and neutral words (Tuft, McLennan, & Krestar, 2016), suggesting that meaning can indeed contribute to the likelihood of obtaining

218  Perception of Linguistic Properties talker‐specificity effects. However, whether this effect would extend beyond individual words to the meaning of passages is unknown. Sentences  Listeners typically hear strings of words in everyday life, and thus efforts to integrate sentence context in spoken‐word recognition research are worthwhile and will increase researchers’ ability to generalize the findings (i.e. external validity). Sentence context can influence listeners’ ratings of talker characteristics. In one study (Incera et al., 2017), participants listened to sentences in which the final word was either predictable or meaningless given the leading sentence. Participants were more likely to rate the same stimuli as being spoken with a “strong accent” when the words were presented in a meaningless sentence. Given that sentence context has been found to influence listeners’ accent ratings, perhaps sentence context also influences the extent to which listeners are influenced by abstract and more specific representations. That is, the degree to which a word is predictable given the sentence may affect the magnitude of talker‐specificity effects (Table 8.1, question 13). One prediction is that, as the predictability of a word decreases, the magnitude of talker‐specificity effects increases, given that reduced predictability likely takes additional time, requires more attentional resources, or both. Research on talker‐specificity effects in sentence context could also be combined with work on foreign‐accented speech, discussed earlier. Given that listeners can adapt quickly to foreign accents, talker‐specificity effects in foreign‐accented speech may not be found when listening to complete sentences. Alternatively, listening to complete sentences spoken in a foreign accent may be particularly challenging – and thus take longer to process – for some listeners, thereby resulting in greater talker‐specificity effects. Whether talker‐specificity effects in complete sentences are larger or smaller for foreign‐accented speakers remains an empirical question (Table 8.1, question 14). We have focused our discussion primarily on variations involving indexical (e.g. talker) information. However, challenges posed by variation include research on allophonic variant perception (Luce & McLennan,  2005). Viebahn and Luce (2018) demonstrated that listeners recognized flapped variants (center produced as cenner) faster in casually produced sentence contexts – another demonstration of the importance of considering the role that sentence context plays in listeners’ perception of spoken words. These researchers also reported that the time needed to recognize flapped variants presented in isolation decreased over the course of the experiment  –  another demonstration that experimental context affects listeners’ perception of spoken words. Audiovisual  Speech perception frequently takes place in an audiovisual environment. Consequently, researchers have investigated whether audiovisual information affects communication. Lachs and Pisoni (2004) argued that visual input needs to be integrated to accurately perceive spoken words. Heald and Nusbaum (2014) hypothesized that seeing a talker’s face could be helpful, in that it would reduce word‐recognition costs in multiple‐talker contexts. However,

Specificity Effects in Spoken-Word Recognition  219 these researchers found that recognition in a multiple‐talker context was even slower in an audiovisual condition, perhaps because seeing a talker’s face increases the importance of talker information. An alternative explanation is that it takes time to integrate the audiovisual information, such that any advantage of distinguishing between talkers is outweighed by the cost of the extra time needed to integrate the visual information. The creation of an audiovisual corpus (Cooke et al., 2006), which to date has been cited 853 times (according to Google Scholar), demonstrates that efforts are being made to stimulate research on audiovisual information. Social  Listeners gather social information from contextual cues. For example, visual information (e.g. the speaker’s face) can shape how listeners process the speech signal. Researchers have found that Korean speakers are rated as more accented in an audiovisual than in an audio‐only condition (Yi et  al.,  2013). Furthermore, the language a particular speaker is expected to use influences how listeners process the speech signal. Contextual cues (e.g. mother speaks L1, father speaks L2) may provide the necessary input to generate separate lexical networks for each language (Grainger, Midgley, & Holcomb, 2010). These cues will allow the listener to inhibit a particular language when the speaker is unlikely to use that language. Will talker effects be larger when the speaker is using an unexpected language? (Table 8.1, question 15). Sumner and colleagues introduced a dual route approach (Sumner,  2015; Sumner et al., 2014) in which listeners map the incoming speech signal onto both linguistic and social representations. According to this dual route approach, information about the talker, which the listener gathers from the variability in the speech signal (age, gender, accent, emotion, style, etc.), can influence spoken‐word recognition. Furthermore, mapping to social representations has the potential to introduce social biases when processing spoken words. Social weighting of words suggests that listeners’ biases influence the way listeners attend to, and later remember, speech. The use of social information in a talker’s voice during spoken‐ word recognition is consistent with an exemplar lexicon  –  a lexicon in which indexical information is stored. In other words, this dual route approach assumes that talker‐specific details can influence spoken‐word recognition. Talker‐specific details could play more or less of a role, depending on the social weighting of particular talkers (Table 8.1, question 16). Consequently, taking social information into account is crucial to understanding the larger context in the spoken‐word recognition environment. Although much research remains to be done to determine how social contexts influence spoken‐word recognition, work in this area holds promise for contributing to our understanding of language in more realistic contexts. These efforts will increase external validity and are likely to improve theories and models of spoken‐word recognition. Valerie Hazan tweeted in 2016 that Ann Bradlow suggested the use of the term real‐life instead of adverse conditions. We agree. Researchers have considered laboratory settings as the standard  –  understandably, given the advantage of control  –  but everyday communications most

220  Perception of Linguistic Properties frequently occur in less optimal settings. Spoken‐word recognition models are largely based on findings from tasks performed by healthy listeners, on carefully recorded speech, in quiet environments, and under conditions of undivided attention (Mattys et al., 2012). In the special issue on “Speech recognition in adverse conditions” of the journal Language and Cognitive Processes, researchers emphasized the importance of developing theories using data collected in a wide range of situations. Cooper and Bradlow (2017) reported specificity effects that emerged as a result of a change in talkers or background noises. In particular, listeners in their study were more accurate in a recognition memory task when the talker was repeated from study to test compared to when the talker changed, and a similar benefit emerged for recognizing words in matching  –  compared to mismatching – background noise. Their study highlights the importance of considering word recognition in adverse listening conditions. Integrating evidence from conditions encountered in everyday life is crucial for advancing the field (Tucker & Ernestus, 2016). To summarize the context section, theoretical arguments and empirical findings demonstrate the importance of considering the experimental task and the greater communication environment. Furthermore, the conditions when listening to spoken words also represent an important aspect of context. The conditions could be ideal (relatively infrequent in everyday settings) or adverse (more common in everyday settings). The type and degree of any adverse conditions is likely to be consequential, as would be the meaning of the words, whether the words are presented in isolation or in sentence context, and whether the words are only heard or co‐occur with visual information. Linguistic, social, and cultural influences on communication are all part of the broader context in which spoken words are recognized. All these influences are likely to impact the degree to which listeners represent and process indexical variability. Numerous empirical questions remain to be answered to explain how we process indexical information and how – and where – these specific representations are stored.

Theoretical frameworks

In this section, we focus on empirical tests of the time‐course and attention‐based hypotheses. Furthermore, we present competing theoretical frameworks regarding the locus of specificity effects. We discuss whether specificity effects emanate from the mental lexicon or from a more general memory system. Finally, we discuss what new questions need to be tackled by researchers interested in indexical specificity effects in spoken‐word recognition.

Time‐course hypothesis

The time‐course hypothesis is supported by data showing that talker‐specificity effects emerge when the task is difficult (McLennan & Luce, 2005; Vitevitch & Donoso, 2011), when words are spoken with a foreign accent (McLennan &

Specificity Effects in Spoken-Word Recognition  221 González,  2012), when words are spoken by dysarthric individuals (Mattys & Liss,  2008), and when words are low frequency (Dufour & Nguyen,  2014). Furthermore, within‐talker specificity effects in emotional tone of voice are also consistent with the time‐course hypothesis. Krestar and McLennan (2014) observed within‐talker specificity effects for changes in emotional tone of voice for words spoken – all by the same female talker – in frightened and sad tones of voice in a difficult lexical decision task (when processing was relatively slow) but not in an easy task (when processing was relatively fast). Listeners were faster at recognizing words spoken in the same tone of voice across two blocks of trials (frightened both times or sad both times) than when the tone of voice changed across the blocks (from frightened to sad or vice versa). Importantly, these specificity effects only emerged in the slower, more difficult, task. A change in tone of voice had no such effect in the faster, easier task. Indexical variability influences spoken‐word recognition differently in each brain hemisphere. Moreover, there may be a connection between hemispheric differences in specificity effects and the time‐course hypothesis. In visual recognition, abstract information is processed more efficiently by the left hemisphere, and specific information by the right hemisphere (Marsolek,  2004). González and McLennan (2007) found the same pattern with spoken words. The right hemisphere – but not the left  –  benefits from matches in indexical information. These findings suggest that linguistic and indexical information may be processed by different neural networks in the brain. These results inform models of spoken‐word recognition by highlighting structural processing differences between linguistic and indexical information. As discussed in González and McLennan (2009), relatively fast auditory changes, such as those that would distinguish abstract segments of spoken words, are preferentially processed in the left hemisphere. Relatively slow auditory changes, such as those that might capture more detail (e.g. features associated with indexical variability), are preferentially processed in the right hemisphere. Papesh, Goldinger, and Hout (2016) argued that it is important to understand the role that response‐time scaling  –  effects typically increase with increases in reaction times (RT)  –  may play in data supporting the time‐course hypothesis. Measures that do not rely on response time (e.g. event‐related potentials, ERP) are likely to be particularly informative in addressing this point. An ERP study by Dufour and colleagues (2017) found that a talker change reduced the priming effect only in a relatively late N400 time window, and only for low‐frequency words. These researchers conclude that their results provide evidence that talker‐ specific information affects later stages of spoken‐word recognition. Although this study differs from some of the earlier research in that the task involves dichotic listening and it is a short‐term priming study, these findings support the conclusion that results from the time‐course hypothesis are not entirely driven by response‐ time scaling. A long‐term priming ERP study would provide more direct evidence of the timing of indexical specificity effects independent of response‐time scaling (Table 8.1, question 17). 
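As a rough illustration of the response‐time scaling concern, the sketch below contrasts a raw specificity effect with the same effect expressed as a proportion of each condition's overall mean RT. The condition names and RT means are invented, and proportional scores are only one simple check among several possible; the sketch is not a substitute for the distributional or ERP approaches discussed above.

```python
# Illustrative sketch (hypothetical means): does a larger raw specificity
# effect in a slow condition simply reflect response-time scaling?
# One coarse check is to express each effect relative to overall speed.

conditions = {
    # condition name: (mean RT same-talker, mean RT different-talker) in ms
    "easy_task": (600, 610),
    "hard_task": (900, 925),
}

for name, (same, different) in conditions.items():
    raw_effect = different - same                # absolute effect in ms
    overall_mean = (same + different) / 2
    proportional = raw_effect / overall_mean     # effect relative to overall RT
    print(f"{name}: raw = {raw_effect} ms, proportional = {proportional:.3f}")

# If proportional effects are comparable across conditions, the larger raw
# effect in the hard task could be consistent with simple RT scaling; if the
# proportional effect is also larger, scaling alone is a less likely account.
```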
In addition to the issue of response‐time scaling, there are findings that challenge, or are inconsistent with, the time‐course hypothesis. Creel, Aslin, and

222  Perception of Linguistic Properties Tanenhaus (2008) obtained evidence (data from their experiment 1) that talker‐ specificity effects can occur relatively early. These researchers used eye tracking to investigate the time course of competitor effects as a function of listeners’ experiences with specific talker–word pairings. After repeated pairings of words and cohort competitor words (e.g. couch, cows), in which the words were either spoken by the same talker or by different talkers, words from different‐talker pairs showed fewer fixations to competitors than words from same‐talker pairs. This difference, in which talker‐specific information influenced spoken‐word recognition, was detected as early as approximately 500 milliseconds after word onset. However, as Papesh, Goldinger, and Hout (2016, p. 4) noted, “there are methodological factors that complicate interpretation. Perhaps most salient, the sheer number of word‐ voice repetitions used by Creel et  al. (2008) may have ‘forced’ their observed effect.” Nevertheless, Papesh and colleagues also reported finding relatively early talker‐specificity effects, a finding that is consistent with the conclusion of an earlier study by Maibauer et al. (2014; see also, Tuft, McLennan, & Krestar, 2016) that talker‐specificity effects do not always come into play relatively late. These findings point to the idea that attention plays an important role in talker‐specificity effects.
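For readers unfamiliar with how such eye‐tracking data are summarized, the sketch below shows one way to compute the proportion of competitor fixations per time bin, separately for same‐talker and different‐talker pairings. The fixation sequences, bin size, and coding scheme are made‐up illustrative assumptions, not Creel and colleagues' actual materials or procedure.

```python
# Illustrative sketch with invented data: summarizing competitor fixations
# over time. Each trial is a list of fixated objects per 50 ms bin from word
# onset ("T" = target, "C" = cohort competitor, "O" = other object).

import statistics

same_talker_trials = [
    ["O", "C", "C", "T", "T", "T"],
    ["C", "C", "T", "T", "T", "T"],
]
different_talker_trials = [
    ["O", "C", "T", "T", "T", "T"],
    ["O", "O", "T", "T", "T", "T"],
]

def competitor_proportion(trials):
    # Proportion of trials fixating the competitor in each time bin.
    n_bins = len(trials[0])
    return [
        statistics.mean(1.0 if trial[b] == "C" else 0.0 for trial in trials)
        for b in range(n_bins)
    ]

print("Same talker:     ", competitor_proportion(same_talker_trials))
print("Different talker:", competitor_proportion(different_talker_trials))
# Fewer competitor fixations for different-talker pairs, emerging early in
# the time course, would correspond to the pattern described above.
```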

Attention

In addition to time, attention is an important factor influencing the likelihood that indexical information affects spoken‐word recognition. Maibauer and colleagues (2014) found that famous talkers – Barack Obama and Hillary Clinton – elicited talker‐specificity effects in a speeded shadowing task, which taps into processing earlier than a delayed shadowing task. Given that listeners were presumably paying greater attention to words spoken by famous talkers, these results support the idea that attention is a mechanism that influences the likelihood of obtaining talker‐specificity effects. Theodore, Blumstein, and Luthra (2015) proposed that encoding factors – attention to talker information when first hearing a word – play a role in talker‐specificity effects. In a series of experiments, talker‐specificity effects emerged only when participants attended to the talkers (talker gender) and not when participants attended to lexical (syntactic) characteristics. As a result, these researchers emphasized the importance of drawing listeners' attention to the indexical properties (i.e. the talker‐specific information), and argued that it is crucial to do so at encoding. However, a study by Tuft, McLennan, and Krestar (2016) suggests that increasing listeners' attention in general (e.g. increased arousal) may be sufficient for producing talker‐specificity effects. Using taboo words, these researchers found that talker‐specificity effects emerged in an easy lexical decision task – for female talkers with mixed presentation of taboo and neutral words – presumably as a result of the taboo words increasing participants' attention in general. The three studies just reported support the conclusion that indexical information influences recognition when attention is increased – as a result of the nature of the talkers (famous), the task (gender), or the words (taboo). The issue of whether

Specificity Effects in Spoken-Word Recognition  223 attention influences talker‐specificity effects only at encoding, or during both encoding and retrieval, remains an empirical question (Table 8.1, question 18).

Lexicon or general memory The nature of the specificity debate has evolved over the years. The position that variability does not have long‐term consequences for lexical access (consistent with a pure abstractionist position) is no longer tenable. Substantial evidence from many different researchers  –  working in different labs, countries, and languages  –  clearly demonstrates otherwise. However, the current nature of the debate – likely to be played out in the coming years – is whether the long‐term consequences of variability are coming from the lexicon, consistent with episodic theories of the lexicon, or from what researchers have referred to as episodic memory (which we will refer to as general memory, for reasons to be explained). Sound‐specificity effects  Recent research on sound‐specificity effects has brought the theoretical question of the locus of specificity effects to the forefront. Pufahl and Samuel (2014) conducted a series of experiments in which participants heard spoken words paired with environmental sounds (e.g. a phone ringing). During an initial exposure phase, participants were instructed to make an animacy judgment by responding animate (e.g. in response to a butterfly) or inanimate (e.g. in response to a hammer) on each trial. During a subsequent test phase, participants were instructed to identify filtered versions of the stimuli heard during exposure. Compared to the exposure phase, the stimuli heard during the test phase had either the same talker paired with the same background sound, a different talker paired with the same background sound, the same talker paired with a different background sound, or both a different talker and a different background sound. These researchers obtained sound‐specificity effects; that is, changes in a co‐­occurring environmental sound (e.g. a different phone ringing sound) resulted in a cost in spoken‐word recognition. These findings extend talker effects (talker changes negatively affect spoken‐word recognition) to environmental sounds (sound changes negatively affect spoken‐word recognition). Strori et al. (2018) subsequently tested two alternative explanations for sound‐ specificity effects, namely whether the background sound must be integral with the spoken word (as opposed to distinct auditory objects) or whether co‐­occurrence of a background sound and a spoken word would suffice. These researchers obtained a sound‐specificity effect only when the background sound was integral with the spoken word; mere co‐occurrence was insufficient. According to these findings, only environmental sounds that are integral with the word influence recognition. Building on the work of Pufahl and Samuel (2014), Cooper and Bradlow (2017) investigated specificity effects for talker details, which are intrinsic to the speech signal, and for background noises, which are extrinsic to the speech signal. These researchers found sound‐specificity effects even in stimuli in which the speech and noise were spectrally separated, and found that lexical characteristics moderated

224  Perception of Linguistic Properties the extent to which specific information influenced spoken‐word recognition. Their findings provide compelling evidence that changes in speech‐extrinsic noises can affect recognition performance. Implicit memory  Kessler and Moscovitch (2013) provided evidence that implicit and explicit memory are involved in repetition priming, and proposed a dual route model in which decisions can be made using a lexical route or a strategic route involving episodic memory. Similarly, we argue that specificity effects have the potential to emerge from both the lexicon and general memory. In particular, we support a hybrid approach to the lexicon in which lexical representations include relatively abstract information and specific details that capture (at least) some indexical information inherent in speech. We believe (at least tentatively) that phonetically relevant specificity effects are likely to originate from the lexicon, consistent with the phonetic relevance hypothesis (Sommers & Barcroft, 2006), and that other specificity effects (e.g. background noises) are likely to originate from general memory. There are several empirical findings consistent with this possibility: processing differences between phonetically relevant and phonetically irrelevant details (e.g. Cooper, Brouwer, & Bradlow, 2015); the results by Strori et al. (2018) discussed earlier; the discussion by Strori (2016) that sound‐specificity effects are dependent on experimental context and are more fragile than indexical specificity effects; and other asymmetries between indexical specificity effects and sound‐specificity effects reported in the literature (e.g. in Pufahl & Samuel’s [2014] experiment 2). Moving forward, researchers must clarify theoretical positions in which specificity effects are not restricted to the mental lexicon by explicitly naming the memory system(s) involved. While a domain‐general memory system(s) is viable, episodic memory does not seem tenable given that many findings of specificity almost certainly reflect implicit memory. In fact, when discussing the distinction between explicit and implicit memory, Pufahl and Samuel (2014) noted that indexical effects are more reliable in implicit memory tests, and that if the locus of indexical effects were a more general memory system, probing participants’ explicit memory should increase – not decrease – indexical effects. It may be tempting to conclude that implicit processing implies accessing the mental lexicon and explicit processing necessarily involves episodic memory. However, it is possible to tap into the lexicon explicitly or implicitly. Similarly, implicit processing may involve a general memory system(s), tap into the lexicon, or both. We call for a clear distinction to be made between episodic information and episodic memory. Episodic information refers to details associated with a particular episode (e.g. that John said the word house 10 minutes ago). Episodic memory refers to a memory system. Episodic memory is a type of declarative memory. By definition, people are consciously aware of the information, and as such are able to explicitly report on the details associated with that memory. When spoken‐word recognition researchers (including ourselves) have referred to episodic information, they may not have intended to invoke this idea of episodic memory – a declarative memory system. If so, then participants should always be consciously aware of

Specificity Effects in Spoken-Word Recognition  225 talker–word changes and should always be able to report on such changes (e.g. a male said the word house the first time, 10 minutes ago, and a female said the word house the second time, three minutes ago) – a suggestion we find highly unlikely. Consequently, in the discussion that follows we will refer to any position that specificity effects are coming from outside the mental lexicon as coming from general memory (as opposed to episodic memory) because we believe that the important distinction intended by language researchers is whether the locus of obtained specificity effects is the mental lexicon, which is a specialized subset of long‐term memory, or elsewhere. In our opinion, this elsewhere is unlikely to be episodic memory, which is traditionally defined as a declarative memory system, because of the implicit nature of many indexical specificity effects. Alternative possibilities  Given the two locations presented above (mental lexicon, general memory), four possibilities emerge – laid out in Figure 8.2 – when considering where abstract linguistic information and more specific surface details may be stored. Underlying these four possibilities are two fundamental questions: First, is there a specialized subset of long‐term memory? The mental lexicon is illustrated in options B, C, and D, but not in option A. Second, assuming the answer to the first question is yes, what type(s) of information get stored in the mental lexicon? All four options assume background noises, phonetically relevant information (e.g. talker‐specific details, emotional tone of voice), and abstract word forms (i.e. abstract form‐based lexical representations) can be stored outside the mental lexicon. The critical issue is what type of information “gets in” to the mental lexicon. According to the No lexicon approach (option A), the mental lexicon is not a separate, specialized area of memory. This approach is consistent with Pufahl and Samuel’s (2014) argument that lexical representations should be considered a subset of auditory memory. Researchers have argued for decades about whether or not speech is special. Titone and colleagues (2017) discussed in detail the historical importance of this controversy, including the role that Skinner, Chomsky, Fodor, and other influential researchers in the field played in establishing this – still unresolved – debate. As Titone and colleagues (2017) put it, this debate lives on in today’s social media feeds with modern‐day “Chomskyans” – forever arguing that language is special – pitted against “R‐savvy usage‐based or cognitive linguists” who argue for a general learning mechanism. While this debate is much bigger than the issue of specificity, its conclusions will directly inform whether option A is tenable. The remaining alternatives (B, C, and D) propose that the mental lexicon is a separate subsystem. The differences between these three alternatives emerge when considering what type of information “gets in.” According to the Abstract‐only lexicon approach (option B), all specific details, both relevant and irrelevant, are only stored outside the mental lexicon. If indeed specific details are not stored in the mental lexicon, then perhaps an abstractionist position of the lexicon can be saved after all. 
According to the Abstract and relevant details lexicon approach (option C), the mental lexicon contains both abstract representations and specific details that are phonetically relevant to spoken‐word recognition – with other specific details

Option A: No lexicon. General memory: background noises, phonetically relevant information, abstract word form. (No separate mental lexicon.)

Option B: Abstract-only lexicon. General memory: background noises, phonetically relevant information, abstract word form. Mental lexicon: abstract word form.

Option C: Abstract & relevant details lexicon. General memory: background noises, phonetically relevant information, abstract word form. Mental lexicon: phonetically relevant information, abstract word form.

Option D: Everything in lexicon. General memory: background noises, phonetically relevant information, abstract word form. Mental lexicon: background noises, phonetically relevant information, abstract word form.

Figure 8.2  The large oval represents general memory, the smaller oval (in options B, C, and D) the mental lexicon. Different information (background noises, phonetically relevant information, abstract word form) could “get in” to the mental lexicon or be stored only in general memory.

that are not phonetically relevant to spoken‐word recognition being stored outside the lexicon. In this approach only phonetically relevant specificity effects will be emerging from the lexicon. Finally, according to the Everything in lexicon approach (option D), there is a separate mental lexicon, which contains both abstract representations and phonetically relevant specific details as well as specific details that are not phonetically relevant. An Everything in lexicon approach (option D) is the least parsimonious approach and we suspect few, if any, researchers would support this alternative. There are, of course, many different ways to implement each of the four options laid out in Figure 8.2 into a specific model. These alternatives are not models of spoken‐word recognition. These options simply cover the four possibilities of where abstract linguistic information and more specific surface details might be stored. Furthermore, the abstract and specific information for any particular word could be interpreted as entirely separate representations or as parts of a single distributed representation. For example, in an Abstract and relevant details lexicon approach (option C), a listener hearing the word house spoken by a male with a phone ringing in the background could store three separate representations, one in general memory and two in the mental lexicon. If this is the case, one representation with the phone ringing detail would be stored only in general memory, a second with the male detail (but not the phone ringing detail) would be stored in the mental

lexicon, and a third representation with no specific details and only the abstract form‐based linguistic information – house – would be stored in the mental lexicon. Alternatively, the phone ringing detail, the male detail, and the abstract linguistic information – house – could be parts of a distributed but single representation. If this is the case, the phone ringing portion of the single distributed representation would be stored in general memory, while the male and the abstract linguistic information portions of the representation would be stored in the mental lexicon. Although it is difficult to tease apart these alternatives behaviorally, neuroscientific techniques may be more informative in addressing this particular question.

The four options laid out in Figure 8.2 are important to the theoretical question regarding the locus of the representations underlying specificity effects (Table 8.1, question 19). However, these options are not relevant for testing the time‐course or attention‐based hypotheses, which are agnostic to the issue of where the representations are stored. In an Abstract‐only lexicon approach (option B) indexical information is not stored in the same place as abstract word forms, which seems to support the idea that there are different time courses for accessing these different types of information. However, such a difference in storage location is not necessary for the time‐course hypothesis. Whether or not these types of information are stored in the same location, indexical information and abstract word forms could be processed at different times. For example, abstract information may be processed faster because it is more frequent than indexical information, even if abstract and indexical information are stored in the same place (options A, C, and D). A similar point could be made about the hemispheric differences work discussed earlier (González & McLennan, 2007, 2009; Marsolek, 2004), in that several of the presented options could account for the findings. On the one hand, the finding that talker effects differ in the left and right hemispheres, which supports the idea that linguistic and indexical information may be processed by different neural networks in the brain, is consistent with an Abstract‐only lexicon approach (option B). It is possible that the lexicon resides (at least primarily) in the left hemisphere and other types of surface information (both phonetically relevant and background noises) are represented and processed in one or more general memory systems that reside (at least primarily) in the right hemisphere. On the other hand, an Abstract and relevant details lexicon (option C) could also account for hemispheric differences. It is possible that more abstract aspects of the lexicon are processed more efficiently in the left hemisphere and specific details are processed more efficiently in the right hemisphere. Finally, the fact that the same pattern of specificity effects was reported with environmental sounds (González & McLennan, 2009) is consistent with the idea that there is no separation between words and other types of environmental sounds, such as background noises (options A and D).

When considering empirical evidence that bears on the four possibilities laid out in Figure 8.2 – regarding where abstract linguistic information and more specific surface details might be stored – it is important to fully consider the nature of the task.
For example, an old/new task, in which participants are asked to indicate on each trial whether they have heard that word earlier (old) or not (new) (and other explicit memory tasks) could be tapping into general memory (the larger oval),

228  Perception of Linguistic Properties and a lexical decision task, in which participants are asked to indicate on each trial whether the sound they are hearing is a real word or a nonword (and other spoken‐ word recognition tasks) could be uniquely tapping into the mental lexicon (the smaller oval). There is value in triangulating effects across tasks. It is important to determine whether effects are generalizable or task specific. Any differences in results due to the use of different tasks would provide an opportunity to learn about the boundary conditions and the role that context – in this case, the participants’ task – plays in theoretical questions such as representational specificity. For example, if an Abstract and relevant details lexicon approach (option C) is correct, then episodic memory tasks should be more likely than spoken‐word recognition tasks to elicit sound specificity effects (Table 8.1, question 20) and spoken‐word recognition tasks should be more likely than episodic memory tasks to elicit indexical specificity effects in spoken‐word recognition (Table 8.1, question 21). New questions  New empirical questions emerge in light of the options presented in Figure 8.2. For example, when considering issues related to the speech signal, discussed in the “Comprehensive approach” section of this chapter, it may be the case that certain types of speech (e.g. casually produced words) are more likely to be represented in the mental lexicon, and other types of speech (e.g. carefully produced words) are more likely to be represented in a general memory system (or vice versa). Furthermore, if an Abstract and relevant details lexicon approach (option C) is correct, then sound‐specificity effects should be more likely or more robust in memory tasks than in spoken‐word recognition tasks. When considering the Abstract and relevant details lexicon approach (option C) in Figure  8.2, it may not always be clear whether variability along a particular dimension (e.g. background noise when hearing a spoken word) is relevant: relevance may operate along a continuum. As Cooper and Bradlow (2017) point out, in some situations background noise may impact spoken language processing and comprehension. Part of the challenge of distinguishing between these four alternatives is to determine (or to understand how the system determines) which types of information are stored in a general memory system and which are stored in the mental lexicon. As Strori and colleagues (2018) acknowledge, in cases in which it is difficult to segregate a background sound from a spoken word, the sounds may be perceived as an integrated auditory item. It is possible that only sources that are highly integrated and regularly (systematically) co‐occur with spoken words, such as talker‐specific details, are stored in the mental lexicon, and other sources of variability that do not regularly co‐occur with spoken words, such as background noises, are stored in the general memory. In Pufahl and Samuel’s (2014) approach, although the lexicon is not inherently separate from nonlinguistic information (option A), words are thought to form clusters in different regions from other types of auditory sounds (e.g. environmental noises). Perhaps the extent to which words form clusters (see the earlier discussion about separate versus distributed representations) that are in different regions than clusters for environmental sounds depends, at least in part, on the degree to which the words and sounds are integrated and co‐occur. 
Furthermore, it is important to clarify how a cluster of

Specificity Effects in Spoken-Word Recognition  229 words in general memory (option A) differs from the idea of the mental lexicon (options B, C, and D). When considering which memory system(s) will be implicated, two interrelated questions merit consideration. The first question involves whether the hippocampus and surrounding areas in the medial temporal lobe are involved in the representation of episodic information, and the second question considers whether implicit processing of indexical specificity effects could also activate the hippocampus and surrounding areas in the medial temporal lobe. Although the hippocampus and surrounding areas are known to be crucial for declarative episodic memory, there is also evidence that those areas are important for implicit processing. Therefore, even if episodic information is only stored outside the mental lexicon, and even if the hippocampus and surrounding areas are involved in the representation and processing of episodic information, listeners may not have conscious access to the co‐occurrence of talker–word pairings. First, if a general memory system(s) is involved in the representation of episodic information, could this system include the hippocampus and surrounding areas in the medial temporal lobe? These areas are known to be crucial for declarative episodic memory. If so, the prediction could follow that talker‐specificity effects would not be found in patients with anterograde amnesia who have damage to the hippocampal system and surrounding areas and have deficits in forming new episodic memories. Indeed, the absence of such an effect has been reported. Schacter, Church, and Bolton (1995) found equivalent priming in same and different talker conditions (i.e. no talker‐specificity effects) in patients with anterograde amnesia, despite finding talker‐specificity effects in control participants (i.e. participants without anterograde amnesia). Second, does implicit processing of indexical specificity effects also activate the hippocampus and surrounding areas in the medial temporal lobe? If not, then the involvement of the hippocampus and surrounding areas would seem to indicate that participants are indeed conscious of the co‐occurrence of, for example, a particular talker saying a particular word. However, there is evidence that the hippocampal system and surrounding areas are also involved in implicit processing (Hannula, Ryan, & Warren, 2017; Ramos, Marques, & Garcia‐Marques, 2017), especially contextual binding (Chun & Phelps,  1999). Consequently, impaired voice‐ specific priming in patients with amnesia (Schacter, Church, & Bolton, 1995) does not necessarily imply that conscious involvement is required for talker‐specificity effects (see also Schacter, Dobbins, & Schnyer, 2004). Further research is needed to determine whether listeners are necessarily conscious of the repeated co‐occurrence of the linguistic (word) and indexical (e.g. talker) information (Table  8.1, question 22). In summary, additional work on indexical and environmental sound‐specificity effects is needed, including research aimed at providing a better understanding of the interplay between timing, attention, hemispheric differences, and – likely – other factors yet to be identified that influence specificity effects (e.g. Table 8.1, question 23). Furthermore, it is crucial to incorporate data bearing on these theoretical questions into models of spoken‐word recognition. Moving forward, efforts should

also be directed at determining the conditions under which specificity effects originate from the lexicon, from general memory, or both. Although many researchers currently agree that indexical information plays a role in spoken‐word recognition, exactly how and when indexical information is processed, and where indexical information is represented, remain matters of debate.

In the final section, we shift our attention to the future of the field. We discuss issues that should be accounted for when developing theoretical accounts of specificity effects. We cover the importance of considering interdisciplinary approaches to answering these questions, and the practical applications that emerge from this research. Lastly, we point to specific goals that the field should embrace.

Final thoughts A comprehensive approach to specificity effects includes investigating how the talker, speech signal, listener, and context influence spoken‐word recognition. Additionally, careful consideration of methodological and analytical advancements is necessary for a more complete understanding of the underlying dynamics of specificity effects. Measuring the online dynamics of spoken‐word recognition is being realized in part by capitalizing on measures with good temporal resolution, including eye tracking, computer mouse tracking, and ERPs. Investigating how effects unfold over time is crucial to understanding the way spoken words are represented and processed. Additional cognitive neuroscience techniques (e.g. fMRI) have also been used to improve our understanding of the neural mechanisms at play, and of language processes more generally. We believe that these techniques should complement, rather than replace, behavioral techniques. An important challenge for future research is triangulating between these diverse methodologies (Incera, 2018). As dynamic approaches gain traction, new statistical analyses are being developed to extend our understanding of these online measures. For example, growth curve analyses have been used to analyze eye‐tracking (Mirman, 2014) and mouse‐ tracking (Incera & McLennan, 2016) data. In addition, individual differences have often been disregarded in the past, but new research is increasingly likely to consider analyses at the participant level. Novel statistical techniques (e.g. mixed models: see Baayen, Davidson, & Bates, 2008) leave researchers better equipped to investigate how different words are processed by different participants. These new advancements should prove beneficial in future empirical investigations. We argue that these approaches are particularly well suited to addressing some of the remaining empirical questions discussed throughout this chapter (see Table 8.1). We believe that a comprehensive model (or class of models) of spoken‐word recognition should be informed by the latest empirical and theoretical developments on specificity effects. Models must explicitly account for how indexical information is represented and processed in spoken‐word recognition. According to Dahan and Magnuson (2006), a reconceptualization – or even an “all‐out paradigm shift” – is necessary to integrate new developments in the field. More than a

Specificity Effects in Spoken-Word Recognition  231 decade later, we agree with the possibility that a paradigm shift is in the making, and that “new” ideas may inform further advances. The extent to which some of these ideas are new may be a matter of debate. For example, the discussion regarding the source of specificity effects (the mental lexicon or a general memory system) is reminiscent of the longstanding debate about whether or not speech is special, and thus may have given some readers a sense of déjà vu. Nevertheless, it is important to revisit longstanding discussions in light of more recent empirical findings and novel methodological approaches.

Interdisciplinary, basic, and applied research Interdisciplinary collaborations are crucial for the development of new ideas; consequently, it is important to encourage such efforts. Research on specificity effects has always been at the crossroads between psychology, linguistics, and neighboring fields, so interdisciplinary collaborations between these fields frequently develop naturally. Furthermore, given novel technological advances, collaborations between scientists studying spoken‐word recognition and computer scientists are also likely to increase in the coming years. These collaborations could further enhance research on indexical specificity by helping researchers collect data in more naturalistic ways (e.g. PsyGlass: Paxton, Rodriguez, & Dale, 2015), and by harnessing the power of big data to determine what an “average talker” actually sounds like. Indexical specificity effects are influenced by how frequently a particular listener is exposed to a word spoken by a talker, thus, these approaches will allow researchers to better measure word frequencies in everyday life outside the laboratory. In a political climate where conclusions of the scientific community are often disregarded, it is particularly important for scientists to fight for basic research. In addition, it is increasingly important to consider the potential implications of basic findings and to explore new ways to transmit this knowledge to the public. We encourage researchers to carefully consider extensions of their work. There are many examples in which basic research findings have led to incredibly useful applications, including the implications of basic speech research for earwitness testimony (Mullennix et al., 2010). Mullennix and colleagues reported data demonstrating that some properties of speech could possibly contribute to errors during earwitness testimony for a suspect’s voice. One area for future research is the examination of applied settings in which talker‐specificity effects have the potential to be particularly consequential (Table  8.1, question 3). There are numerous situations in which having a listener take longer to process a spoken word could be critical, cases in which even milliseconds could really matter, such as one involving a pilot or an emergency services operator. Considering innovative ways to develop interdisciplinary collaborations and to apply findings from basic research are an important step for the future of research on specificity effects. Doing so will help maximize the impact of research in the area, contribute to a greater understanding of the importance of programs of research focused on understanding the representations and processes involved in

spoken‐word recognition, and lead to increased funding opportunities and continued research efforts.

New goals Fortunately, researchers who examine specificity effects are no longer limiting investigations to the study of carefully produced words in isolation. Although this approach should continue – to maximize control – the field is now equipped for investigations with a wider range of talkers, signals, listeners, and contexts. Including naturally produced words in research will help ensure that results from the laboratory are applicable to everyday settings. Accounting for talker‐­specificity effects when words are spoken by a wider range of talkers – including talkers with foreign accents, dialectical variations, and dysarthria  –  presents difficulties, but meeting these challenges will help to avoid an understanding of the field that is limited to words spoken by particular groups of talkers (e.g. healthy native speakers with a “standard” dialect). Similarly, including casual and spontaneous speech will help avoid building models of spoken‐word recognition that are based only (or primarily) on a signal containing careful speech. Doing so could be particularly problematic given that listeners are more likely to encounter casual spontaneous speech in everyday life. In addition to the talker and the speech signal, efforts to include a wider range of listeners – including nonnative listeners, older adults, and listeners with hearing impairments, such as patients with cochlear implants  –  will lead to an understanding of indexical specificity effects that is generalizable across different populations. Finally, the complete communicative context merits careful consideration. Here we are referring both to the experimental context (e.g. the participants’ task, words spoken in isolation or sentence context, audiovisual integration) and to the social and cultural factors that are likely to influence talker effects. To maximize the impact of research, it is increasingly important to consider relationships between language and other cognitive processes (e.g. memory, attention, perception, categorization). For example, there may be a connection between desirable difficulties in learning/memory research and talker‐specificity effects in spoken‐word recognition. The term desirable difficulties refers to the notion that some conditions that create initial challenges for the learner often lead to better memory in the long run (Bjork, 1994). Recall that earlier in the chapter we mentioned that a possible extension of the work by Dumay and Gaskell (2007) was to compare talker‐specificity effects with and without intervening sleep. Perhaps listening to different talkers repeat a particular word is a desirable difficulty, such that initial lexical access is relatively difficult (compared to listening to the same talker repeat the word), but desirable, in that this variability leads to a more robust long‐term consolidated memory of the spoken word. See Bjork and Kroll (2015) for an example of using desirable difficulties as a connection between memory and language research. Other relationships, such as those between talker‐specificity effects and other types of auditory processing, also merit consideration. The finding that musical

Specificity Effects in Spoken-Word Recognition  233 training is associated with an advantage in talker identification (Xie & Myers, 2015) suggests that there are important connections between music perception and sensitivity to indexical variability (Zetzer,  2016). New empirical findings are resulting in a reconceptualization of the way in which we view spoken‐word recognition. Indexical specificity should be studied in relation to other relevant processes. Theoretical integration of the myriad of novel findings is key to continued progress. In conclusion, research on specificity effects is at a very exciting point in time. We are confident that researchers will continue to accept the challenges associated with working toward a comprehensive approach to spoken‐word recognition. Lupyan (2015) noted: “If one spends a career studying something, they better come up with a good reason (if only for themselves) for its centrality” (p. 519). We hope we have convinced you (and not only ourselves) of the centrality of specificity effects in language research, including their promise for future discoveries.

Acknowledgments We thank Paul Luce and the editors of this book for feedback on earlier versions of this chapter. We also thank all members of the Language Research Laboratory for helpful discussions and contributions to many of the ideas and studies cited in this chapter.

NOTE 1 See David Pisoni’s video “Some neuromyths of cochlear implants” in the Psychology of Language Video Series (https://www.csuohio.edu/language).

REFERENCES Ambridge, B. (2019). Against stored abstractions: A radical exemplar model of language acquisition. First Language, 1–51. Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed‐effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412. Baese‐Berk, M. M., Bradlow, A. R., & Wright, B. A. (2013). Accent‐independent

adaptation to foreign accented speech. Journal of the Acoustical Society of America, 133, 174–180. Bent, T., & Holt, R. F. (2017). Representation of speech variability. Wiley Interdisciplinary Reviews: Cognitive Science, 8(4), e1434. Bjork R. A. (1994). Memory and metamemory considerations in the training of human beings. In J. Metcalfe & A. Shimamura (Eds.), Metacognition:

234  Perception of Linguistic Properties Knowing about knowing (pp. 185–205). Cambridge, MA: MIT Press. Bjork, R. A., & Kroll, J. F. (2015). Desirable difficulties in vocabulary learning. American Journal of Psychology, 128, 241–252. Brouwer, S., Mitterer, H., & Huettig, F. (2012). Speech reductions change the dynamics of competition during spoken word recognition. Language and Cognitive Processes, 27, 539–571. Cai, Z. G., Gilbert, R. A., Davis, M. H., et al. (2017). Accent modulates access to word meaning: Evidence for a speaker‐model account of spoken word recognition. Cognitive Psychology, 98, 73–101. Choi, J. Y., Hu, E. R., & Perrachione, T. K. (2018). Varying acoustic‐phonetic ambiguity reveals that talker normalization is obligatory in speech processing. Attention, Perception, & Psychophysics, 80, 784–797. Chun, M. M., & Phelps, E. A. (1999). Memory deficits for implicit contextual information in amnesic subjects with hippocampal damage. Nature Neuroscience, 2, 844–847. Clarke, C. M., & Garrett, M. F. (2004). Rapid adaptation to foreign‐accented English. Journal of the Acoustical Society of America, 116, 3647–3658. Clopper, C. G., & Bradlow, A. R. (2008). Perception of dialect variation in noise: Intelligibility and classification. Language and Speech, 51, 175–198. Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio‐visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 120, 2421–2424. Cooper, A., & Bradlow, A. R. (2017). Talker and background noise specificity in spoken word recognition memory. Laboratory Phonology, 8(1), 1–15. Cooper, A., Brouwer, S., & Bradlow, A. R. (2015). Interdependent processing and encoding of speech and concurrent

background noise. Attention, Perception, & Psychophysics, 77, 1342–1357. Creel, S. C. & Tumlin, M. A. (2011). On‐line acoustic and semantic interpretation of talker information. Journal of Memory and Language, 65, 264–285. Creel, S. C., Aslin, R. N., & Tanenhaus, M. K. (2008). Heeding the voice of experience: The role of talker variation in lexical access. Cognition, 106, 633–664. Dahan, D., & Magnuson, J. S. (2006). Spoken word recognition. In M. J. Traxler, M. A. Gernsbacher, & M. J. Cortese (Eds.), Handbook of psycholinguistics (Vol. 2, pp. 249–284). Burlington, MA: Academic Press. Dufour, S., Bolger, D., Massol, S., et al. (2017). On the locus of talker‐specificity effects in spoken word recognition: An ERP study with dichotic priming. Language, Cognition and Neuroscience, 32, 1273–1289. Dufour, S., & Nguyen, N. (2014). Access to talker‐specific representations is dependent on word frequency. Journal of Cognitive Psychology, 26, 256–262. Dufour, S., & Nguyen, N. (2017). Does talker‐specific information influence lexical competition? Evidence from phonological priming. Cognitive Science, 41, 2221–2233. Dumay, N., & Gaskell, M. G. (2007). Sleep‐associated changes in the mental representation of spoken words. Psychological Science, 18, 35–39. Ernestus, M., Baayen, H., & Schreuder, R. (2002). The recognition of reduced word forms. Brain and Language, 81, 162–173. Gahl, S., Yao, Y., & Johnson, K. (2012). Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. Journal of Memory and Language, 66, 789–806. Goh, W. D., Yap, M. J., Lau, M. C., et al. (2016). Semantic richness effects in spoken word recognition: A lexical decision and semantic categorization. Frontiers in Psychology, 7, 1–10.

Specificity Effects in Spoken-Word Recognition  235 Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1166–1183. González, J., & McLennan, C. T. (2007). Hemispheric differences in indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 33, 410–424. González, J., & McLennan, C. T. (2009). Hemispheric differences in the recognition of environmental sounds. Psychological Science, 20, 887–894. Goslin, J., Duffy, H., & Floccia, C. (2012). An ERP investigation of regional and foreign accent processing. Brain and Language, 122, 92–102. Grainger, J., Midgley, K., & Holcomb, P. J. (2010). Re‐thinking the bilingual interactive‐activation model from a developmental perspective (BIA‐d). In M. Kail & M. Hickmann (Eds.), Language acquisition and language disorders (Vol. 52, pp. 267–283). Amsterdam: John Benjamins. Hanique, I., Aalders, E., & Ernestus, M. (2013). How robust are exemplar effects in word comprehension? Mental Lexicon, 8, 269–294. Hannula, D. E., Ryan, J. D., & Warren, D. E. (2017). Beyond long‐term declarative memory: Evaluating hippocampal contributions to unconscious memory expression, perception, and short‐term retention. In D. E. Hannula & M. C. Duff (Eds.), The hippocampus from cells to systems: Structure, connectivity, and functional contributions to memory and flexible cognition (pp. 281–336). Cham, Switzerland: Springer. Heald, S. L., & Nusbaum, H. C. (2014). Talker variability in audio‐visual speech perception. Frontiers in Psychology, 5, 698–707. Heaton, H., & Nygaard, L. C. (2011). Charm or harm: Effect of passage content on

listener attitudes toward American English accents. Journal of Language and Social Psychology, 30, 202–211. Houston, D. M., & Jusczyk, P. W. (2003). Infants’ long‐term memory for the sound patterns of words and voices. Journal of Experimental Psychology: Human Perception and Performance, 29(6), 1143–1154. Incera, S. (2018). Measuring the timing of the bilingual advantage. Frontiers in Psychology, 9, 1–9. Incera, S., & McLennan, C. T. (2016). Mouse tracking reveals that bilinguals behave like experts. Bilingualism: Language and Cognition, 19, 610–620. Incera, S., Shah, A. P., McLennan, C. T., & Wetzel, M. T. (2017). Sentence context influences the subjective perception of foreign accents. Acta Psychologica, 172, 71–76. Johnson, K. (2004). Acoustic and auditory phonetics. Phonetica, 61, 56–58. Johnson, K. (2006). Resonance in an exemplar‐based lexicon: The emergence of social identity and phonology. Journal of Phonetics, 34, 485–499. Jusczyk, P. W., Pisoni, D. B., & Mullennix, J. (1992). Some consequences of stimulus variability on speech processing by 2‐month‐old infants. Cognition, 43, 253–291. Kapnoula, E. C., & McMurray, B. (2016). Newly learned word forms are abstract and integrated immediately after acquisition. Psychological Bulletin & Review, 23, 491–499. Kessler, Y., & Moscovitch, M. (2013). Strategic processing in long‐term repetition priming in the lexical decision task. Memory, 21, 366–376. Kisilevsky, B. S., Hains, S. M. J., Lee, K., et al. (2003). Effects of experiences on fetal voice recognition. Psychological Science, 14, 220–224. Krestar, M. L. (2014). Examining the effects of processing time, age, valence, and variation in emotional intonation on spoken word

236  Perception of Linguistic Properties recognition. Unpublished doctoral dissertation, Cleveland State University. Krestar, M. L., & McLennan, C. T. (2014). Examining the effects of variation in emotional tone of voice on spoken word recognition. Quarterly Journal of Experimental Psychology, 66, 1793–1802. Lachs, L., & Pisoni, D. B. (2004). Crossmodal source identification in speech perception. Ecological Psychology, 16, 159–187. Luce, P. A., & Lyons, E. A. (1998). Specificity of memory representations for spoken words. Memory & Cognition, 26, 708–715. Luce, P. A., & McLennan, C. T., (2005). The challenge of variation. In D. B. Pisoni & R. E. Remez (Eds.), Handbook of speech perception (pp. 591–609). Malden, MA: Blackwell. Luce, P. A., McLennan, C. T., & Charles‐ Luce, J. (2003). Abstractness and specificity in spoken word recognition: Indexical and allophonic variability in long‐term repetition priming. In J. Bowers & C. Marsolek (Eds.), Rethinking implicit memory (pp. 197–214). Oxford: Oxford University Press. Lupyan, G. (2015). The centrality of language in human cognition. Language Learning, 66, 516–553. Luthra, S., Fox, N. P., & Blumstein, S. E. (2018). Speaker information affects false recognition of unstudied lexical‐semantic associates. Attention, Perception, & Psychophysics, 80, 894–912. Maibauer, A. M., Markis, T. A., Newell, J., & McLennan, C. T. (2014). Famous talker effects in spoken word recognition. Attention, Perception, & Psychophysics, 76, 11–18. Marsolek, C. J. (2004). Abstractionist versus exemplar‐based theories of visual word priming: A subsystems resolution. Quarterly Journal of Experimental Psychology, 57, 1233–1259. Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review.

Language and Cognitive Processes, 27, 953–978. Mattys, S. L., & Liss, J. M. (2008). On building models of spoken‐word recognition: When there is as much to learn from natural “oddities” as artificial normality. Perception & Psychophysics, 70, 1235–1242. Maye, J., Aslin, R. N., & Tanenhaus, M. K. (2008). The Weckud Wetch of the Wast: Lexical adaptation to a novel accent. Cognitive Science, 32, 543–562. McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86. McLennan, C. T. (2006). The time course of variability effects in the perception of spoken language: Changes across the lifespan. Language and Speech, 49, 113–125. McLennan, C. T., & González, J. (2012). Examining talker effects in the perception of native‐ and foreign‐ accented speech. Attention, Perception, & Psychophysics, 74, 824–830. McLennan, C. T., & Luce, P. A. (2005). Examining the time course of indexical specificity effects in spoken word recognition. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 306–321. Mirman, D. (2014). Growth curve analysis and visualization using R. Boca Raton, FL: Taylor & Francis. Mirman, D. (2016). Zones of proximal development for models of spoken word recognition. In G. Gaskell & J. Mirković (Eds.), Speech perception and spoken word recognition (pp. 97–115). London: Psychology Press. Mirman, D., & Magnuson, J. S. (2009). Dynamics of activation of semantically similar concepts during spoken word recognition. Memory & Cognition, 37, 1026–1039. Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition.

Specificity Effects in Spoken-Word Recognition  237 Journal of the Acoustical Society of America, 85, 365–378. Mullennix, J. W., Stern, S. E., Grounds, B., & Kalas, R. (2010). Earwitness memory: Distortions for voice pitch and speaking rate. Applied Cognitive Psychology, 24, 513–526. Neuhoff, J. G., Schott, S. A., Kropf, A. J., & Neuhoff, E. M. (2014). Familiarity, expertise, and change detection: Change deafness is worse in your native language. Perception, 43, 219–222. Papesh, M. H., Goldinger, S. D., & Hout, M. C. (2016). Eye movements reveal fast, voice‐specific priming. Journal of Experimental Psychology: General, 145, 314–337. Paxton, A., Rodriguez, K., & Dale, R. (2015). PsyGlass: Capitalizing on Google Glass for naturalistic data collection. Behavior Research Methods, 47(3), 608–619. Peters, R. W. (1955). The relative intelligibility of single‐voice and multiple‐voice messages under various conditions of noise. In Joint Project Report, 56 (pp. 1–9). Pensacola, FL: US Naval School of Aviation Medicine. Peterson, N. R., Pisoni, D. B., & Miyamoto, R. T. (2010). Cochlear implants and spoken language processing abilities: Review and assessment of the literature. Restorative Neurology and Neuroscience, 28, 237–250. Pisoni, D. B. (1992). Talker normalization in speech perception. In Y. Tohkura, E. Vatikiotis‐Bateson, & Y. Sagisaka (Eds.), Speech perception, production and linguistic structure (pp. 143–151). Amsterdam: IOS Press. Pisoni, D. B., & Levi, S. (2007). Some observations on representations and representational specificity in speech perception and spoken word recognition. In G. Gaskell (Ed.), The Oxford handbook of psycholinguistics (pp. 3–18). Oxford: Oxford University Press. Pisoni, D. B., & Luce, P. A. (1986). Speech perception: Research, theory, and the principal issues. In E. C. Schwab & H. C.

Nusbaum (Eds.), Pattern recognition by humans and machines: Vol. 1, Speech perception (pp. 1–50). Orlando, FL: Academic Press. Pitt, M. A., Johnson, K., Hume, E., et al. (2005). The Buckeye corpus of conversational speech: Labeling conventions and a test of transcriber reliability. Speech Communication, 45, 89–95. Pufahl, A., & Samuel, A. G. (2014). How lexical is the lexicon? Evidence for integrated auditory memory representations. Cognitive Psychology, 70, 1–30. Ramos, T., Marques, J., & Garcia‐Marques, L. (2017). The memory of what we do not recall: Dissociations and theoretical debates in the study of implicit memory. Psicológica, 38, 365–393. Schacter, D. L., Church, B., & Bolton, E. (1995). Implicit memory in amnesic patients: Impairment of voice‐specific priming. Psychological Science, 6, 20–25. Schacter, D. L., Dobbins, I. G., & Schnyer, D. M. (2004). Specificity of priming: A cognitive neuroscience perspective. Nature Reviews Neuroscience, 5, 853–862. Sharma, N. K., Ganesh, S., Ganapathy, S., & Holt, L. L. (2019). Talker change detection: A comparison of human and machine performance. Journal of the Acoustical Society of America, 145, 131–142. Sommers, M. S. (1996). The structural organization of the mental lexicon and its contribution to age‐related declines in spoken‐word recognition. Psychology and Aging, 11, 333–341. Sommers, M. S., & Barcroft, J. (2006). Stimulus variability and the phonetic relevance hypothesis: Effects of variability in speaking style, fundamental frequency, and speaking rate on spoken word identification. Journal of the Acoustical Society of America, 119, 2406–2416. Strori, D. (2016). Specificity effects in spoken word recognition and the nature of lexical representations in memory.

238  Perception of Linguistic Properties Unpublished doctoral dissertation, University of York. Strori, D., Zaar, J. D., Cooke, M., & Mattys, S. L. (2018). Sound specificity effects in spoken word recognition: The effect of integrality between words and sounds. Attention, Perception, & Psychophysics, 80, 222–241. Sumner, M. (2015). The social weight of spoken words. Trends in Cognitive Sciences, 19, 238–239. Sumner, M., Kim, S. K., King, E., & McGowan, K. B. (2014). The socially weighted encoding of spoken words: A dual‐route approach to speech perception. Frontiers in Psychology, 4, 1–13. Theodore, R. M., Blumstein, S. E., & Luthra, S. (2015). Attention modulates specificity effects in spoken word recognition: Challenges to the time‐course hypothesis. Attention, Perception, & Psychophysics, 77, 1674–1684. Titone, D., Gullifer, J., Subramaniapillai, S., et al. (2017). History‐inspired reflections on the bilingual advantages hypothesis. In M. Sullivan & E. Bialystok (Eds.), Growing old with two languages: Effects of bilingualism on cognitive aging. Amsterdam: John Benjamins. Tucker, B. V., & Ernestus, M. (2016). Why we need to investigate casual speech to truly understand language production, processing and the mental lexicon. Mental Lexicon, 11, 375–400. Tuft, S. E., McLennan, C. T., & Krestar, M. L. (2016). Hearing taboo words can result in

early talker effects in word recognition for female listeners. Quarterly Journal of Experimental Psychology, 71(2), 435–448. Vitevitch, M. S., & Donoso, A. (2011). Processing of indexical information requires time: Evidence from change deafness. Quarterly Journal of Experimental Psychology, 64, 1484–1493. Viebahn, M. C., Ernestus, M., & McQueen, J. M. (2017). Speaking style influences the brain’s electrophysiological response to grammatical errors in speech comprehension. Journal of Cognitive Neuroscience, 29, 1132–1146. Vitevitch, M. S., & Luce, P. A. (2016). Phonological neighborhood effects in spoken word perception and production. Annual Review of Linguistics, 2, 75–94. Viebahn, M. C., & Luce, P. A. (2018). Increased exposure and phonetic context help listeners recognize allophonic variants. Attention, Perception, & Psychophysics, 80, 1539–1558. Xie, X., & Myers, E. (2015). The impact of musical training and tone language experience on talker identification. Journal of the Acoustical Society of America, 137, 419–432. Yi, H., Phelps, J. E. B., Smiljanic, R., & Chandrasekaran, B. (2013). Reduced efficiency of audiovisual integration for nonnative speech. Journal of the Acoustical Society of America, 134(5), 387–393. Zetzer, E. E. (2016). Examining whether instrument changes affect song recognition the way talker changes affect word recognition. Unpublished master’s thesis, Cleveland State University.

9 Word Stress in Speech Perception

ANNE CUTLER¹ AND ALEXANDRA JESSE²

¹ Western Sydney University, Australia
² University of Massachusetts Amherst, United States

Stress denotes greater relative salience of some linguistic elements compared with others, within larger units of speech. The term can be applied to different speech domains. Thus some words are stressed more than others within a sentence, and some syllables are stressed more than others within words. What controls the posi­ tioning of stress is not random; at the sentence level, it is largely determined by the relative importance of sentence components for the ongoing discourse (information structure), and at the lexical level it is fully determined by word phonology. While information structure can reasonably be held to originate outside the specifically linguistic domain and hence to have claim to universality, word phonology is highly language specific, and not all languages have word‐level stress. In languages that do have word stress, segmentally matched stressed and unstressed syllables (such as the first syllables of English camper and campaign) generally differ in multiple acoustic dimensions. However, the consequences of stress placement for speech perception are not simply a function of these acoustic variations. The next section of this chapter, “Lexical stress and the vocabulary,” shows that the phonology of the language in question, and the resulting vocabu­ lary structure, determine what use is made of stress‐related acoustic information. If word stress varies, cues to stress location can help identify spoken words and can modulate the activation and competition processes involved in this; but as demonstrated in the following section, “Spoken‐word identification,” even related and in principle quite similar languages can vary greatly in how such cues are exploited, with the underlying driver of cue use being, indeed, the language‐ specific vocabulary patterns. In “New horizons for stress in speech perception,” the currently very active state of the field of word‐stress perception research is illustrated by innovative data from multisensory perception and from perception


of degraded speech. The chapter concludes with a summary and the prediction that this extensive activity, spanning not only new techniques but also many language groups, will produce substantial and detailed new knowledge on the role of word stress in speech perception.

Among the languages with word‐level stress, some allow stress placement to vary within the word while others do not. Languages where stress placement in words can vary are said to have "lexical stress." These languages in principle allow the relative stress level of syllables to distinguish otherwise (i.e. segmentally) identical word forms. Languages where stress placement cannot vary within the word ("fixed‐stress" languages) obviously preclude such a lexically contrastive function for stress. In the latter class of languages, note that, while stress always falls in the same place in the word, that place itself is language‐specifically defined: the initial syllable in Finnish, the final syllable in Turkish, the antepenultimate syllable in Macedonian, and so on. Thus there is no universal pattern either for the appearance of word stress (some languages don't have it at all), or for its realization when it does appear (it can be fixed or it can vary), and, importantly, there can therefore be no universal pattern for its role in speech perception (only in some languages can it be contrastive). The story that this chapter has to tell, in other words, is at its core one of language specificity.

Lexical stress is the variety of stress that English has, like its West Germanic language relatives; the kinds of minimal pairs this allows include noun–verb distinctions such as PERvert (noun; capital letters indicate primary stress) versus perVERT (verb), but also word pairs unrelated in meaning, such as FOREgoing and forGOing. In all such cases, the segments are all the same and only the stress placement differs. In lexical‐stress languages, the stress pattern of every polysyllabic word is lexically determined, that is, is part of the phonological representation of how speakers ought to produce the word. In fact, minimal whole‐word pairs are not numerous in any lexical‐stress language. Far more commonly, speech perception involves minimally paired initial syllables that differ in stress, such as the first syllables of the English words CAMper versus camPAIGN. Such segmentally identical syllables, one stressed and the other not, differ in how they are uttered. The difference in lexical stress is realized in several acoustic dimensions, as can be seen in Figure 9.1. This figure shows waveforms and spectrograms for a male speaker of American English saying the pervert pair in the context Say the word pervert again. The top three panels show the verb reading perVERT, the lower three the noun reading PERvert. Although the syllables have the same segmental structure in each member of the pair, including full vowels in each syllable, the acoustic realization can be seen to be clearly different in the three suprasegmental dimensions of duration, intensity, and fundamental frequency (F0). For each syllable, the stressed version is longer and louder (most immediately visible in the waveform), and presents more F0 movement (see the spectrograms).
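Where recordings of this kind are available, the three suprasegmental dimensions can be quantified directly from the signal. The sketch below (in Python, using the librosa library) is only a minimal illustration of one way to do so; the file name, the syllable time stamps, and the pitch search range are hypothetical placeholders, not the procedure used to produce Figure 9.1.

# Minimal sketch: quantifying duration, intensity, and F0 movement for two
# segmentally matched syllables (e.g. a stressed and an unstressed "per-").
# The file name, syllable time stamps, and pitch range below are hypothetical.
import numpy as np
import librosa

y, sr = librosa.load("pervert_tokens.wav", sr=None)   # hypothetical recording

def syllable_measures(start_s, end_s):
    # Duration (s), mean RMS amplitude, and F0 range (Hz) for one interval.
    seg = y[int(start_s * sr):int(end_s * sr)]
    rms = float(librosa.feature.rms(y=seg).mean())
    f0, _, _ = librosa.pyin(seg, fmin=60, fmax=400, sr=sr)
    voiced = f0[~np.isnan(f0)]
    f0_range = float(voiced.max() - voiced.min()) if voiced.size else 0.0
    return {"duration_s": end_s - start_s, "rms": rms, "f0_range_hz": f0_range}

# Hypothetical intervals for a stressed (PER-) and an unstressed (per-) syllable.
print("stressed:  ", syllable_measures(0.50, 0.78))
print("unstressed:", syllable_measures(1.90, 2.08))
# On the description above, the stressed syllable should show the longer
# duration, the higher RMS value, and the larger F0 range.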
In the early days of systematic research on speech perception (when most phonetic research was done in countries with a West Germanic language), stress perception was a topic, with the consensus opinion converging on the conclusion that all of the above cues were used, separately or in combination. Searches for a

[Figure 9.1 appears here: panels (a) perVERT and (b) PERvert, with axes Frequency (kHz, 0–4) and Time (s, 0–1.4).]

Figure 9.1  Sound spectrograms of the words perVERT (a) and PERvert (b) in the carrier sentence Say the word . . . again, recorded by a male speaker of American English. Each figure consists of three display panels: (top) a broad‐band spectrogram; (middle) a waveform display; and (bottom) a narrow‐band spectrogram. Vertical lines indicate onset and offset of pervert. The figure is modeled on a figure first presented by Lehiste and Peterson (1959, p. 434).

242  Perception of Linguistic Properties single unifying factor in the perception of stress, such as articulatory or perceptual effort, were unsuccessful (more detail of this early work can be found in Cutler, 2005). Extending the stress investigations to other languages was critical in showing that the phonological system of a language determines how stress is real­ ized and hence perceived. For example, in some tone languages F0 may not distin­ guish stress as it is reserved to convey tone (see Potisuk, Gandour, & Harper, 1996, for Thai), and in languages with vowel quantity distinctions, duration may be likewise reserved to convey vowel identity and hence be unavailable as a stress cue (see Berinstein, 1979, for Mayan languages). Note also that not only the supra­ segmental dimensions realize lexical stress, as segmental structure can also vary systematically with stress; in English, for instance, any vowel can be either stressed or not except for the central vowel schwa which is necessarily unstressed. Again, this is dependent on language‐specific phonology, as lexical‐stress languages without such a vowel in their phonemic repertoire will realize stress only suprasegmentally (e.g. Spanish), while in languages with schwa that vowel may be associated with unstressed syllables only (as in English or German), or may be a stressable vowel like any other (as in Welsh; Williams, 1985). In fixed‐stress languages, there is sometimes little measurable acoustic difference between syllables in the stressed position and segmentally identical syllables that occur in another position in the word (Suomi, Toivanen, & Ylitalo, 2003, for Finnish; Dogil, 1999, for Polish, which has fixed penultimate stress). This does not exclude a speech‐perception function for fixed stress, however; obvious options remain. Where the designated stress placement is at a word’s edge, for instance, it may appear to be useful for identifying word boundaries in running speech. Since that would require syllables bearing fixed stress to be reliably distinguished from those without stress (i.e. the stress to be realized acoustically in a consistent manner), the acoustic evidence argues against it. The designated “stressed” syllables are, how­ ever, the location for sentence‐level accents that fall upon the word in question, and when these sentence‐level accents are realized, the acoustic effects that this brings about will be informative in sentence‐level processing. It is for word‐level processing that fixed stress offers little support. In fact, when speakers of fixed‐stress languages are asked to perform tasks involving word‐level stress perception, they exhibit in many cases what has been called “stress deafness” (although the term is inaccurate; see the section on “Spoken‐word identification”). As was first demonstrated by Dupoux et al. (1997) for French, in which clitic phrases are accented on a fixed location (the final syl­ lable), such listeners often err on decisions about stress placement in heard nonsense material. For instance, they have high error rates in an ABX task in which a third token must be matched to one of two preceding tokens (e.g. bopeLO boPElo bopeLO). 
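To make the structure of such an ABX trial concrete, the toy sketch below codes each nonsense token as a sequence of (syllable, stressed) pairs and scores the probe X against A and B. The coding scheme is our own illustrative shorthand, not the stimulus set of Dupoux et al. (1997).

# Illustrative ABX trial: X must be matched to A or B; when A and B differ only
# in stress placement, listeners showing "stress deafness" make frequent errors.
A = [("bo", False), ("pe", False), ("lo", True)]   # bopeLO
B = [("bo", False), ("pe", True), ("lo", False)]   # boPElo
X = [("bo", False), ("pe", False), ("lo", True)]   # bopeLO

def abx_answer(a, b, x):
    # Return the token that the probe matches on both segments and stress.
    if x == a:
        return "A"
    if x == b:
        return "B"
    return None

print(abx_answer(A, B, X))  # "A" is the correct response for this trial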
Further work by Dupoux and colleagues (Dupoux & Peperkamp, 2002; Dupoux, Peperkamp, & Sebastián‐Gallés, 2001; Peperkamp, Vendelin, & Dupoux, 2010) showed the same contrasts to be predictably easy for speakers of Spanish (a lexical‐stress language) but just as hard as in French for speakers of the fixed‐stress language Finnish, and almost as hard for speakers of Polish and Hungarian, also with fixed stress. Dupoux and Peperkamp accounted for the gradations of

Lexical Stress  243 difference across the fixed‐stress groups in terms of language‐specific phonology and its learnability (Polish and Hungarian accentual rules affect different word classes in different ways, so their stress systems will be harder to learn than the French or Finnish systems, and this greater learning challenge leaves a trace of stress cue sensitivity in adulthood, detectable in the perception experiments). It is clear that language specificity characterizes speech perception at the word level. What is involved in recognizing spoken words in everyday conversations is twofold: segmentation (identifying where each word begins, given that speech is continuous and word boundaries are not reliably marked), and lexical access (identifying the word from all the other word forms it might be, including of course identifying where it ends). For stress to play a role, acoustic cues must in any case be present. If that is the case, then segmentation can involve stress in its demarcative role (Cutler & Carter, 1987; Tyler & Cutler, 2009). Identification con­ sists, critically, in the rejection of alternative word candidates. Recognition of a spoken word therefore depends to a significant degree on how many other words are like it, in particular its onset overlap competitors (e.g. camp and campanology for campaign) but also its embeddings (am and pain in campaign). The term “compet­ itor” here suggests that candidate words compatible with some portion of the incoming speech signal actively compete with one another, and its use in this con­ text has arisen from findings showing inhibitory effects on losing competitors, as described in the section on “Spoken‐word identification.” There, we address the question of exactly how lexical stress is processed in spoken‐word recognition: does camper compete with campaign until the second syllables become segmentally unique, or does the stress difference in the initial syllable result in one alternative being ruled out before the incoming second syllable is heard? Given the competitive nature of the lexical identification process, it would seem rational to assume that if the speech signal contains information that would con­ tribute to the identification of the signal content, listeners would always make use of it. When acoustic cues to stress are available only as a function of higher‐level structure, as in the fixed‐stress case, the cues may be considered unreliable; but, for lexical stress, cues should be consistently available. Note, however, that in other speech‐perception cases listeners do not always exploit all of the information available in speech. Even at the phoneme level, a contrast that is present (with the same acoustic effects) in many languages is not always resolved in the same way. A telling example concerns the two fricatives [f] and [s], contained in the phoneme repertoires of many languages; the mechanics of speech ensure that comparable local cues will be available whenever these sounds are uttered. However, listeners use such cues only in identifying [f] or [s] when the sound in question has a highly similar competitor in the language’s repertoire (thus, only for identifying [f] in English and Castilian Spanish, because each of these languages has [θ] among its phonemes as well, and [θ] is highly confusable with [f]; but, in contrast, only for identifying [s] in Polish because Polish has [ɕ], [ʂ], and other sibilants: Wagner, 2013; Wagner, Ernestus, & Cutler, 2006). 
That is, cues may be consistently present, but language‐specific phonological structure leads listeners to make use of them in some languages but to ignore them in others.

In the section on "Lexical stress and the vocabulary" we consider issues of vocabulary structure that could influence the perceptual relevance of lexical stress in a similar way. Fixed‐stress languages clearly vary in the extent to which their phonology and vocabulary encourage any perceptual role for word stress (e.g. as described above for Polish, Hungarian, French, and Finnish). Do lexical‐stress languages vary likewise? Are there then also implications for the task in which lexical stress (but not fixed stress) can play a direct role, namely spoken‐word recognition?

Lexical stress and the vocabulary
Every language has a store of sound‐to‐meaning pairings, that is, a vocabulary full of words, even though there are huge differences in the ways in which those stored elements map to parts of the speech signal (Fortescue, Mithun, & Evans, 2017). Among the dimensions along which vocabularies differ is the potential for interword competition in speech perception, and this is particularly determined by the size of the language's phonemic repertoire, and the constraints that the phonology applies to syllable structure. Each of these obviously affects the shape of the words that make up the vocabulary stock. Syllable structure includes whether and in which position consonant clusters can occur; a language in which syllables can vary from oh and go to screech and plunged will clearly allow more monosyllabic possibilities than a language in which only consonant–vowel (CV) and/or CVC syllables are legal. Segmental repertoire size is important in the same way, in that the more phonemes a language has, the more different short words its vocabulary can hold. The phonemic and syllabic properties of a language's vocabulary thus have direct implications for speech perception.
Consider first the word‐length issue: shorter average word length in languages with more complex syllables and more phonemes, and longer average length in languages with less complex syllables and fewer distinct phonemes. This asymmetry certainly holds across the vocabularies of British English (45 phonemes) and Spanish (25 phonemes), for example; the mean word length in phonemes in English is 6.94, in Spanish 8.3, with length in syllables being respectively 2.72 and 3.48 (Cutler, Norris, & Sebastián‐Gallés, 2004). Segmentally alone, there will be less embedding of short words within longer ones in the languages with larger phoneme repertoires and on average shorter words. Tallying word embedding and overlap in the vocabulary provides an estimate of the strength of competition from potentially coactivated candidate words in speech recognition, that is, spurious lexical competitors which can be activated by speech signals and impact upon spoken‐word recognition. Indeed, the Spanish–English comparison by Cutler et al. (2004) showed this asymmetry too, with Spanish having more than twice the embedding rate of English.
The embedding rate can change if stress placement is considered as well. The potential for lexical stress to play a part in spoken‐word identification can then be estimated by comparing the number of coactivated candidate words (effectively, the amount of competition) using the tallying just described while (1) taking only
segmental structure into account, or (2) effectively including suprasegmental structure as well. Since these vocabulary statistics are computed using phonetic transcriptions, which represent segmental strings, adding this further dimension is realized by including the location of primary stress in each word's transcription, and ruling out any embeddings where syllables mismatch on this factor. On segmental match only, enterprise contains the shorter words enter and prize, and settee has set and tea. But if we require primary stress location to match also, ENterprise contains only enter, and setTEE contains only tea (neither set‐ nor ‐prise have primary stress in the longer words). Including the frequency of each carrier word as a weighting factor to estimate actual occurrence in speech experience, Cutler et al. (2004) found the difference between the two tallies to be much larger in Spanish (2.32 vs. 0.73) than in English (0.94 vs. 0.59). Thus, considering stress reduces competition in Spanish by more than two thirds, but in English by only about a third. Perhaps more importantly, on average a word will likely activate just a single competitor in English whether or not stress placement is taken into account, but in Spanish consideration of stress placement reduces more than two competitors to one competitor at most – a quantum improvement.
Interestingly, closely related languages such as English, Dutch, and German, all from the West Germanic family, can also differ on such a metric. The figures for these languages show that Dutch and German have more embeddings than English (Cutler & Pasveer, 2006). Examples of embedding in Dutch are OUderdom "old age," which contains ouder "parent" and dom "stupid," and in German SAUna "sauna," which contains Sau "sow" and nah "near." The effect of taking stress into account in identifying words, estimated in the same way as for English and Spanish, reveals that the amount of competition from embedding reduces by more than 50 percent for both Dutch and German if stress match is included in these computations. With carrier‐word frequency taken into consideration, a Dutch segments‐only count then gives 1.52 competitors per word of speech on average, and a segments‐plus‐stress count reduces this to 0.74; for German, the respective figures are 1.72 and 0.80. This is again a quantum improvement (from more than one to less than one competitor) for each language, which English cannot match because its embedding count was below 1 even without taking stress into account.
The differences have implications for word recognition in these languages. For instance, the fact that taking stress marking into account produced no substantial improvement in English could imply that English listeners actually need to attend only to words' segments in computing mismatch to competitors; vowels and consonants reduce competition sufficiently for optimal or near‐optimal recognition. Adding the use of suprasegmental cues would thus not reduce competition to an extent that would be worth the effort. This suggestion assigns a vital role to the segmental correlate of stress, the central vowel schwa. In languages where schwa cannot be stressed, the presence of schwa effectively signals lack of stress without the need of another cue dimension. The relative rarity of embeddings in English compared with Dutch and German would reflect the greater likelihood of an English unstressed syllable containing schwa rather than a full vowel.
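To make the tallying procedure concrete, the short sketch below (in Python) computes the two counts just described for a toy lexicon. It is purely illustrative: the syllabifications, stress indices, and frequencies are invented, and the simple requirement that an embedded word's primary‐stressed syllable line up with the carrier's primary stress is an assumption standing in for the full stress‐match criterion; it is not the actual corpus‐based procedure used by Cutler et al. (2004).

from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    form: str
    syllables: tuple   # phonemic syllables, e.g. ('en', 'ter', 'prize')
    stress: int        # index of the primary-stressed syllable
    frequency: int     # usage frequency, used as the weighting factor

def embedded_competitors(carrier, lexicon, use_stress):
    """Count shorter lexicon words embedded in the carrier; if use_stress is
    True, rule out embeddings whose primary stress does not line up with the
    carrier's primary stress."""
    count = 0
    for cand in lexicon:
        if cand is carrier or len(cand.syllables) >= len(carrier.syllables):
            continue
        for offset in range(len(carrier.syllables) - len(cand.syllables) + 1):
            if carrier.syllables[offset:offset + len(cand.syllables)] != cand.syllables:
                continue
            if use_stress and offset + cand.stress != carrier.stress:
                continue   # stress-mismatching embedding: ruled out
            count += 1
    return count

def weighted_mean_competitors(lexicon, use_stress):
    """Frequency-weighted mean number of embedded competitors per carrier word."""
    weighted = sum(embedded_competitors(w, lexicon, use_stress) * w.frequency
                   for w in lexicon)
    return weighted / sum(w.frequency for w in lexicon)

# Toy lexicon built around the examples in the text; frequencies are invented.
lexicon = [
    Word('enterprise', ('en', 'ter', 'prize'), 0, 50),
    Word('enter',      ('en', 'ter'),          0, 200),
    Word('prize',      ('prize',),             0, 100),
    Word('settee',     ('set', 'tee'),         1, 10),
    Word('set',        ('set',),               0, 300),
    Word('tea',        ('tee',),               0, 150),
]
print(weighted_mean_competitors(lexicon, use_stress=False))  # segments only: ~0.15
print(weighted_mean_competitors(lexicon, use_stress=True))   # segments + stress: ~0.07

On this toy count, requiring the stress match removes prize from ENterprise and set from setTEE, mirroring the example above; run over a real dictionary with real frequency counts, the same computation yields the language‐level figures reported in the text.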
Can the vocabulary provide specific evidence on this proposal?

246  Perception of Linguistic Properties A comparison of English and Dutch shows that both languages have extensive affixation, with the affixes (especially inflectional suffixes, but also prefixes) being typically realized by weak syllables containing schwa in each language. Over the entire vocabulary, morphological differences between the two languages actually tend to produce more Dutch than English syllables with schwa (in both initial and final position). Neither in initial nor in final syllables do the vocabulary counts show greater schwa frequencies for English. Strong effects of schwa are found instead in nonaffixed syllables, which can occur in word‐medial positions. Comparisons within morphological families show less phonetic overlap in English  –  in admire admiration, gratitude gratuity, legal legality, and so on, stress alternation brings different vowel realization with it, so that the initial syllables do not compete; specifically, the position of schwa switches between the two pair members. There is more phonetic overlap in Dutch: pairs such as legaal legaliteit “legal legality,” glorie glorieus “glory glorious,” and definitie definitief “definition definitive” all share the segments of the initial bisyllables despite having differ­ ently placed primary stress. Thus syllables without primary stress more often con­ tain schwa in English than is the case for Dutch, and unstressed syllables with full vowels are more common in Dutch than in English. The vocabularies reveal different, indeed opposing, patterns; taking into account all words of three or more syllables, Dutch has more full vowels in medial syllables without stress, while English has more schwa. Despite the lack of overlap in morphologically related pairs, English has more overlap in general. A tally of the proportion of words in the vocabulary that share an initial bisyllabic string with one or more other words reveals a significantly larger figure for English than for Dutch (Bruggeman & Cutler, 2016; consider that, for example, the initial CVCV sequence of the English words coral, correlate, corridor, coroner, corrugated and coryphée, although spelled differently in each word, is in each case phonetically identical, the constant second V being schwa). The effect of such greater overlap is to increase competition, of a kind that taking stress into account is not going to help at all. Finally, the position of embedded words within their carriers also varies across vocabularies. In the Germanic languages, embedded words in carrier‐initial posi­ tion way outnumber embeddings in final position. A combination of suffixal mor­ phology and vowel reduction in unstressed syllables causes this skew, which therefore hardly exists in some other languages. Consider Japanese, which, in con­ trast to English, Dutch, and German, is a noninflectional language, and has neither stress nor vowel reduction; it has no significant asymmetry of embedding posi­ tion. Because both Dutch and German have more inflectional suffixes (on verbs and as noun plurals) than English, the tendency toward more initial than final embeddings, which is already larger in English than in Japanese, is significantly larger again in Dutch and German. The added fact that German has very many monomorphemic words ending in schwa finally boosts German to the largest asymmetry in the set (Cutler, Otake, & Bruggeman, 2012). 
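The initial‐ versus final‐position tally can be sketched in the same spirit. Again this is a toy illustration with invented syllabifications, not the actual counts of Cutler, Otake, and Bruggeman (2012), which were frequency‐weighted and computed over full dictionaries.

def positional_embeddings(lexicon):
    """lexicon: dict mapping each word to its tuple of phonemic syllables.
    Returns (initial, final): how many embedded words match a longer
    carrier's first syllables versus its last syllables."""
    initial = final = 0
    for carrier, c_syls in lexicon.items():
        for word, w_syls in lexicon.items():
            if word == carrier or len(w_syls) >= len(c_syls):
                continue
            if c_syls[:len(w_syls)] == w_syls:
                initial += 1
            if c_syls[-len(w_syls):] == w_syls:
                final += 1
    return initial, final

# The Dutch example from the text: OUderdom embeds ouder initially and dom finally.
toy_dutch = {
    'ouderdom': ('ou', 'der', 'dom'),
    'ouder':    ('ou', 'der'),
    'dom':      ('dom',),
}
print(positional_embeddings(toy_dutch))   # (1, 1) for this three-word lexicon

Applied to a full vocabulary, this is the kind of count that reveals the skew toward carrier‐initial embeddings described above.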
These comparisons might suggest that both morphology and stress phonology are necessary to explain the initial‐final embedding asymmetry, but in principle

Lexical Stress  247 either could actually produce it alone. Evidence from the vocabulary of French offered a further explanation here; French has suffixes aplenty in its morphology and schwa in its phoneme repertoire, but no lexical stress in its phonology. Comparison of French with the other vocabularies (Cutler & Bruggeman,  2013) revealed that the suffix effect alone, present in French, accounted for a positional asymmetry roughly half the size of that in English. Then, if the lexicon of French were modified to allow realization (as schwa) of the “silent” vowel in words such as petite “small” and ville “town,” the asymmetry roughly equaled that of English. Thus morphology on the one hand, and segmentally realized stress on the other, each separately and additively contribute to this influence on the availability of competing words during speech processing. It is clear, therefore, that vocabulary analyses such as those reviewed in this section predict differences across lexical‐stress languages (even among those most closely related) in the degree to which the processing of suprasegmental information leads to a worthwhile payoff in the efficiency of spoken‐word recog­ nition. The next section reviews the findings relevant to this proposal.

Spoken‐word identification The vocabulary statistics are based on measures of overlap, on the assumption that interword competition is the primary testbed for whether a particular factor will play a useful role in the identification of spoken words. Given that these measures differ across lexical‐stress languages when suprasegmental cues are considered, the statistics then predict cross‐language differences in whether suprasegmental cues will actually prove useful to listeners. Evidence for the importance of the competition factor on which the statistics are based is firmly established. Competitor words, temporarily supported by an incoming speech signal, are effortlessly discarded by listeners as mismatching speech information becomes available, but traces of their fleeting presence are reli­ ably seen in psycholinguistic experiments. For instance, in the cross‐modal priming task, where listeners make yes–no lexical decisions about printed words while hearing speech, words are recognized more slowly when they partially match the auditory input than when the auditory input is totally unrelated (e.g. responses to printed feel are slower after spoken feed than after spoken name; Marslen‐ Wilson, 1990), and in the word‐spotting task, finding real words in nonsense car­ riers is harder if the carrier could become another real word than if it could not (so WRECK is easier to find in berrec than in correc, which could be the beginning of correction; McQueen, Norris, & Cutler, 1994). This response inhibition in each case is evidence of the temporary availability but later rejection of a different interpre­ tation of the speech input. Candidate words that have been subject to competition are momentarily less available to the recognition process. Initial competition has greater effects than final competition on the speed and accuracy of word recogni­ tion too; this asymmetry between embeddings in word‐initial versus word‐final position has been supported in many spoken‐word recognition studies in English

248  Perception of Linguistic Properties (Allopenna, Magnuson, & Tanenhaus, 1998; Cluff & Luce,  1990; McQueen & Viebahn, 2007). Exactly as the statistics predict, the relative usefulness or otherwise of supraseg­ mental cues to stress differs across languages. In Dutch, listeners’ guesses, given increasingly larger fragments of pairs such as CAvia kaviAAR “guinea pig caviar” in a gating task, were clearly shown to draw on suprasegmental as well as on seg­ mental information (van Heuven,  1988). Mis‐stressing of Dutch words likewise exercised adverse effects on recognition both in gating (van Leyden & van Heuven, 1996) and in semantic judgment tasks (Koster & Cutler, 1997). Single syl­ lables differing only in stress could be assigned to the appropriate source word by Dutch listeners (e.g. the above CA‐/ka‐ pair; Cutler & van Donselaar,  2001; Jongenburger, 1996) and also by German listeners (e.g. AR‐/ar‐ from ARche ArCHIV “ark archive”; Yu, Mailhammer & Cutler, 2020). Spoken‐word recognition studies in English, in contrast, have repeatedly failed to find equivalent effects. Studies in English showed that mis‐stressing of stress pairs did not at all affect their recogni­ tion (Small, Simon, & Goldberg, 1988), that mis‐stressing also did not affect word recognition in noise (Slowiaczek, 1990); that minimal stress pairs such as TRUSty and trusTEE activate both associated meanings just as homophones would (Cutler, 1986), that stress pattern prompts were ignored in word‐matching judg­ ments (Slowiaczek,  1991), and that cross‐spliced words in which primary‐ and secondary‐stressed syllables were swapped were rated as just as natural as the original unmodified versions (Fear, Cutler, & Butterfield, 1995). All these findings could be said to give evidence of the low payoff provided by the English vocabu­ lary for the use of suprasegmental as well as segmental information in identifying English words, while the Dutch results confirmed the greater utility of paying attention to both cue types in that language. Note that these results largely made use of “offline” tasks, that is, they did not tap word‐processing speed, but only whether or not the outcome was correct. Cross‐modal priming studies that measure word‐recognition speed, however, likewise revealed language‐specific differences in the use of cues to stress in the process of recognizing spoken words. As described earlier, the cross‐modal priming task critically allows a view of competition, via an inhibitory effect on the recognition of a constant target when competition is present versus when it is absent. Evidence from a series of studies using cross‐modal fragment priming (where the primes were fragments of spoken words) indeed indicated that supra­ segmental information about lexical stress could modulate lexical processing. In these studies, respectively in Spanish (Soto‐Faraco, Sebastián‐Gallés, & Cutler, 2001), Dutch (van Donselaar, Koster, & Cutler, 2005), and English (Cooper, Cutler, & Wales, 2002), listeners heard a fragment prime taken from the onset of longer word pairs in their respective native language, before performing a lexical decision on a printed target. A study in German (Friedrich et al., 2004) combined this cross‐ modal priming technique with a record of evoked response potentials (ERPs) in the brain. In all these studies, the fragment in matching conditions consisted of the initial portion of the target word. That is, two‐syllable fragments had the same segmental

Lexical Stress  249 composition and stress pattern as the first two syllables of the target word (e.g. in the case of English, ADmi‐ was a prime for ADmiral). Listeners’ lexical decisions were faster in this matching condition, compared in the Dutch, Spanish and English studies to a control condition with unrelated primes such as immer‐ from immersion, and in the German study, which had no control‐prime condition, to the mismatching‐prime condition. Facilitatory priming was also found when primes were monosyllabic (Cooper, Cutler, & Wales, 2002; Friedrich et  al.,  2004; van Donselaar, Koster, & Cutler,  2005). In the German ERP study, mismatching fragments of any size further induced a positive response approximately 350 mil­ liseconds following the onset of target presentation (a P350) which was held to be a signal of the detection of lexical incongruency. Such facilitation over a control is expected on the basis of segmental mismatch alone, of course; the critical information about whether suprasegmental cues to lexical stress contribute to word recognition is provided by comparing the control to a stress‐mismatching condition. In this condition, the fragment primes also matched the first portion of the target word in their segmental composition, but mismatched them in stress (e.g. ADmi‐ before admiRAtion). Stress in these words was solely cued through suprasegmental information, as the vowels in matching and mismatching primes were identical. Here the results depended on the lan­ guage: Dutch and Spanish listeners responded more slowly to target words after a fragment prime with a different stress pattern, compared to after an unrelated prime. For these language users, the suprasegmental differences in stress pattern weighed enough to undo the advantage that the match in segmental overlap had provided. When fragment primes were monosyllabic (a condition in the study with Dutch listeners, but not in the Spanish study) this inhibitory effect was not observed. The monosyllabic primes thus gave Dutch listeners insufficient supra­ segmental information to favor stress competitors over the target words. Together, the results show that Dutch and Spanish listeners can use suprasegmental infor­ mation about lexical stress in the recognition of words, although they do so more effectively if two syllables are provided. In Dutch and Spanish, both segmental and suprasegmental information thus provide support for word representations during recognition. In contrast, the results again showed English listeners to be less efficient in using the suprasegmental cues to lexical stress in lexical processing (Cooper, Cutler, & Wales, 2002). While English listeners’ responses had also been facil­ itated in the matching conditions, their results for the stress‐mismatching conditions differed from those observed for Dutch and Spanish listeners. Bisyllabic stress‐mismatching primes crucially did not inhibit recognition. They also did not elicit facilitatory priming, however, suggesting that the relative contributions of the segmental match and the suprasegmental mismatch may have offset one another. English listeners, like Dutch listeners, also obtained less stress information from monosyllabic primes than from bisyllabic primes; indeed, they showed facilitation for stress‐mismatching monosyllabic primes. In English word identification decisions, suprasegmental information about lexical stress may therefore be outweighed by segmental information. Alternatively, as

250  Perception of Linguistic Properties a result of listeners' experience of the English vocabulary structure reviewed earlier, suprasegmental cues may simply be less efficiently processed than seg­ mental cues. Note, though, that the stress distinction in the English study may also be held to have been intrinsically less useful to listeners. In the English study, bisyllabic primes differed in whether they had primary or secondary stress on the first syl­ lable, with the second syllable being constantly weak (with schwa). Such contrasts can be obtained in English, but there are no contrasts such as those between the full vowels in the first two syllables of Dutch OCtopus and okTOber or Spanish PRINcipe “prince” and prinCIpio “beginning,” in which the vowels remain the same although the primary stress shifts from first to second syllable. In English OCtopus versus ocTOber, the vowels are not the same (the medial syllable of OCtopus has schwa) so that for this reason sufficiently there is no competition. Half the word pairs in the Dutch study were similar in structure to the English pairs, but the remaining half offered listeners a less subtle and hence potentially more useful contrast. Again, what the vocabulary offers in Dutch (or Spanish), com­ pared to English, is intrinsically more useful to listeners than what the English vocabulary can provide. In summary, these priming studies confirm the “offline” findings that supraseg­ mental cues to lexical stress can aid spoken‐word recognition, as long as the statistical payoff provided by the vocabulary is significantly strong. However, cross‐modal priming studies also suffer from a limitation: they cannot inform researchers whether such cues are evaluated rapidly and efficiently enough to facilitate spoken‐word recognition in continuous speech. In cross‐modal priming, the response measure is time to accept that the printed target is indeed a real word, with the visual presentation being a separate operation from the auditory processing. Another experimental paradigm that combines visual and auditory processing, but in a dependent rather than disjointed manner, is eye tracking, in which participants are instructed to look at a visual display and their looks are tracked. As speech unfolds over time, information about a spoken word gradually becomes available and constrains its identity, and if the display contains a picture (Allopenna, Magnuson, & Tanenhaus, 1998) or a printed string (McQueen & Viebahn, 2007) corresponding to an incoming spoken form, looks to that part of the display will accumulate as a function of the information availability. In other words, this technique offers a way of tracking the uptake of speech input across time. Thus the priming and eye‐tracking methods offer complementary views of the spoken‐word recognition process: eye tracking reveals what speech information is available when, while priming provides a record of what active (inhibitory) competition has occurred. The full picture requires both sources of information. Looking behavior does allow tracking of the degree to which other candidate words are considered over time, of course; listeners immediately respond to even fine‐grained sub‐phonemic information by adjusting their looking patterns (Dahan et al., 2001; McMurray, Tanenhaus, & Aslin, 2002). The proportion of looks to com­ petitors in a display shows the degree to which listeners simultaneously consider multiple candidates (and potential candidates are not necessarily limited to words

Lexical Stress  251 in the current display; Dahan et al., 2001; Magnuson et al., 2007). Such effects on looking patterns likely result from active lexical competition, even if the task offers no direct measure of the consequent inhibition. Recent eye‐tracking work has shown listener entrainment to prosodic structure (Brown et al., 2015), and studies in three languages have confirmed that supraseg­ mental cues to lexical stress can influence spoken‐word recognition (Jesse, Poellmann, & Kong, 2017; Reinisch, Jesse, & McQueen, 2010; Sulpizio & McQueen,  2012). Participants in these three experiments heard sentences containing a member of a critical stress pair (e.g. Click on the word ADmiral) while the fixations of their eyes on a computer screen were tracked. The display would then include the printed target word (admiral), its stress competitor (admiration), and two unrelated words; participants simply followed the instructions to click on the mentioned word. The first two syllables of the competitor overlapped segmen­ tally with the target word, but differed suprasegmentally, with words’ target or competitor roles counterbalanced across trials. The eye‐movement data showed that Dutch, Italian, and English listeners fixated on target words (in their respec­ tive native language) more frequently than competitors even before disambigu­ ating segmental information was available. Listeners from all these three languages can thus process suprasegmental cues to lexical stress efficiently enough to facilitate recognition of the spoken word. Relative efficiency of processing segmental compared to suprasegmental cues as an explanation of the English processing results can therefore be rejected. Nonetheless, cross‐language differences in the contribution of suprasegmental information were once again observed. In Dutch (Reinisch, Jesse, & McQueen, 2010), there was an effect of stress pattern in that competitors with primary stress on the initial syllable attracted more looks than competitors with secondary stress, and in Italian (Sulpizio & McQueen,  2012) there was an effect of default stress pattern (to be described in detail); but in the English experiment (Jesse, Poellmann, & Kong, 2017) and also in a replication by Kong and Jesse (2017) there was no sign of a stress pattern effect. As argued by Jesse and colleagues, this pattern suggests that, while cue processing can be equally efficient across languages, English lis­ teners attach greater weight to the segmental than to the suprasegmental information they receive, an imbalance that is absent from processing by the Dutch or Italian listeners. We return to this proposal in the conclusion to the chapter. It therefore appears that, with sufficiently sensitive techniques, evidence of effi­ cient suprasegmental stress cue processing can always be observed. The striking differences across languages in many studies and their obvious links to vocabu­ lary structure signal that stress perception itself is a different issue from whether stress is useful, and for what. Note that there can be many ways in which native listeners show that they have efficiently registered their language’s stress features. In the case of English, the most telling evidence comes from the speech segmentation literature, where the predominance of stress‐initial words in English speech (Cutler & Carter, 1987) is the presumed underpinning of the fact that listeners who are locating word bound­ aries in that language use stressed syllables as indicating the beginning of a new

252  Perception of Linguistic Properties word (Cutler & Norris, 1988). Also, word‐class stress regularities (final stress in English bisyllables is significantly more common for verbs than for nouns) form part of stored lexical representations (Arciuli & Slowiaczek, 2007). Indeed, English speakers exhibit knowledge not only of the major stress placement probabilities, but also of subtle stress‐led probabilities such as that trisyllabic words with [i] in their final syllable are more likely to bear antepenultimate (cavity, recipe) than pen­ ultimate stress (safari, bikini; Moore‐Cantwell & Sanders, 2017). Further, they are capable of learning novel rules based on stress, such as consonantal occurrence restrictions that are dependent on trochaic or iambic stress contexts (White et al., 2018). Stress errors in speech production are not overlooked by listeners, but induce misperceptions (Bond & Garnes,  1980; Cutler,  1980; Fromkin,  1976). All these abilities rest on perceptual processing of stress, which is not signaled by seg­ mental structure alone. Indeed, participants in English word recognition experiments can be seen to be making use of suprasegmental cues to stress; they just do not always do so as effi­ ciently as listeners from other lexical‐stress language communities. Cooper, Cutler, and Wales (2002), besides their cross‐modal priming studies, conducted a simpler “offline” task in which listeners heard word‐initial syllables extracted from pairs of words with segmentally identical but suprasegmentally differing initial sylla­ bles (such as hum‐ from HUmid versus huMANE), and chose which member of the pair had in each instance been the source word. Mattys (2000) had also performed such a task with initial syllables of trisyllabic pairs such as PROsecutor versus proseCUtion. In neither study were such pairs of single syllables reliably distinguished at an above‐chance level, and in Cooper et al.’s data, the most striking finding was that Dutch listeners proficient in English outperformed the native English listeners with the English materials, in particular by correctly identifying noninitially stressed cases such as hum‐ from huMANE (where the English native listeners’ responses did not differ from chance). Later work showed that German listeners proficient in English likewise performed above chance in correctly classifying these noninitially stressed English word fragments (Yu, 2020). Further, a similar judgment task in Spanish revealed that English listeners failed to use the F0 and durational cues to Spanish stress placement to the extent that native Spanish lis­ teners did (Ortega‐Llebaria, Gu, & Fan, 2013). However, more detailed analyses (Cutler et  al.,  2007) of Cooper, Cutler, and Wales’s (2002) English data revealed that those listeners had indeed made use (if somewhat inefficiently) of stress cues. Measurements showed that, although the pairs differed significantly in the initial syllables’ duration, F0, amplitude, and spectral tilt, it was the F0 cues that most reliably distinguished the pairs (separate evidence from a gating task with Dutch listeners had also proved the F0 cues to be most informative). Correlations across item means then showed that Cooper et al.’s Dutch listeners made good use of each type of cue, in that the degree of acoustic difference between pairs correlated with the likelihood of a correct decision. However, the results for the native English‐speaking listeners differed: the F0 cues were appropriately exploited, but the weaker cues were not. 
In fact, some correlations seemed hard to explain in that they were in the opposite direction

Lexical Stress  253 from what might have been predicted; unstressed syllables were significantly shorter, and had less spectral tilt, than stressed syllables, but the longer such a syl­ lable was and the greater its spectral tilt, the more likely English listeners were to correctly judge it as unstressed. Considering that longer duration, and clearer information in the upper spectral ranges, should each have increased listeners’ opportunity to process F0, it may be that these paradoxical results are actually cues to what the English listeners were doing: correctly exploiting (only) the most effec­ tive of the stress cues, that is, F0. When placed in a situation where the only information relevant to performing a task is information that they don’t often use in this way, English listeners can do their best and at least exploit the clearest and most reliable cue. In other languages, too, stress perception abilities can be drawn upon in the appropriate circumstances, but otherwise not. The Italian eye‐tracking study of Sulpizio and McQueen (2012), confirming cross‐modal priming findings by Tagliapietra and Tabossi (2005), showed that Italian listeners could certainly use suprasegmental stress cues in recognizing words. However, they did not use them for every word; they used them only for those experimental stimuli that had ante­ penultimate stress (e.g. COmico “funny”). This may seem arbitrary, but in fact is not, since this stress pattern is an exception to the general stress rules for Italian. With words in which the more common default rules applied (e.g. coMIzio “meeting”), listeners ignored the suprasegmental cues. Relatedly, a study in Turkish (Domahs et al., 2013) measured ERPs to examine listeners’ responses to hearing correctly stressed words versus incorrect stress realized by manipulating F0, duration, and amplitude together; note that ERP evidence confirms that Turkish stress minimal pairs such as BEbek “a district of Istanbul” vs beBEK “baby” can be distinguished on these suprasegmental cues alone (Zora, Heldner, & Schwarz, 2016). In Domahs et  al.’s study, different types of violations led to different responses. The expected default stress placement in Turkish is final, and violations of default stress modulated a P300 effect. In contrast, violations of stress on words with an exceptional stress placement (e.g. initial), that had to be stored as part of their individual phonological representations, produced an N400 effect. Default stress patterns are therefore processed differently than lexically defined excep­ tions, in Turkish (classified as fixed stress given the predominance of the default case) as in Italian (classified as lexical stress since predominant patterns differ with word length). This ERP research has thus established that listeners from a fixed‐stress lan­ guage background can also process suprasegmental cues when necessary. Further, though users of fixed‐stress languages show stress “deafness” by being unable to recall stress placement long enough to perform an ABX choice, they can correctly perform the simpler AX discrimination (same–different, e.g. bopeLO–boPElo; Dupoux et al., 1997). Vroomen, Tuomainen, and de Gelder (1998) likewise showed that Finnish listeners could use F0 cues to word‐initial stress in learning the “words” of an artificial language; and word‐final suprasegmental cues are exploited in continuous‐speech segmentation by listeners from the fixed‐final lan­ guages French (Tyler & Cutler, 2009) and Turkish (Kabak, Maniwa, & Kazanina,
1999). In Hungarian, an ERP study of suprasegmental cue processing by Garami et al. (2017) showed that oddball detection, as indicated by the mismatch negativity component, differed between real and pseudo‐words; cues were processed more efficiently in pseudo‐word than in real‐word input. That is, when listeners knew that an incoming item was a real word, they evaluated the stress cues differently; indirectly, this shows that they could hear and process cues to stress. The way that listeners process these cues is thus in no way dependent on language‐specific perceptual ability; rather it is language‐specific vocabulary training that determines the attention that stress cues receive. This is as true for fixed‐ as for lexical‐stress languages.

New horizons for stress in speech perception
As the previous section revealed, new techniques have allowed recent research in speech perception to substantially deepen our understanding of the role of stress in the processing of spoken words. We expect that this trend will continue. ERP research in particular will expand in this field, with lexical processing of suprasegmental information already having been shown to distinguish between minimal‐pair competitors (in Swedish, where accents can signal word number: Söderström et al., 2016). New fields have also opened perspectives from which word stress can be further understood. This includes applied fields such as the processing of prosody in second‐language speech perception and production, which was long a virtually unresearched topic but is now the subject of a rapidly expanding literature that has already grown too extensive to include here. In this section, we report briefly on two growing areas that shed additional light on the central question of the preceding section: How does word stress contribute to word recognition? These concern speech perception when listening has become difficult because the input is degraded, and speech perception considered as an audiovisual process.

Lexical‐stress perception in degraded speech
Suprasegmental cues to lexical stress can facilitate listeners' recognition of spoken words, by providing more support for the target word over its lexical competitors. This facilitation depends, however, on the quality of the speech signal. Mattys, White, and Melhorn (2005) have argued that word stress may become most relevant in noisy listening situations, such that the use of word stress for the segmentation of continuous speech increases when lexical and semantic cues are degraded by noise. However, when the quality of the available stress cues themselves is artificially degraded, listeners may default to the pattern most common in their native language. In a second experiment in their study described above, Sulpizio and McQueen (2012) taught Italian listeners to associate nonsense shapes with novel nonwords, with stress on the penultimate (e.g. toLAco, the default pattern) or on the antepenultimate syllable (TOlaco, the exceptional pattern). Amplitude and duration cues were neutralized in the training nonwords, while F0
cues remained intact. In a test phase, participants were then instructed to click on a particular shape, and their fixations on that shape or shapes associated with a stress competitor or a distractor were tracked. During this test, the nonwords had only F0 stress cues, as in training, or were fully intact. With only F0 cues, nonwords with penultimate stress competed more for recognition than those with antepenultimate stress. That is, with insufficient acoustic support for a definitive answer, participants showed a general preference to interpret the nonwords as bearing the default stress. Knowledge of vocabulary patterns is presumably helpful in difficult listening conditions outside the laboratory as well.
One case in which suprasegmental cues to prosody are definitely degraded is when listening occurs through cochlear implants (CIs). In particular, the spectral and temporal fine‐structure information needed for the perception of pitch is reduced in cochlear‐implant listening, for example, as it is discarded by the signal‐processing algorithms implemented in CIs (Smith, Delgutte, & Oxenham, 2002; Zeng et al., 2005). CI users thus perceive prosodic contrasts less well than normal‐hearing listeners (e.g. Holt & McDermott, 2013; Holt, Demuth, & Yuen, 2016; Meister et al., 2009; Morris et al., 2013; Peng, Lu, & Chatterjee, 2009). Swedish normal‐hearing listeners are therefore better, for example, than CI users at recognizing words from minimal stress pairs in a two‐alternative forced choice (Morris et al., 2013). This performance difference was exacerbated when speech‐shaped noise was added, as then only the performance of the CI users declined.
Among CI users, those with residual low‐frequency hearing often perform better in perceptual tasks than those without residual hearing (Ching, van Wanrooy, & Dillon, 2007; Gifford et al., 2012; Woodson et al., 2010), including on tasks dependent on the processing of suprasegmental information (Kong, Stickney, & Zeng, 2005; Zhang, Dorman, & Spahr, 2010). For example, both CI users with and without residual hearing use stressed syllables in English as cues to word onsets in segmenting continuous speech (Spitzer et al., 2009). However, only the performance of CI users with residual hearing changed as a function of whether or not pitch was provided as a cue. Low‐frequency residual hearing improves the perception of pitch (e.g. Dorman et al., 2008; Kong, Stickney, & Zeng, 2005), and as such may enhance CI listeners' ability to facilitate speech perception through prosodic information.
While these results suggest that residual hearing provides a benefit for discriminating prosodic contrasts, or even enables it, these studies cannot speak to how prosodic cues in speech input are processed by CI users, in particular whether or not any CI users can process prosodic cues effectively enough for these to have an immediate influence on ongoing lexical processing. To address this question in the case of lexical stress, Kong and Jesse (2017) simulated the degradation of speech in CI listening by using (eight‐channel) noise‐vocoded speech. Normal‐hearing participants were first trained to recognize spectrally vocoded speech, before being tested, in the eye‐tracking paradigm also used by Jesse, Poellmann, and Kong (2017), on vocoded speech with and without supplementary low‐pass filtered speech information. Critical word pairs (e.g.
ADmiral admiRAtion) differed again in primary or secondary stress on the first syllable, with unstressed, unreduced
second syllables being identical across a pair and segmental difference not appearing until at least the third syllable.
In the vocoder‐only condition, participants could not distinguish target from competitor words using lexical stress information alone. In Jesse, Poellmann, and Kong's (2017) study with intact speech, only pitch and amplitude differed across the word‐stress patterns; but here noise‐channel vocoding discarded the fine‐structure information needed to access pitch, and the remaining amplitude cues did not suffice. There was also no sign that listeners had applied a general strategy, for example choosing word‐initial stress as the most common pattern for English words. In contrast, in the condition where vocoded speech had been supplemented with a low‐pass filtered version of the speech materials, listeners showed that they were able to effectively use suprasegmental cues to lexical stress to determine the target word. Importantly, these differences across listening conditions were not based on differences in access to segmental information. The preference for fixating a competitor word over phonologically unrelated distractor words was the same across listening conditions, indicating a similar degree of lexical competition due to segmental overlap.
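For readers unfamiliar with the manipulation, the kind of noise‐channel vocoding used in such CI simulations can be sketched as follows. This is a generic illustration only, assuming NumPy and SciPy, a sampling rate above 16 kHz, logarithmic channel spacing, and Hilbert‐envelope extraction; the exact filters and parameters of the studies cited above may differ.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, n_channels=8, f_lo=100.0, f_hi=8000.0):
    """Replace temporal fine structure with noise, keeping per-band envelopes."""
    signal = np.asarray(signal, dtype=float)
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # channel band edges
    rng = np.random.default_rng(0)
    out = np.zeros(len(signal))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, signal)                # analysis band
        envelope = np.abs(hilbert(band))               # amplitude envelope only
        carrier = sosfiltfilt(sos, rng.standard_normal(len(signal)))
        out += envelope * carrier                      # envelope-modulated noise
    # Rough level matching to the input signal.
    return out * np.sqrt(np.mean(signal ** 2) / (np.mean(out ** 2) + 1e-12))

Because only the band envelopes survive this processing, amplitude cues to stress remain available while the fine‐structure cues to pitch do not, which is exactly the situation described above for the vocoder‐only condition.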

Lexical stress in visual speech
In many of our daily conversations, we hear and see the speaker. It is well established that listeners use available information from both modalities to achieve more robust recognition of speech (e.g. Jesse & Massaro, 2010; Jesse et al., 2000; MacLeod & Summerfield, 1987; Sumby & Pollack, 1954). While most work in the domain of audiovisual speech perception has focused on how visual speech aids recognition by providing segmental cues, visual speech can also provide prosodic information (e.g. Bernstein, Eberhardt, & Demorest, 1989; Dohen et al., 2004; Munhall et al., 2004; Srinivasan & Massaro, 2003). Minimal word‐stress pairs can, for example, be distinguished based on speech‐reading alone. Thus Risberg and Lubker (1978) showed that Swedish normal‐hearing and hearing‐impaired adolescents can (equally well) distinguish minimal pairs that differ in stress placement (e.g. BAnan "track" vs. baNAN "banana"). Likewise, English adults in a speech‐reading study by Scarborough et al. (2009) performed above chance when asked to distinguish noun–verb stress pairs (e.g. [a] SUBject vs. [to] subJECT) as well as reiterant speech versions of these minimal pairs (e.g. FERfer vs. ferFER). Together, these results suggest that lexical stress has visual correlates that listeners can use in spoken‐word recognition. For instance, in Scarborough and colleagues' study, stressed syllables were produced with a larger lip opening and with larger and faster chin movements than unstressed syllables with reduced vowels. However, it is unclear which of these visual correlates perceivers relied on.
Degrees of lexical stress can also be recognized from visual speech. Presenting the first two syllables of Dutch word pairs as visual speech, Jesse and McQueen (2013) showed that participants could distinguish primary from secondary stress preceding an unstressed syllable (e.g. CAvi‐ from CAvia "guinea pig" vs. kavi‐ from kaviAAR "caviar") and unstressed‐primary from secondary‐unstressed stress
sequences (e.g. proJEC‐ from proJECtor "projector" vs. projec‐ from projecTIEL "projectile"). The former distinction was, however, possible only when the critical fragments came from words with phrase‐level emphasis. Phrase‐level emphasis falls onto syllables with primary stress and increases articulatory effort (de Jong, 1995; Fowler, 1995; Kelso et al., 1985). Cues in primary‐stressed syllables were then strengthened by the phrase‐level emphasis. The otherwise subtle difference between the visual correlates of suprasegmental differences across syllables thus became perceptible.
Together, these studies demonstrate that information about the lexical stress pattern of a word can also be obtained from seeing a speaker. Perceiving lexical stress from visual speech is not reliant on segmental cues, but rather on the visual correlates of suprasegmental cues to lexical stress, and these can be sufficient for recognition – at least in languages where word recognition relies more heavily on such cues.

Conclusions
Word stress is not language universal, and even among languages that do mark stress within words there is variety: the placement of stress can be fixed, or it can be variable. If it is fixed, the stipulated position may be at a word's edge or not, with different implications for a demarcative function for stress; if it is variable, it may still be affected by phonological factors such as the availability of vowel reduction. Either type of stress may show traits of the other too: fixed‐stress languages may have some minimal pairs involving loan words or proper names; lexical‐stress languages can display strong tendencies toward preferred positions for the placement of primary stress. All these features are captured in each language's vocabulary, and, as we have argued, it is via learning of a vocabulary that listeners come to know how they should make use of cues to stress in speech perception.
The study of stress in speech perception is expanding, and we particularly look forward to data from languages previously unrepresented in this literature. For now, however, our review has largely drawn from English and related languages. Even here we find plenty of data to establish our argument that the use of speech cues to stress is driven by vocabulary structure. We drew particular attention to subtle mismatches in the processing of suprasegmental cues to stress in the closely related languages Dutch and English. If stress location is ignored, the Dutch vocabulary shows a higher degree of within‐word embedding than the English vocabulary, but if the computation takes stress location into account this asymmetry is greatly reduced, due to significant reduction in the Dutch embedding counts. The vocabulary pattern thus suggests that Dutch listeners will be able to speed spoken‐language processing by attending to where stress falls to a greater degree than English listeners. This prediction offers an explanation for a remarkable asymmetry in the processing data; studies of the uptake of suprasegmental cues to stress show that both English and Dutch listeners can process these, and equally

258  Perception of Linguistic Properties efficiently, but studies focusing on the resolution of active competition, by inhibi­ tion of competing forms, reveal that such inhibition via suprasegmental cues occurs only in Dutch. English and Dutch listeners can both use suprasegmental cues to help identify words, but English listeners do not further use the same cues to suppress competitors, while Dutch listeners do. We have argued that this asymmetric pattern appears because, in English, with its lesser degree of embedding, stress information has a lower payoff in the word‐ identification process. Listeners can afford to let suprasegmental information regarding word identity be outweighed by segmental information. Note that the relative weighting of different sources of information in lexical access is a well‐ studied topic; many lines of research have demonstrated, for instance, that conso­ nantal information yields stronger cues to lexical identity than vowel information does (see Nazzi & Cutler, 2019, for a review). This weighting effect for segments appears across languages differing widely in phoneme repertoire makeup, and is seen in findings such as listeners’ greater willingness to turn nonwords (shevel, eltimate) into real words by changing a vowel (shovel, ultimate) than by changing a consonant (level, estimate; van Ooijen,  1996) or in the greater discoverability of “words” in an artificial language with consonant regularities than in one with vowel regularities (Mehler et al., 2006). There are orthographies (such as Hebrew) in which consonants may be written and vowels omitted, but there are no cases of the reverse pattern. The vowel/consonant asymmetry is also driven by the vocab­ ulary; for instance, while young infants at first favor listening to vocalic over con­ sonantal information (held to be due to the talker identity cues carried on vowels), this infant pattern of vowel preference switches to the adult pattern of consonant preference once the compilation of an initial vocabulary begins, toward the end of the first year of life (Nazzi & Cutler, 2019). Vowel/consonant preferences relate directly to the perception of stress because suprasegmental cues to stress patterns are primarily carried on vowels: F0 movements are realized across vowels, amplitude is principally realized in vowel articulation, and vowels make a significantly greater contribution to syllable dura­ tion than do consonants. Thus downgrading the contribution of vocalic information, which has been established to happen in many languages, will automatically result in a corresponding downgrading of suprasegmental information. From this point of view, it may seem that these lexical weighting patterns, in which supra­ segmental information contributes little to word identification, in fact constitute the default case. Indeed, among all the world’s languages, those with lexical stress form a minority, outnumbered by the combination of fixed‐stress languages and languages without word‐level stress. Even among lexical‐stress languages, how­ ever, suprasegmental cues seem to be fully exploited only when their use notice­ ably speeds processing by significantly reducing the lexical competitor count. The available option of modulating lexical competition by weighting all cues (seg­ mental or suprasegmental) equally is chosen when (and only when) the vocabu­ lary renders it profitable. 
As the section on “Lexical stress and the vocabulary” showed, the size (and relative benefit) of this competitor reduction can be estimated by computing word

Lexical Stress  259 overlap in the vocabulary in the light of frequency statistics for word usage. Reliable large corpora, especially of spoken language, so far exist for only relatively few languages; but if there is one thing that is growing fast worldwide, it is the collection and exploitation of large data sets, so the prediction is that in the future this particular data problem should be solved. Then vocabulary‐based predictions about the use of suprasegmentals in lexical processing can be made, and subse­ quently tested, for many more lexical‐stress languages. We note that at the level that creates the general processing strategies at issue here, the phonological patterns characterizing a vocabulary are largely constant across varieties. For English, our reports suggesting that suprasegmental cues are downgraded in speech perception have come from different varieties (American English, e.g. Jesse, Poellmann, & Kong, 2017; British English, e.g. Fear, Cutler, & Butterfield, 1995; Australian English, e.g. Cooper, Cutler, & Wales, 2002), but they all paint the same picture. Indeed, crucial findings directly replicate across vari­ eties (for instance, McQueen, Norris, & Cutler’s [1994] demonstration of active interword competition was conducted in British English, but an identical pattern, including strong effects of word‐stress placement, appears in American English: Warner et  al.,  2018). These varieties of English essentially share all their major vocabulary patterns, notwithstanding the existence of individual variety‐specific lexical items. It is not impossible that varieties of one language will differ in the degree to which suprasegmental information is used in speech perception; but analysis of variety‐specific dictionaries can quickly establish whether relative competition patterns predict this to happen. Our knowledge of the processing of suprasegmental information in speech per­ ception has significantly expanded in recent years. The dependence of processing patterns on the vocabulary is clear, and with it the path to yet further discoveries.

Acknowledgments
AC acknowledges support from the Australian Research Council Centre of Excellence for the Dynamics of Language (CE140100041).

REFERENCES Allopenna, P. D., Magnuson, J. S., & Tanenhaus, M. K. (1998). Tracking the time course of spoken word recognition using eye movements: Evidence for continuous mapping models. Journal of Memory and Language, 38(5), 419–439. Arciuli, J., & Slowiaczek, L. M. (2007). The where and when of linguistic word‐level
prosody. Neuropsychologia, 45(11), 2638–2642. Berinstein, A. E. (1979). A cross‐linguistic study on the perception and production of stress. UCLA Working Papers in Phonetics, 47. Bernstein, L. E., Eberhardt, S. P., & Demorest, M. E. (1989). Single‐channel

260  Perception of Linguistic Properties vibrotactile supplements to visual perception of intonation and stress. Journal of the Acoustical Society of America, 85, 397–405. Bond, Z. S., & Garnes, S. (1980). Misperceptions of fluent speech. In R. Cole (Ed.), Perception and production of fluent speech (pp. 115–132). Hillsdale, NJ: Lawrence Erlbaum. Brown, M., Salverda, A. P., Dilley, L. C., & Tanenhaus, M. K. (2015). Metrical expectations from preceding prosody influence perception of lexical stress. Journal of Experimental Psychology: Human Perception and Performance, 41, 306–323. Bruggeman, L., & Cutler, A. (2016). Lexical manipulation as a discovery tool for psycholinguistic research. In C. Carignan & M. D. Tyler (Eds.), Proceedings of the Sixteenth Australasian International Conference on Speech Science and Technology (pp. 313–316). Parramatta, NSW: Australasian Speech Science and Technology Association. Ching, T. Y. C., van Wanrooy, E., & Dillon, H. (2007). Binaural–bimodal fitting or bilateral implantation for managing severe to profound deafness: A review. Trends in Amplification, 11, 161–192. Cluff, M. S., & Luce, P. A. (1990). Similarity neighborhoods of spoken two‐syllable words: Retroactive effects on multiple activation. Journal of Experimental Psychology: Human Perception and Performance, 16, 551–563. Cooper, N., Cutler, A., & Wales, R. (2002). Constraints of lexical stress on lexical access in English: Evidence from native and non‐native listeners. Language and Speech, 45, 207–228. Cutler, A. (1980). Errors of stress and intonation. In V. A. Fromkin (Ed.), Errors in linguistic performance: Slips of the tongue, ear, pen and hand (pp. 67–80). New York: Academic Press. Cutler, A. (1986). Forbear is a homophone: Lexical prosody does not constrain lexical access. Language and Speech, 29, 201–220.

Cutler, A. (2005). Lexical stress. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 264–289). Oxford: Blackwell. Cutler, A., & Bruggeman, L. (2013). Vocabulary structure and spoken‐word recognition: Evidence from French reveals the source of embedding asymmetry. In F. Bimbot, C. Cerisara, C. Fougeron, et al. (Eds.), Proceedings of the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013) (pp. 2812–2816). Lyon: International Speech Communication Association. Cutler, A., & Carter, D. M. (1987). The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language, 2(3–4), 133–142. Cutler, A., & Norris, D. (1988). The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance, 14, 113–121. Cutler, A., Norris, D., & Sebastián‐Gallés, N. (2004). Phonemic repertoire and similarity within the vocabulary. In S. Kin & M. J. Bae (Eds.), Proceedings of the 8th International Conference on Spoken Language Processing (Interspeech 2004‐ ICSLP) (pp. 65–68). Seoul: Sunjijn. Cutler, A., Otake, T., & Bruggeman, L. (2012). Phonologically determined asymmetries in vocabulary structure across languages. Journal of the Acoustical Society of America, 132, EL155–160. Cutler, A., & Pasveer, D. (2006). Explaining cross‐linguistic differences in effects of lexical stress on spoken‐word recognition. In R. Hoffmann & H. Mixdorff (Eds.), Speech prosody 2006: Third International Conference. Dresden: TUD Press. Cutler, A., & van Donselaar, W. (2001). Voornaam is not (really) a homophone: Lexical prosody and lexical access in Dutch. Language and Speech, 44, 171–195. Cutler, A., Wales, R., Cooper, N., & Janssen, J. (2007). Dutch listeners’ use of

Lexical Stress  261 suprasegmental cues to English stress. In J. Trouvain & W. J. Barry (Eds.), Proceedings of the 16th International Congress of Phonetics Sciences (ICPhS 2007) (pp. 1913–1916). Dudweiler, Germany: Pirrot. Dahan, D., Magnuson, J. S., Tanenhaus, M. K., & Hogan, E. M. (2001). Subcategorical mismatches and the time course of lexical access: Evidence for lexical competition. Language and Cognitive Processes, 16, 507–534. de Jong, K. J. (1995). The supraglottal articulation of prominence in English: Linguistic stress as localized hyperarticulation. Journal of the Acoustical Society of America, 97, 491–504. Dogil, G. (1999). The phonetic manifestation of word stress in Lithuanian, Polish and German and Spanish. In H. van der Hulst (Ed.), Word prosodic systems in the languages of Europe (pp. 273–311). Berlin: Mouton de Gruyter. Dohen, M., Lœvenbruck, H., Cathiard, M.‐A., & Schwartz, J.‐L. (2004). Visual perception of contrastive focus in reiterant French speech. Speech Communication, 44(1–4), 155–172. Domahs, U., Genc, S., Knaus, J., et al. (2013). Processing (un‐)predictable word stress: ERP evidence from Turkish. Language and Cognitive Processes, 28, 335–354. Dorman, M. F., Gifford, R. H., Spahr, A. J., & McKarns, S. A. (2008). The benefits of combining acoustic and electric stimulation for the recognition of speech, voice and melodies. Audiology & Neuro‐ Otology, 13, 105–112. Dupoux, E., Pallier, C., Sebastian, N., & Mehler, J. (1997). A destressing “deafness” in French? Journal of Memory and Language, 26(3), 406–421. Dupoux, E., & Peperkamp, S. (2002). Fossil markers of language development: Phonological “deafnesses” in adult speech processing. In J. Durand (Ed.), Phonetics, phonology, and cognition (pp.

168–190). Oxford: Oxford University Press. Dupoux, E., Peperkamp, S., & Sebastián‐ Gallés, N. (2001). A robust method to study stress “deafness.” Journal of the Acoustical Society of America, 110, 1606–1618. Fear, B. D., Cutler, A., & Butterfield, S. (1995). The strong/weak syllable distinction in English. Journal of the Acoustical Society of America, 97, 1893–1904. Fortescue, M., Mithun, M., & Evans, N. (Eds.). (2017). The Oxford handbook of polysynthesis. Oxford: Oxford University Press. Fowler, C. A. (1995). Acoustic and kinematic correlates of contrastive stress accent in spoken English. In F. Bell‐Berti & L. J. Raphael (Eds.), Producing speech: Contemporary issues: For Katherine Safford Harris (pp. 355–373). New York: AIP Press. Friedrich, C. K., Kotz, S. A., Friederici, A. D., & Gunter, T. C. (2004). ERPs reflect lexical identification in word fragment priming. Journal of Cognitive Neuroscience, 16, 541–552. Fromkin, V. A. (1976). Putting the emPHAsis on the wrong sylLABle. In L. M. Hyman (Ed.), Studies in stress and accent (pp. 15–26). Los Angeles: Department of Linguistics, University of Southern California. Garami, L., Ragó, A., Honbolygó, F., & Csépe, V. (2017). Lexical influence on stress processing in a fixed‐stress language. International Journal of Psychophysiology, 117, 10–16. Gifford, R. H., Dorman, M. F., Brown, C. A., & Spahr, A. J. (2012). Hearing, psychophysics, and cochlear implantation: Experiences of older individuals with mild sloping to profound sensory hearing loss. Journal of Hearing Science, 2, 9–17. Holt, C. M., Demuth, K., & Yuen, I. (2016). The use of prosodic cues in sentence processing by prelingually deaf users of

262  Perception of Linguistic Properties cochlear implants. Ear and Hearing, 37, e256–262. Holt, C. M., & McDermott, H. J. (2013). Discrimination of intonation contours by adolescents with cochlear implants. International Journal of Audiology, 52, 808–815. Jesse, A., & Massaro, D. W. (2010). The temporal distribution of information in audiovisual spoken‐word identification. Attention, Perception, & Psychophysics, 72, 209–225. Jesse, A., & McQueen, J. M. (2013). Suprasegmental lexical stress cues in visual speech can guide spoken‐word recognition. Quarterly Journal of Experimental Psychology, 67, 793–808. Jesse, A., Poellmann, K., & Kong, Y.‐Y. (2017). English listeners use suprasegmental cues to lexical stress early during spoken‐word recognition. Journal of Speech, Language, and Hearing Research, 60, 190–198. Jesse, A., Vrignaud, N., Cohen, M. A., & Massaro, D. W. (2000). The processing of information from multiple sources in simultaneous interpreting. Interpreting, 5, 95–115. Jongenburger, W. (1996). The role of lexical stress during spoken‐word processing [PhD thesis, University of Leiden]. The Hague: Holland Academic Graphics. Kabak, B., Maniwa, K., & Kazanina, N. (1999). Listeners use vowel harmony and word‐final stress to spot nonsense words: A study of Turkish and French. Laboratory Phonology, 1, 207–224. Kelso, J. A. S., Vatikiotis‐Bateson, E., Saltzman, E. L., & Kay, B. (1985). A qualitative dynamic analysis of reiterant speech production: Phase portraits, kinematics, and dynamic modeling. Journal of the Acoustical Society of America, 77, 266–280. Kong, Y.‐Y., & Jesse, A. (2017). Low‐ frequency fine‐structure cues allow for the online use of lexical stress during spoken‐word recognition in spectrally

degraded speech. Journal of the Acoustical Society of America, 141, 373–382. Kong, Y.‐Y., Stickney, G. S., & Zeng, F.‐G. (2005). Speech and melody recognition in binaurally combined acoustic and electric hearing. Journal of the Acoustical Society of America, 117, 1351–1361. Koster, M., & Cutler, A. (1997). Segmental and suprasegmental contributions to spoken‑word recognition in Dutch. In Proceedings of EUROSPEECH 97 (pp. 2167–2170). Rhodes, Greece: International Speech Communication Association. Lehiste, I., & Peterson, G. (1959). Vowel amplitude and phonemic stress in American English. Journal of the Acoustical Society of America, 31, 428–35. MacLeod, A., & Summerfield, Q. (1987). Quantifying the contribution of vision to speech perception in noise. British Journal of Audiology, 21, 131–141. Magnuson, J. S., Dixon, J. A., Tanenhaus, M. K., & Aslin, R. N. (2007). The dynamics of lexical competition during spoken word recognition. Cognitive Science, 31, 133–156. Marslen‐Wilson, W. (1990). Activation, competition, and frequency in lexical access. In G. T. M. Altmann (Ed.), Cognitive models of speech processing: Psycholinguistic and computational perspectives (pp. 148–172). Cambridge, MA: MIT Press. Mattys, S. L. (2000). The perception of primary and secondary stress in English. Perception & Psychophysics, 62, 253–265. Mattys, S. L., White, L., & Melhorn, J. F. (2005). Integration of multiple speech segmentation cues: A hierarchical framework. Journal of Experimental Psychology: General, 134, 477–500. McMurray, B., Tanenhaus, M. K., & Aslin, R. N. (2002). Gradient effects of within‐ category phonetic variation on lexical access. Cognition, 86, B33–42. McQueen, J. M., Norris, D., & Cutler, A. (1994). Competition in spoken word

Lexical Stress  263 recognition: Spotting words in other words. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 621–638. McQueen, J. M., & Viebahn, M. C. (2007). Tracking recognition of spoken words by tracking looks to printed words. Quarterly Journal of Experimental Psychology, 60, 661–671. Mehler, J., Peña, M., Nespor, M., & Bonatti, L. (2006). The “soul” of language does not use statistics: Reflections on vowels and consonants. Cortex, 42, 846–854. Meister, H., Landwehr, M., Pyschny, V., et al. (2009). The perception of prosody and speaker gender in normal‐hearing listeners and cochlear implant recipients. International Journal of Audiology, 48, 38–48. Moore‐Cantwell, C., & Sanders, L. (2017). Effects of probabilistic phonology on the perception of words and nonwords. Language, Cognition and Neuroscience, 33, 148–164. Morris, D., Magnusson, L., Faulkner, A., et al. (2013). Identification of vowel length, word stress, and compound words and phrases by postlingually deafened cochlear implant listeners. Journal of the American Academy of Audiology, 24, 879–890. Munhall, K. G., Jones, J. A., Callan, D. E., et al. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15, 133–137. Nazzi, T., & Cutler, A. (2019). How consonants and vowels shape spoken‐ language recognition. Annual Review of Linguistics, 5, 25–47. Ortega‐Llebaria, M., Gu, H., & Fan, J. (2013). English speakers’ perception of Spanish lexical stress: Context‐driven L2 stress perception. Journal of Phonetics, 41, 186–197. Peng, S. C., Lu, N., & Chatterjee, M. (2009). Effects of cooperating and conflicting cues on speech intonation recognition by

cochlear implant users and normal hearing listeners. Audiology & Neuro‐ Otology, 14, 327–337. Peperkamp, S., Vendelin, I., & Dupoux, E. (2010). Perception of predictable stress: A cross‐linguistic investigation. Journal of Phonetics, 38, 422–430. Potisuk, S., Gandour, J., & Harper, M. P. (1996). Acoustic correlates of stress in Thai. Phonetica, 53, 200–220. Reinisch, E., Jesse, A., & McQueen, J. M. (2010). Early use of phonetic information in spoken word recognition: Lexical stress drives eye movements immediately. Quarterly Journal of Experimental Psychology, 63, 772–783. Risberg, A., & Lubker, J. (1978). Prosody and speechreading. Speech Transmission Laboratory Quarterly Progress Status Report, 4, 1–16. Scarborough, R., Keating, P., Mattys, S. L., et al. (2009). Optical phonetics and visual perception of lexical and phrasal stress in English. Language and Speech, 52, 135–175. Slowiaczek, L. M. (1990). Effects of lexical stress in auditory word recognition. Language and Speech, 33, 47–68. Slowiaczek, L. M. (1991). Stress and context in auditory word recognition. Journal of Psycholinguistic Research, 20, 465–481. Small, L. H., Simon, S. D., & Goldberg, J. S. (1988). Lexical stress and lexical access: Homographs versus nonhomographs. Perception & Psychophysics, 44, 272–280. Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87–90. Söderström, P., Horne, M., Frid, J., & Roll, M. (2016). Pre‐activation negativity (PrAN) in brain potentials to unfolding words. Frontiers in Human Neuroscience, 10, art. 512. Soto‐Faraco, S., Sebastián‐Gallés, N., & Cutler, A. (2001). Segmental and suprasegmental mismatch in lexical access. Journal of Memory and Language, 45(3), 412–432.

264  Perception of Linguistic Properties Spitzer, S., Liss, J., Spahr, T., Dorman, M., & Lansford, K. (2009). The use of fundamental frequency for lexical segmentation in listeners with cochlear implants. Journal of the Acoustical Society of America, 125, EL236–241. Srinivasan, R. J., & Massaro, D. W. (2003). Perceiving prosody from the face and voice: Distinguishing statements from echoic questions in English. Language and Speech, 46, 1–22. Sulpizio, S., & McQueen, J. M. (2012). Italians use abstract knowledge about lexical stress during spoken‐word recognition. Journal of Memory and Language, 66(1), 177–193. Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215. Suomi, K., Toivanen, J., & Ylitalo, R. (2003). Durational and tonal correlates of accent in Finnish. Journal of Phonetics, 31, 113–138. Tagliapietra, L., & Tabossi, P. (2005). Lexical stress effects in Italian spoken word recognition. In B. G. Bara, L. Barsalou, & M. Bucciarelli (Eds.), Proceedings of the XXVII Annual Conference of the Cognitive Science Society (pp. 2140–2144). Stresa, Italy: Lawrence Erlbaum. Tyler, M. D., & Cutler, A. (2009). Cross‐ language differences in cue use for speech segmentation. Journal of the Acoustical Society of America, 126, 367–376. van Donselaar, W., Koster, M., & Cutler, A. (2005). Exploring the role of lexical stress in lexical recognition. Quarterly Journal of Experimental Psychology Section A, 58, 251–273. van Heuven, V. J. (1988). Effects of stress and accent on the human recognition of word fragments in spoken context: Gating and shadowing. In W. A. Ainsworth & J. N. Holmes (Eds.), Proceedings of Speech ’88, the 7th FASE Symposium (pp. 811–818). Edinburgh: Institute of Acoustics.

van Leyden, K., & van Heuven, V. J. (1996). Lexical stress and spoken word recognition: Dutch vs. English. In C. Cremers & M. den Dikken (Eds.), Linguistics in the Netherlands 1996 (Vol. 13, pp. 159–170). Amsterdam: John Benjamins. van Ooijen, B. (1996). Vowel mutability and lexical selection in English: Evidence from a word reconstruction task. Memory & Cognition, 24, 573–583. Vroomen, J., Tuomainen, J., & de Gelder, B. (1998). The roles of word stress and vowel harmony in speech segmentation. Journal of Memory and Language, 38(2), 133–149. Wagner, A. (2013). Cross-language similarities and differences in the uptake of place information. Journal of the Acoustical Society of America, 133, 4256–4267. Wagner, A., Ernestus, M., & Cutler, A. (2006). Formant transitions in fricative identification: The role of native fricative inventory. Journal of the Acoustical Society of America, 120, 2267–2277. Warner, N. L., Hernandez, G. G., Park, S., & McQueen, J. M. (2018). A replication of competition and prosodic effects on spoken word recognition. Journal of the Acoustical Society of America, 143, 1921. White, K. S., Chambers, K. E., Miller, Z., & Jethava, V. (2018). Listeners learn phonotactic patterns conditioned on suprasegmental cues. Quarterly Journal of Experimental Psychology, 70, 2560–2576. Williams, B. (1985). Pitch and duration in Welsh stress perception: The implications for intonation. Journal of Phonetics, 13, 381–406. Woodson, E. A., Reiss, L. A. J., Turner, C. W., et al. (2010). The Hybrid cochlear implant: A review. Advances in Oto-Rhino-Laryngology, 67, 125–134. Yu, J., Mailhammer, R., & Cutler, A. (2020). Vocabulary structure affects word recognition: Evidence from German listeners. In N. Minematsu, M. Kondo,

Lexical Stress  265 T. Arai, & R. Hayashi (Eds.), Proceedings of Speech Prosody 2020 (pp. 474-478). Tokyo: ISCA. Zeng, F.‐G., Nie, K., Stickney, G. S., et al. (2005). Speech recognition with amplitude and frequency modulations. Proceedings of the National Academy of Sciences of the United States of America, 102, 2293–2298. Zhang, T., Dorman, M. F., & Spahr, A. J. (2010). Information from the voice

fundamental frequency (F0) region accounts for the majority of the benefit when acoustic stimulation is added to electric stimulation. Ear and Hearing, 31, 63–69. Zora, H., Heldner, M., & Schwarz, I.‐C. (2016). Perceptual correlates of Turkish word stress and their contribution to automatic lexical access: Evidence from early ERP components. Frontiers in Neuroscience, 10, 7.

10 Slips of the Ear

Z. S. BOND
Ohio University, United States

Everything's got a moral, if only you can find it.
Lewis Carroll, Alice's Adventures in Wonderland

Most of the time we understand what was said quite well. Occasionally we are not sure of what we heard, when speech seems indistinct or muffled. Sometimes we struggle with obscure content or misjudge emotional tone. And sometimes we experience a slip of the ear: we hear, as clearly as ever, words or phrases that are different from what the talker intended. These perceptual errors are variously known as mishearings, misperceptions, slips of the ear, and sometimes mondegreens,1 a coinage that is developing a specialized meaning of misperceptions of song lyrics.2 Slips of the ear in casual conversation provide an insight into the use of knowledge in speech perception and language understanding. They have a moral, if only we can find it. A typical simple example is:

(1) Talker: I teach speech science
    Listener: Speech signs?

In place of the intended word, the listener recovered a word that seemed to her anomalous, surprising, or somehow inappropriate. The listener then said what she thought she heard, indicating uncertainty and asking for clarification. From slips of the ear of this kind, we are able to make inferences about the way listeners employ their knowledge of their language. Listeners' reports of a clear perception that does not correspond to talkers' utterances can arise only when they employ knowledge in order to recover and resolve unclear or ambiguous utterances. Furthermore, it seems reasonable to argue that features of speech that are perceived or recovered correctly – the perception matches the talker's production – are properties that are salient and useful to listeners in perceiving speech.


Slips of the Ear  267 In this chapter, I want to describe a collection of slips of the ear that my colleagues and I have assembled over the years. For the record, previous descriptions of parts of the collection can be found in Bond (1973, 1999, 2005); Bond and Garnes (1980a,  1980b); Bond and Robey (1983); Garnes and Bond (1975); Shockey and Bond (2007); and Shockey and Bond (2014). In Bond (1999) I attempted to list and describe all slips of the ear that were part of the collection. The collection keeps expanding, with approximately 250 new examples. Here I want to emphasize knowledge of language. As far as possible, I want to add examples that have entered the collection recently. New examples of slips of the ear provide insight yet fit into categories developed in previous work. Except where noted otherwise, all examples come from conversations between adults speaking a variety of American English.

Challenges with observational data Speech errors, slips of the tongue, or slips of the ear occur during ordinary conversation under ordinary circumstances. Because speech is evanescent, researchers who are interested in errors in ordinary speech have typically collected examples on the fly. They notice a speech error of some kind and write it down as quickly and reliably as possible.

Slips of the tongue This methodology has been used to assemble extensive corpora of speech production errors, or slips of the tongue. Students of production errors have repeatedly observed that collecting accurate observational data is challenging. For example, Bock (1996) pointed to the many sources of bias in collecting speech errors. First, an error must be heard. Some errors will be missed, while other reported errors may be false positives. The memory for some kinds of errors may be more accurate than for other kinds. Writing down errors reliably may be easier for word errors than for errors involving phonetic structure (see also Ferber, 1991). Accepting error reports from contributors raises further problems. The many speech errors attributed to the Rev. Spooner were apparently collected by students and colleagues. These reports suggest that there may have been embellishments of errors, deliberate or otherwise, which are responsible for their humor. For example: (2) The lord is a loving shepherd Pardon me madam, this pew is occupied Can I show you to another seat? Problems in collecting slips of the tongue have been somewhat ameliorated by recording technology. It is now possible to access recordings of spontaneous speech, for example of a group that is part of a news program or of talkers who call with questions or comments. Such recordings can be examined at great length.

Even then, listeners may not necessarily agree about either the presence or the nature of some speech errors.

Slips of the ear

Whatever the difficulties for slips of the tongue, the problems in detecting, collecting, and analyzing slips of the ear pose greater challenges. Ideally, a talker says something that is perceived accurately by one listener, an observer, in a conversational group but that is misperceived by another listener, who then indicates puzzlement by repeating the misperception:

(3) Talker: I had an argument with Dave about this.
    Listener: With slaves?

However, most slips of the ear do not occur under ideal circumstances and recording them can lead to bias in a number of ways.
• Misperceptions can be detected only if a listener indicates puzzlement or lack of understanding. Slips of the ear that listeners can resolve easily are not recorded.
• The observer could be in error about the production or the perception.
• Lack of understanding may be caused by a slip of the tongue, that is, the talker may be responsible, not the listener.
• There is no direct access to what a talker has said, only a report or an inference.
• There is no direct access to what a listener has perceived, only a report or (sometimes) an inference.
• Either one of the participants in a conversation or a third party has to write down accurately what was said and what was perceived, raising possible problems with memory and reporting.
• If a collection of slips of the ear includes reports from colleagues and students, it is open to embellishment.
• If a collection does not include reports from many sources, it is based on a narrow selection of talkers and listeners and may not be representative.
• The same considerations apply to websites that collect perceptual errors.
There is a long experimental tradition of asking listeners to respond to deliberately mispronounced words, going back at least to Bagley (1900). The assumption underlying these experiments is that mispronunciations from which listeners can recover easily provide less useful information than mispronunciations that seriously impede recovering words (see e.g. Cole & Jakimik, 1978; Bond & Small, 1983; Lahiri & Marslen-Wilson, 1991; Lahiri & Reetz, 2010). Though these experiments undoubtedly provide information about salient properties or, as some say, "islands of reliability," the experiments do not necessarily mirror language use under ordinary circumstances.

Slips of the Ear  269 Slips of the ear provide this unique perspective. Talkers are more or less careful in what they say as they speak with different listeners; the environments in which talkers are speaking vary considerably from those that are noisy and full of distractions to those that are quiet and focused. Talkers employ different dialects and foreign accents that may or may not be familiar to listeners. The relationships between talkers and listeners vary from intimate to collegial, to novel. In the same way, listener attention, shared knowledge, and expectations vary. These are the ambient circumstances for ordinary conversation. When slips of the ear occur, they point us toward the mental processes at play. In this report, I am doing no more than giving examples for one classification scheme of slips of the ear. Although it would be valuable to be able to describe how common various error types are, at this time frequency counts say as much about the methodology of error collection as they do about the occurrence of error types. For example, errors affecting a word or a phrase are more common in the collection than errors affecting sentences. Yet this difference could simply reflect listeners’ preferences for reporting shorter errors, which are easier to remember.

Phonetics The majority of slips of the ear affect a word or a short phrase. Consequently, the majority of slips involve a discrepancy between the target utterance and the perception at the level of phonetics or phonology. Describing the differences between the said and the heard provides insight into the salience of various features of phonetic structure. Slips of the ear also show implicit knowledge of the phonology of casual speech.

Homophones and near homophones A small proportion of slips of the ear do not involve any actual misperception of phonetic properties. Rather, listeners are accurate in recovering the phonetic sequence produced by the talker but misinterpret what they hear; that is, they select the wrong homophone from their mental lexicon. For example: (4) Sirius XM radio → serious XM radio courts → quarts nones → nuns In another example of accurate phonetic information but erroneous lexical selection, the listener interpreted the proper name Hannah Graham as a newly coined word, a blend of Hannah and telegram, with the interpretation being a telegram sent by Hannah. Lexical items that contain medial flaps are also homophonous for almost all Americans, as for example: (5) Sweetish → Swedish Bertie → birdie

Because American English word-final fricatives tend to be only partially voiced, the target word and its misperception are probably homophonous in: (6) Her niece is in the hospital → Her knees . . . The recovery of homophonous words suggests that listeners quickly connect a phonetic representation with the lexicon, with few or no constraints from contextual plausibility or word frequency. Vitevitch (2002) has also argued for this position. More common than selecting the incorrect homophone are misperceptions in which the phonetic sequence is perceived correctly but a word boundary is added or lost; examples are given in (7). Phonetically, listeners tend to misplace word boundaries in relationship to unstressed syllables. The recovered words lead to unexpected meanings or even nonwords, for example leppo; the incongruous word then causes listeners enough puzzlement to comment on the misperception. (7) Aleppo → a leppo Wall and Associates → Walland Associates the ultimate in convenience → the ultimate inconvenience hair events → Harry Vents Aleve → a leave cry of destiny → cryo destiny On occasion, relatively accurate recovery of segments will lead to a very far-fetched perception, as in: (8) myxomycetous → mix of my CDs pet hair and dirt → pet heron dirt

Vowels and stress Misperceptions affecting stressed vowels or prosodic patterns are relatively rare. Some vowel misperceptions are associated with phonetic structure, others with dialect variation; still other misperceptions show no obvious triggering mechanisms. Vowel misperceptions tend to involve vowel height and tenseness rather than the front–back dimension; they are often associated with consonants that affect vowel quality, particularly nasals and the liquids /l, r/. Labov (1994) has made a similar observation in describing his collections of slips of the ear. The examples in (9) are typical: (9) Elf → Alf Jane → Jean You can get it from wool → . . . wall snow pea → Snoopy I’m going to go down stairs and do some laminating → . . . lemon eating Dialect differences are involved in some vowel misperceptions. For example, in the misperception special heard as spatial, the talker was native to southeastern

Slips of the Ear  271 Ohio, a dialect in which tense vowels are used before palatal sibilants; the listener was from northern Ohio. The listener heard the tense vowel accurately in the talker’s production, but failed to recover the intended word. A similar misperception involved an unstressed syllable. In the conversation, a listener from Ohio interpreted a British talker’s pronunciation of hookah as hooker. In another example involving dialectal differences, the talker, who is from North Carolina, gives the place name Wattsville. The listener is apparently trying to compensate for the monophthongization characteristic of Southern speech and reported that he heard the place name as Whitesville. Sometimes vowels peek through even when many other properties of an utterance are misperceived. Some examples of accurately or nearly accurately perceived vowels in complex errors are given in (10): (10) That’s grist for the mill → . . . Christopher Mills a case of wine → to paint the blind How can you get rid of paint cans? → . . . pink hands? immense ethereal gulf → immense empirical goat However, vowels are not immune to misperception in complex errors, as in (11): (11) parcheesi set → parmesan cheese I have a headache → I haven’t had any What else? → mayonnaise never enough → mother him up Misperceptions affecting stress placement are less common than errors affecting stressed vowels, though there are a few examples: (12) giving an award → giving an oral usual grace period → illegal grace period psychic → psychotic Louisiana → New Zealand she isn’t saying → she’s insane cherry festival → Chinese These misperceptions involve not only stress but considerable restructuring of other properties of the utterance such as vowel quality, consonant quality, word boundaries, and number of syllables. Because the overall stress pattern and associated vowel quality seem to be quite resistant to misperception, we have suggested that they provide a primary scaffolding which is then filled in, with consonants and unstressed syllables serving in a subsidiary role. However, it is worth adding the caution that the acoustic specification of lexical stress, the perception of stress, and the role of stress in specifying lexical items is anything but simple and straightforward (see Cutler, 2005).


Consonants Consonants are much more likely than vowels to be affected in slips of the ear, in both simple and more complex misperceptions. Listeners may simply not report consonants – that is, they were not perceived at all; listeners may report spurious consonants, or they may misperceive consonant quality along any dimension. Consonants present in the target utterance may be lost in any position within a word. Final consonant loss is the most common, probably because American English speakers use little force in articulation at the ends of words. Also, longer words with a final consonant missing can become a legitimate shorter word. Several examples are given in (13): (13) air assault →aerosol sex and insects → sex and incest card → car when their condition → air condition some rice → some ice What are those sticks? → . . . those ticks? Wilmington → Willington insufficient → inefficient studying Javanese internally → eternally Spurious consonants may also appear in any position within a word, as in (14): (14) slip of the ear → slip of the year too much air → too much hair It would hurt it → . . . pervert it What kind of pans did you use? → . . . pants? Rudal → Rudolf the white sauce ladies → . . . white socks . . . It will be a tenting situation → . . . tempting . . . doggie → donkey red spire pears → red spider pears Consonant quality misperceptions are the most commonly attested. Any one of the traditional features  –  manner, place, voicing  –  can undergo a perceptual substitution, though there is a tendency for obstruents to be replaced by obstruents and for resonants to be replaced by resonants. Some misperceptions of obstruent targets are given in (15): (15) constraint‐based phonology → straight‐faced phonology back window → tack one doe cloudy →ploddy flu shots → blue shots pink hair → think hair big classes → glasses

Slips of the Ear  273 Some misperceptions of resonant targets are shown in (16), not all by other resonants: (16) Get me a mine → Get me some wine warm strangers → born strangers loan word → long word a small computer → a smoke computer reed → wheat Misperceptions affecting consonants are quite diverse. Some misperceptions are dependent on phonetic factors such as acoustic similarity or weak articulation; others are related to misperceptions of word boundaries. Still other consonant misperceptions, like vowel misperceptions, do not have any obvious explanation.

Not much phonetic resemblance At the other extreme from homophones which simply lead to erroneous lexical retrieval are misperceptions in which much of the phonetic information is reported incorrectly. Some examples are shown in (17): (17) post doc → co‐star Q‐Tip → toothpick expletives → technical tips We don’t wear bangles → . . . fear vandals a case of wine → to paint the blind fish sticks → strip steaks

Well‐formedness Even though misperceptions can result in extensive mismatch between the spoken utterance and the perception, perceived utterances tend to be phonologically well formed. Almost all misperceptions are composed of English vowels and consonants. The only example of a non‐English consonant came from an anthropological linguist, who reported that she perceived Patwin as Paʔwin; possibly her experience speaking and listening to languages with a wide variety of consonants promoted the misperception.3 The phonotactic structure of syllables fit the English pattern. For example, in the word‐boundary error from this guy perceived as from the sky, the listener reported a voiceless stop /k/ for the voiced /g/ in the cluster with /s/, as English requires. When presented with consonant sequences not permissible in English, listeners reported an acceptable sequence instead: (18) tlumbering → klumbering Sruti → Trudy The phonetic and phonological structure of a language puts constraints on what listeners perceive or report.
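The observation that reported percepts respect English phonotactics can be made concrete with a small illustration. The Python sketch below is not a model drawn from this chapter; the inventory of legal onsets, the consonant set, and the one-consonant repair strategy are simplifying assumptions introduced purely for the example. It checks a word-initial cluster against a toy list of permissible English onsets and, when the cluster is illegal, lists the legal onsets that differ from it by a single consonant – the kind of repair seen in (18), where /tl/ was reported as /kl/.

```python
# Toy phonotactic filter: the onset inventory, consonant set, and repair
# strategy are illustrative assumptions, not claims about actual listeners.
LEGAL_ONSETS = {
    "pl", "pr", "bl", "br", "tr", "dr", "kl", "kr", "gl", "gr",
    "fl", "fr", "sl", "sp", "st", "sk", "sm", "sn", "sw", "tw",
}
CONSONANTS = "ptkbdgfvszmnlrw"

def repair_onset(onset: str) -> list[str]:
    """Return the onset itself if legal, otherwise legal onsets one consonant away."""
    if onset in LEGAL_ONSETS:
        return [onset]
    candidates = set()
    for i in range(len(onset)):
        for c in CONSONANTS:                 # try swapping one consonant at a time
            alternative = onset[:i] + c + onset[i + 1:]
            if alternative in LEGAL_ONSETS:
                candidates.add(alternative)
    return sorted(candidates)

# /tl/ is not a possible English onset; the repairs include /kl/, which is
# what the listener in example (18) reported for "tlumbering".
print(repair_onset("tl"))
print(repair_onset("st"))    # already legal, returned unchanged
```

The point of the sketch is simply that a percept can be filtered through the phonotactic grammar before it is reported; nothing in it is meant to decide between rule-based and pattern-matching accounts.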

Comparing the talkers' intended utterances with the listeners' perceptions allows us to infer relatively stable phonetic properties. Vowels and associated prosodic patterns are acoustically prominent and tend to be perceived accurately. It seems plausible that listeners use the information conveyed by vowels to find likely word candidates in the lexicon. Consonant substitutions reflect phonetic categories, such as obstruents versus resonants. Consonant loss is most common in word-final position, whereas consonant additions sometimes seem to be associated with word-boundary assignments. Though there is dispute about the details, there is also considerable evidence that consonants and vowels have somewhat different functions in speech and language (e.g., Owren & Cardillo, 2006; Moates & Marks, 2012). Most clearly, slips of the ear are constrained by what the phonology of the language allows.

Casual speech

In descriptions of casual speech or casual conversation, one property is paramount: variation. For example, Tucker and Ernestus (2016) report that in the Switchboard corpus, the word "that" was pronounced in 117 different ways; the word "and" had 87 variants, and the word "people" 21 different pronunciations. Of course, it is difficult to say how great differences have to be to influence perception. Yet clearly the pronunciation of words varies extensively, and variant pronunciations can be linked to slips of the ear. One way to show the connection is from the point of view of reductions typically present in casual speech. Shockey (2003) described phonological reductions commonly found in American English. Many of these reductions appear to be triggers for slips of the ear. When they encounter an utterance characterized by phonological reductions, listeners follow one of two paths. Either they accept the reduction at face value, that is, their perception is determined by phonetic accuracy rather than by speaker intent, or they implicitly reverse or correct for the phonological reduction and report perceiving an utterance that they believe matches the speakers' intent. These corrections indicate that listeners employ knowledge of phonological variation and possible reductions. (We have discussed the relationship between casual speech reductions and perceptual errors in Shockey and Bond, 2007, 2014.) Lost consonants  Word-final sequences /nd/ or /nt/, as in wind or lent, are almost always pronounced without the stop. Sometimes listeners report a word with a final nasal, indicating no awareness of the speaker's intention. In the reverse misperception, listeners recover a stop, never intended by the speaker: (19) an exam at Kent State → . . . Wayne State round trip → one trip She writes comments on our papers → . . . writes comets . . . Finn → friend Creek Inn was → Creek End was I can see you at four → I can't see you at four

Word-final /s/ followed by a stop may be reduced to the fricative alone, as in firs[t] place. Listeners may perceive a word based on the reduced phonetic form alone. In the reverse, listeners supply the stop. (20) I just like it → . . . dislike it West look → bus look honors political science → honest political science you can weld with it – braze → braids goes, like → ghostlike Weak closure or constriction  The American English labio-dental fricatives /f v/ are fairly weak acoustically and may not be pronounced at all, for example in lots of as [lɑtsə]. This reduction leads to misperceptions either based on accurate phonetics or on a recovered fricative: (21) double life → double lie floor of the house → Florida house parachute → pair of shoes moos → moves Flapping (or tapping) is probably one of the most prominent characteristics of American English pronunciation. Just as in other reduction processes, perceptual errors can go in both directions. Listeners interpret the flap as coming from a word with either a voiced or voiceless stop (assuming listeners' mental representations of these words contain stops). (22) Sweetish → Swedish Patting → padding Bertie → birdie Mrs. Winner → Mrs. Winter Weak or abbreviated closure for stops sometimes leads to turbulent air flow, just sufficient for stops to be perceived as fricatives. The reverse misperceptions suggest that listeners may overcorrect and report a stop for a fricative: (23) Those are mating gibbons → givens Laboratory situations → lavatory Long line of pillagers → villagers Belvita → Velveeta Oval → opal Love it → Lubbock Velarization  In word or syllable final position, /l/ is velarized and often pronounced as a high back vowel of some kind; that is, there is no tongue tip contact with the alveolar ridge. Listeners report a vowel or, alternatively, interpret a vowel as a velarized lateral: (24) glottal wave → auto wave

276  Perception of Linguistic Properties mail → mayo Emil [iml] → [imʊ] meadow muffin → metal muffin Hawbecker → Holbecker Delwo → Delwol /ə/ reduction  Some talkers omit /ə/ and other reduced vowels, as in the common pronunciation /spoz/ for suppose. Misperceptions may reflect this reduction or extra syllables may result from compensating for a supposedly lost syllable. (25) a process of residence selection → residence lection Support services → sport services About some follow‐up → some foul‐up Two models of speech perception → two miles Leena → Elena It would hurt it → it would pervert it Wrangler → regular Do you want me to recite? → recycle Palatalization  A casual speech reduction that does not seem to be implicated in many misperceptions is commonly known as palatalization, exemplified by pronunciations such as [mɪʃu] for miss you or [wɑtʃə duɪn] for What are you doing? In our data, we have found no misperceptions clearly associated with palatalization. The closest would be: (26) Tulane → chulane Gallons and gallons of coffee → jallons and jallons /dʒælənz/ Middle of drill field → Jill field In these, an alveolar stop is perceived as a palato‐alveolar affricate in a phonetic environment in which palatalization might be expected. It is also possible, however, that the misperception Tulane → Chulane is the reverse of an example in (20) I just like it → I dislike it, classified as consonant loss.4 No observed misperceptions fit the classic examples of palatalization which would affect sequences such as miss you or what are you doing? One characteristic of slips of the ear involving phonetics and phonology is that (with the one exception) listeners never report hearing anything that contains non‐English consonants or vowels. The same constraint is also true of the phonotactic structure of syllables in that misperceptions conform to English syllable structure. Although many of the friends and colleagues who have contributed slips of the ear are native speakers of languages other than English or were extremely familiar with other languages, only one introduced a non‐English segment into a misperception. Possibly this is a result of reporting; it would be awkward to produce an English phrase containing non‐English segments. It is also possible, however, that phonetics and phonology impose strong constraints on what is a possible percept and therefore on possible misperceptions.

Slips of the Ear  277 A second characteristic that emerges from examining the phonetics and phonology of slips of the ear is that listeners use knowledge. When listeners compensate for a supposed phonological reduction, they are using their knowledge of phonology, whether represented by algorithms or by matching to features or patterns.
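One way to picture such knowledge algorithmically is as a small set of invertible reduction rules. The Python sketch below is a minimal illustration under stated assumptions – a toy lexicon, orthographic stand-ins for phonetic forms, and only two of the reductions discussed above (loss of a stop after a word-final nasal, and loss of /t/ after /s/) – and is not a model proposed in this chapter. Given a phonetically faithful "heard" form, it returns the face-value reading together with any readings obtained by undoing a reduction, keeping only those that exist in the lexicon; this mirrors the two paths listeners were described as following.

```python
# Toy lexicon and spelling-based rules: illustrative assumptions only.
LEXICON = {"win", "wind", "can", "cant", "just"}

def candidate_readings(heard: str) -> set[str]:
    """Face-value reading plus readings that reverse a casual-speech reduction."""
    candidates = {heard}                      # path 1: accept the form as heard
    if heard.endswith("n"):                   # path 2a: restore a stop lost after a nasal
        candidates.update({heard + "t", heard + "d"})
    if heard.endswith("s"):                   # path 2b: restore a stop lost after /s/
        candidates.add(heard + "t")
    return {c for c in candidates if c in LEXICON}

print(candidate_readings("win"))   # {'win', 'wind'}: context must decide between them
print(candidate_readings("can"))   # {'can', 'cant'}: the "I can see you" case in (19)
print(candidate_readings("jus"))   # {'just'}: only the restored form is an existing word
```

Whether listeners really apply explicit rules or instead match stored pronunciation variants is exactly the question left open in the text; the sketch only shows that either mechanism must deliver several lexical candidates from one reduced form.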

The shape of words Vitevitch (2002) shows that words that serve as targets and words that serve as replacements in slips of the ear do not differ on many characteristics in the lexicon. He compared the actual utterance to the perceived utterance using number of syllables, number of phonemes, word familiarity, word frequency, neighborhood density, and neighborhood frequency as the dependent variables. For all of these measures, he found no significant differences overall between the actual and the perceived utterances. Vitevitch suggested that the processing system selects the representation that best matches incomplete or erroneous input rather than catastrophically halting. This analysis is consistent with the idea that properties that are resilient to misperception serve as “islands of reliability” in retrieving words from the lexicon.
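The kind of item-by-item comparison Vitevitch carried out can be sketched as follows. The code is only an illustration, not his procedure: the tiny word list, the use of spelling in place of phonemic transcriptions, and the definition of a lexical neighbor as a word exactly one edit away are all assumptions made for the example.

```python
# Compare a few target/percept pairs on word length and neighborhood density.
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def density(word: str, lexicon: set[str]) -> int:
    """Number of lexical neighbors: words exactly one edit away."""
    return sum(1 for w in lexicon if w != word and edit_distance(word, w) == 1)

LEXICON = {"card", "car", "care", "cart", "cord", "bar", "far", "cat",
           "reed", "wheat", "read", "red", "seed", "heat", "sheet"}

# Target word (said) paired with its replacement (heard), as in the examples above.
slips = [("card", "car"), ("reed", "wheat")]
for said, heard in slips:
    print(said, heard,
          len(said), len(heard),                             # compare lengths
          density(said, LEXICON), density(heard, LEXICON))   # compare densities
```

Vitevitch's finding was that, over a real corpus and real lexical statistics, such paired measures do not differ systematically between targets and percepts; the sketch only shows what a single paired comparison looks like.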

Word boundaries

Word boundaries may appear or disappear without any other appreciable phonetic changes, as in (27): (27) I haven't seen Alone → a loan piece of pie → pizza pie acute back pain → a cute back pain Hazelhurst → Hazel Hurst pet hair and dirt → pet heron dirt cinema and photography → cinnamon photography powerful and accurate shot → inaccurate shot a neuroma → an aroma As these slips show, word boundaries can be lost, added, or shifted, often in connection with function words. A short initial unstressed syllable may be perceived as an article (a loan, a cute) or an unstressed function word may become an extra syllable in a longer word (pizza, heron). When slips involve not just word boundaries but numerous other properties, quite extensive mismatches are possible, as in (28): (28) setting up of time → studying of time I get to leave this place → I can't believe this place one cup of weak coffee → one Cocoa Wheat puff
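The word-boundary slips in (27) reflect a genuine segmentation ambiguity: the same stretch of speech can often be carved exhaustively into lexical words in more than one way. The Python sketch below makes the point with a toy lexicon and ordinary spelling in place of a phonetic transcription (both assumptions for illustration only); it enumerates every parse of a boundary-less string into dictionary words, returning both the intended and the misheard segmentation of acute back pain.

```python
# Enumerate all exhaustive parses of a string into words from a toy lexicon.
def segmentations(s: str, lexicon: set[str]):
    """Yield every way of carving s exhaustively into lexicon words."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in lexicon:
            for rest in segmentations(s[i:], lexicon):
                yield [head] + rest

LEXICON = {"a", "cute", "acute", "back", "pain"}

# The same stretch of speech supports both the intended and the misheard parse.
print(list(segmentations("acutebackpain", LEXICON)))
# [['a', 'cute', 'back', 'pain'], ['acute', 'back', 'pain']]
```

Nothing here chooses between the parses; that is presumably where stress, context, and expectation come in.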

It's worth adding that misperceptions may have more or fewer syllables than the targets, that is, the number of syllables is not particularly stable, as in (29): (29) the most literate → the most illiterate psychic → psychotic Dover sole → Dover salad orgasm → organism support services → sport services unmasking of the spying → spine

Nonwords

The lexicon does not constrain possible perceptions in that listeners report perceiving nonwords. In fact, sometimes nonwords suggest to a listener that a misperception has occurred. Some examples of reported nonwords are: (30) phone → thone /θon/ the article → the yarticle sitter problems → sinter problems Paula played with Tom → polyp laden tham /θɑm/ barbell set → bord la sell We even have an example of a nonword perceived by an 11-year-old child, beagle heard as beable; the child asked what beable means. That listeners can easily encode and report nonwords is consistent with the argument that lexical representations involve segments.

Order of segments Some slips of the ear suggest that the order of segments from target word to perception is not stable. We can view errors in segment order as a result of mistakes in using the overlapping articulatory information for consonants and vowels. We can also think of lexical representations as analogous to a rope in which the properties of segments are distributed over a shorter or longer portion. The misperceptions in (31) seem to involve changes in the order of segments: (31) frothing → throfing found a copy → pound of coffee spun toffee → fun stocking There’s a wasp trapped in the blinds → There’s a swap wrapped on the line

Function words In casual speech, function words are typically unstressed and reduced, with a great deal of variation in pronunciation (see Tucker & Ernestus, 2016). In slips of

the ear, function words show little stability, being involved in word-boundary misperceptions as well as in word substitutions. In (32), function words seem to appear and disappear in order to approximate reasonably well-formed phrases: (32) change for a dollar → exchange a dollar She wants to be a teacher → She wants me to teach her I don't intend to stay in the picture → . . . to stain the picture this friend of ours who visited → . . . is an idiot I don't think of the bass as a solo instrument → . . . as so low an instrument

Syntax and semantics Slips of the ear that involve syntax in some way are not particularly common, requiring at least a phrase to be detectable. Still, there are enough data to make a few observations. As noted earlier, many slips of the ear appear to be well formed, perhaps through correction in reporting. On occasion, however, listeners report ungrammatical sentences. These slips tend to depend on interpreting a word as a different part of speech from what was intended: (33) we offered six → we Alfred six I’m conking out → I’m coffee out I think I may have found a bug → I think I me a found a bug I’ll catch my breath here → I’ll catch my brush up In the first example, the listener misinterpreted the verb offered as Alfred; a similar misreading of a verb as a noun occurred in coffee for conking. The third example shows a repetitive pronoun apparently based on misperceiving may as me. Misinterpreting the part of speech of words is the most common type of slip related to syntactic structure. For example in the misperception (34), the listener reinterpreted the noun teacher as a verb plus object, apparently adding the required word me to maintain well‐formedness. (34) She wants to be a teacher → She wants me to teach her Other example of misperceptions that hinge on misinterpreting part of speech include: (35) some tea with this → to eat with us Yankees’ four–two win → Yankees forced to win a star in the East → I start to eat a little kid on a skateboard → a little kid escaped John’s nose is on crooked → knows his own cooking Each of these misperceptions also shows accompanying modifications of function words that result in well‐formed utterances.

280  Perception of Linguistic Properties In almost all misperceptions, the overall phrasing marked by the intonation contour and the sentence stress pattern seems to be stable, supporting the conjecture that prosody serves as a scaffolding or framework that allows for further syntactic analysis. However, even though it is associated with intonation, the information specifying the grammatical function of sentences is not stable. Listeners reported misperceptions that did not correspond to the intent of the speaker’s utterance. For example, in the misperception Islamabad → Is his Lama bad? the listener reported an interrogative for a declarative sentence. In the misperception Where are my clothes? → germaphobe the reverse took place, whereby an interrogative was reported as one word. The interrogative Where are your jeans? was perceived as a command, Weal your jeans, by a child. These examples imply that the function of sentences is not necessarily reliable information. Two listeners have reported experiencing ambiguous interpretations of sentences while listening to speech. These may offer an insight into sentence understanding. In (36) the listener commented that she immediately saw an ambiguity: there are two different ways of interpreting the utterance, depending on whether the word spring is interpreted as a verb or as a noun: (36) Daffodils are blooming on our set Hope springs eternal / Hope spring’s eternal In (37) the listener heard a news report: (37) . . . called for the head of NSA . . . to testify Her first interpretation was that the head of NSA was being asked to resign. When she heard the phrase to testify, she changed her interpretation of the sentence. The two examples in (38) depend on a slip of the ear with the listener interrupting for clarification: (38) I had this appointment → . . . disappointment? toucan → two can (child 5 responds, ‘Two can what?’) In the first example in (38) the listener interrupted because he misinterpreted the phrase this appointment as the word disappointment, and was concerned about the problem facing the talker. In the second example, a child interpreted a word not familiar to him as a phrase requiring a continuation and asked for the expansion. These observations imply that listeners – even five‐year‐olds – develop a semantic interpretation of a phrase quickly, from partial information. It seems very likely that listeners become aware of slips of the ear because what they hear is unexpected, incongruous, or wonderfully weird. There are quite a few examples of slips that fit these categories. For example: (39) There’s a wasp trapped in the blinds → There’s a swap wrapped on the line May the Force be with you → Metaphors be with you

immense ethereal gulf → immense empirical goat after the rubber boat had been wrecked in the squall → after the rubber boot had been erected in the squirrel I seem to be thirsty → I sing through my green Thursday my interactive Pooh → Mayan rack of Pooh That listeners report hearing novel and unexpected utterances should not be surprising. One of the basic insights in linguistics is that talkers can say almost anything: the unexpected, the incongruous, or the wonderfully weird. Consequently, listeners would be expected to be open to hearing and understanding such utterances. In a few slips of the ear, the target word and the misperception seem to occupy the same semantic domain even though they show considerable phonological discrepancies. This is especially true of proper names. For example: (40) Celtic River → Kelton River Sruti → Trudy Ali Salim → Alisa Lim Stockholm → Scotland Latgalian → Hungarian Athens → Akron pathology → psychology It's ten till nine → It's Central Time It is possible that the general context of the conversation provided enough information for listeners to develop expectations about what was to come. Grammatical knowledge seems to place various linguistic constraints on perceptual errors, but not all levels of the grammar constrain them to the same degree. Phonetics and phonology are the most rigid. Listeners do not report non-English consonants or vowels, nor do they report non-English syllables. Words are somewhat less rigidly defined. Nonwords appear in perceptual errors but are almost always well formed. Syntax also has fewer constraints than phonetics and phonology. Function words are added or omitted quite freely. Nongrammatical sentences appear, though there is a strong tendency to respect constituent boundaries and to have perceptual errors follow the overall prosodic pattern of the utterance. Semantics – meaning – and pragmatics seem to allow almost any interpretation of an utterance.

Slips of the ear in other languages All the examples of slips of the ear so far have come from conversations in American English. To the best of my knowledge, the only other corpus of slips of the ear comes from German. I have managed to collect just a handful of slips in Latvian. Yet cross‐linguistic data would be immensely valuable. Although slips of the ear clearly show that linguistic knowledge is invoked in language understanding, it is not

entirely clear how to distinguish language-specific from language-universal processes. We can make judgments about specific instances – that syllable structure is language specific, for example. But many other characteristics of slips cannot be clearly attributed; for example, casual speech assimilations and reductions can be found in many languages. It is worth examining slips of the ear in German and in my few Latvian examples. Hopefully, the SEAR Project website (http://searproject.org), organized by K. Tang, will provide an outlet for slips of the ear in many languages.

German Rudolf Meringer, working with his colleague Karl Mayer, published two volumes of errors in the use of language, speaking, writing, listening, and reading (Meringer, 1908; Meringer & Mayer, 1895). His corpus of slips of the ear in ordinary conversation consists of 28 examples, discussed in detail by Celce‐Murcia (1980). The following examples of misperceptions affecting words and translations are from her work: (41) Beruf → Geruch  [occupation → smell] Pferd → Bär  [horse → bear] Sarg → Sack  [Coffin → sack] The misperception of horse and bear is another one of the examples in which a target word and misperception share a semantic domain. There are also misperceptions that show a greater discrepancy between the target word and the misperception: For example: (42) Verdruss oder Kummer → Durst oder Hunger  [dismay or sorrow → thirst or hunger] Jünger ist → Hühner isst  [younger is → chicken eat] In his analysis of the misperceptions, Meringer suggested that vowels, especially stressed vowels, are typically perceived accurately whereas consonants tend to be more prone to error. In her analysis, Celce‐Murcia points to other contributing factors, such as dialect differences, proper nouns, foreign words, idiomatic expression, and the context of conversations. The German misperceptions do not introduce any characteristics that could be interpreted as uniquely German. It is possible, however, that slips involving morphology or syntax would show the use of language‐specific linguistic knowledge.

Latvian The set of misperceptions from Latvian is quite small, and all of them involve a word or short phrase. Latvian uses syllables with complex onsets and codas;

Slips of the Ear  283 almost all words are stressed on the initial syllable. Structurally, Latvian has grammatical gender and employs complex nominal and verbal inflections. The slips are given in traditional orthography in which a macron marks a long vowel. The majority of slips affected consonants and one of the consonant misperception resulted in a nonword. (43) Gaiga → Baiba  [personal names] vēstule → mēstule [letter → nonword] tuncis → duncis  [tuna → dagger] vēsture → vēstule [history → letter] Only one misperception affected a target vowel. The talker’s utterance was the English proper name Hillary in the context Hillary Clinton, perceived as Hillaru. The listener may not have been familiar with the name, and the talker may also have pronounced the final vowel indistinctly. There were no other simple vowel misperceptions. As we would expect, there are also misperceptions in which the correspondence between the target utterance and the perception showed a greater discrepancy in both consonants and vowels. Several also showed the tendency for misperceptions to occupy the same semantic field, here proper names and words referring to time. The misperception of the word slave creates a sentence that is not well formed. Although several word substitutions have inflectional case suffixes, the case forms for both the target word and the misperception are the same and appropriate, hinting that inflectional structure might show some stability. (44) Strādāt svēdienā ir grēks → vērgs  [work on Sunday is sin → slave] vienu minūti → vienu mēnesi  [one minute → one month] Sigulda → Sigita  [place name → personal name] nevar būt par daudz kopijas → kafijas  [can’t be too many copies → coffee] nosalusi → nozagusi  [frozen → stolen] In (45) one misperception is a word‐boundary error while the other two involve extra syllables forming words. (45) Imants Krastiņš → Imants Skrastiņš  [proper names] mīlas lietas → mīlestības vēstule  [lovely things → love letter] lieliska izvēle → liela skaista izvēle  [excellent selection → large beautiful selection] These samples of slips of the ear are very small in number, so any generalizations must be made with caution. Yet the overall impression of these slips is that the speech perception/language understanding processes from similar to what we have seen in English: fewer vowel than consonant misperceptions; some simple and some more complex substitutions; and word boundaries. As it stands, it seems that some sources of misperceptions may be language universal.


Conclusion If everything has a moral, what is the moral to be drawn from slips of the ear? Because of the many sources of bias that can potentially influence the data, we must proceed with care. Slips of the ear give us a glimpse of language use under ordinary circumstances, and allow us to develop a clearer understanding of what listeners do. At the very least, we can see listeners accessing words and phrases quickly from partial information and using linguistic knowledge as they do so.

NOTES
1 The term mondegreen was coined in 1954 from a misperception of a line in a folk song, "Oh, they have slain the Earl o' Moray and laid him on the green" → . . . and Lady Mondegreen.
2 Examples include some famous lyrics such as "The ants are my friends. They're blowing in the wind" (Bob Dylan); "'Scuse me while I kiss this guy" (Jimi Hendrix); "Row, row, row your boat . . . Life's a butter dream" (traditional). They can be found on numerous websites, and have been noted by Vitevitch (2002) and Edwards (1995).
3 Glottal stops as well as other consonants and vowels can be present phonetically but not function contrastively in English.
4 I want to thank one of the editors for this observation.

REFERENCES Bagley, W. C. (1900–1901). The apperception of the spoken sentence: A study in the psychology of language. American Journal of Psychology, 12, 80–130. Bock, K. (1996). Language production: Methods and methodologies. Psychonomic Bulletin & Review, 3, 395–421. Bond, Z. S. (1973). Perceptual errors in ordinary speech. Zeitschrift für Phonetik, 26, 691–695. Bond, Z. S. (1999). Slips of the ear: Errors in the perception of casual conversation. San Diego, CA: Academic Press. Bond, Z. S. (2005). Slips of the ear. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 290–310). Oxford: Blackwell.

Bond, Z. S., & Garnes, S. (1980a). Misperceptions of fluent speech. In R. A. Cole (Ed.), Perception and production of fluent speech (pp. 115–132). Hillsdale, NJ: Lawrence Erlbaum. Bond, Z. S., & Garnes, S. (1980b). A slip of the ear: A snip of the ear? A slip of the year. In V. A. Fromkin (Ed.), Errors in linguistic performance: Slips of the tongue, ear, pen, and hand (pp. 231–239). New York: Academic Press. Bond, Z. S., & Robey, R. R. (1983). The phonetic structure of errors in the perception of fluent speech. In N. J. Lass (Ed.), Speech and language: Advances in basic research and practice (pp. 249–283). New York: Academic Press.

Bond, Z. S., & Small, L. H. (1983). Voicing, vowel and stress mispronunciations in continuous speech. Perception & Psychophysics, 34(5), 470–474. Celce-Murcia, M. (1980). On Meringer's corpus of "slips of the ear." In V. A. Fromkin (Ed.), Errors in linguistic performance: Slips of the tongue, ear, pen, and hand (pp. 199–211). New York: Academic Press. Cole, R. A., & Jakimik, J. (1978). Understanding speech: How words are heard. In G. Underwood (Ed.), Strategies of information processing (pp. 67–116). London: Academic Press. Cutler, A. (2005). Lexical stress. In D. B. Pisoni & R. E. Remez (Eds.), The handbook of speech perception (pp. 264–289). Oxford: Blackwell. Edwards, G. (1995). 'Scuse me while I kiss this guy and other misheard lyrics. New York: Fireside. Ferber, R. (1991). Slip of the tongue or slip of the ear? On the perception and transcription of naturalistic slips of the tongue. Journal of Psycholinguistic Research, 20, 105–122. Garnes, S., & Bond, Z. S. (1975). Slips of the ear: Errors in perception of casual speech. In Proceedings of the Eleventh Regional Meeting of the Chicago Linguistic Society (pp. 214–225). Chicago: Department of Linguistics, University of Chicago. Labov, W. (1994). Principles of linguistic change: Vol. 1. Internal factors. Oxford: Blackwell. Lahiri, A., & Marslen-Wilson, W. (1991). The mental representation of lexical form. Cognition, 38, 245–294.

Lahiri, A., & Reetz, H. (2010). Distinctive features: Phonological underspecification in representation and processing. Journal of Phonetics, 38, 44–59. Meringer, R. (1908). Aus dem Leben der Sprache: Versprechen, Kindersprache, Nachahmungstrieb. Berlin: B. Bahr. Meringer, R., & Mayer, K. (1895). Versprechen und Verlesen: Eine psychologisch-linguistische Studie. Stuttgart: G. J. Göschen'sche. Moates, D. R., & Marks, E. A. (2012). Vowel mutability in print in English and Spanish. Mental Lexicon, 7(3), 326–335. Owren, M. J., & Cardillo, G. C. (2006). The relative roles of vowels and consonants in discriminating talker identity versus word meaning. Journal of the Acoustical Society of America, 119(3), 1727–1739. Shockey, L. (2003). Sound patterns of spoken English. Oxford: Blackwell. Shockey, L., & Bond, Z. S. (2007). Slips of the ear demonstrate phonology in action. In Proceedings of the XVI Congress of Phonetic Sciences (pp. 33–34). Saarbrücken: Saarland University Conference Unit. Shockey, L., & Bond, Z. S. (2014). What slips of the ear reveal about speech perception. Linguistica Lettica, 22, 107–113. Tucker, B. V., & Ernestus, M. (2016). Why we need to investigate casual speech to truly understand language production, processing, and the mental lexicon. Mental Lexicon, 11(3), 375–400. Vitevitch, M. S. (2002). Naturalistic and experimental analyses of word frequency and neighborhood density effects in slips of the ear. Language and Speech, 45(4), 407–434.

11 Phonotactics in Spoken-Word Recognition
MICHAEL S. VITEVITCH, University of Kansas, United States
FAISAL M. ALJASSER, Qassim University, Saudi Arabia

What are phonotactics? Phonotactics describes the phonological segments and sequences of phonological segments that are legal as well as illegal in a given language (Crystal, 1980). For example, /br/ is permissible word-initially in English but not permissible at the end of a word. The opposite is true for Arabic. One's (implicit) knowledge of phonotactics can influence how one perceives speech in a variety of ways. For example, Dupoux et al. (2011) found that, when presented with consonant clusters that are illegal in their native language, native speakers of Japanese and Brazilian Portuguese perceived an illusory vowel (i.e. no vowel was actually present in the speech signal) between the consonants in the illegal clusters, thereby making the illegal clusters into phonotactically legal sequences in their native language. Looking at only the segments and sequences of segments that are legal within a given language, one can further consider the phonotactic probability of those segments and sequences of segments. That is, some phonological segments and sequences of segments tend to be found in words in that language more often than other phonological segments and sequences of segments. For example, /p/ and /pæv/ occur often in English words and would be said to have high phonotactic probability, whereas /ʒ/ and /ðeʒ/ occur less often in English words and would be said to have low phonotactic probability. Similarly, in Arabic, /t/ and /tar/ occur often in Arabic words and would be said to have high phonotactic probability, whereas /dˁ/ and /dˁiʈˁ/ occur less often in Arabic words and would be said to have low phonotactic probability. Sensitivity to the probabilistic differences in the sequencing of

phonological segments has not only been observed behaviorally (as will be reviewed) but also in physiological measures such as magnetoencephalography (Pylkkänen, Stringfellow, & Marantz, 2002), electroencephalography (Hunter, 2013), and hemodynamics (Majerus et al., 2002).
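To make the notion of phonotactic probability concrete, the sketch below counts how often each segment occurs in a given position, and how often each two-segment sequence (biphone) occurs in a given position, across a toy lexicon, and sums those proportions for a target form. This is only a minimal illustration under simplifying assumptions: the mini-lexicon, transcription scheme, and function name are hypothetical, and published calculators such as Vitevitch and Luce (2004) additionally weight the counts by word frequency.

```python
from collections import defaultdict

def phonotactic_probability(word, lexicon):
    """Toy estimate of summed positional segment and biphone probabilities.

    `word` and every entry in `lexicon` are tuples of phoneme symbols,
    e.g. ('p', 'ae', 'v'). Counts are unweighted; real calculators
    typically weight by (log) word frequency.
    """
    seg_counts = defaultdict(int)   # (position, segment) -> count
    pos_totals = defaultdict(int)   # position -> number of words with a segment there
    bi_counts = defaultdict(int)    # (position, biphone) -> count
    bi_totals = defaultdict(int)    # position -> number of words with a biphone there

    for entry in lexicon:
        for i, seg in enumerate(entry):
            seg_counts[(i, seg)] += 1
            pos_totals[i] += 1
        for i in range(len(entry) - 1):
            bi_counts[(i, entry[i:i + 2])] += 1
            bi_totals[i] += 1

    seg_prob = sum(seg_counts[(i, s)] / pos_totals[i] for i, s in enumerate(word))
    bi_prob = sum(bi_counts[(i, word[i:i + 2])] / bi_totals[i]
                  for i in range(len(word) - 1))
    return seg_prob, bi_prob

# Hypothetical mini-lexicon in a broad phonemic transcription.
lexicon = [('p', 'ae', 't'), ('p', 'ae', 'v'), ('k', 'ae', 't'),
           ('m', 'ah', 'g'), ('dh', 'ey', 'zh')]
print(phonotactic_probability(('p', 'ae', 'v'), lexicon))
```

On this toy lexicon a form like /pæv/ receives higher summed probabilities than a form like /ðeʒ/, mirroring the high- versus low-probability contrast described above.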

Milestones in research on phonotactics Since the publication of Auer and Luce (2005), which summarized the work on phonotactic probability up to that point, much work has continued to examine how phonotactic probability influences various language processes, including spoken-word recognition, speech perception, word production, segmentation of words from fluent speech, the acquisition of language in infants, and the learning of new words by children and adults. We briefly summarize this work here by highlighting some of the seminal studies in this line of research, but for a more comprehensive review of this earlier work see Auer and Luce (2005). An early hint that the frequency with which an item is encountered in the environment might influence language processing came from work by Landauer and Streeter (1973), who examined (among other things) the distribution of phonemes that made up rare and common words. They found that some sounds, such as /n/, /l/, and /t/, occur more frequently in common words than in rare words, whereas other sounds, such as /z/, /p/, and /g/, occur more frequently in rare words than in common words. Although Landauer and Streeter discussed how their findings related to word-frequency effects that were being widely studied by memory researchers at the time, these findings also have implications for various language-related processes, as will be described.

Initial sensitivity to phonotactic patterns Sensitivity to phonotactic information develops very early in life. Jusczyk et  al. (1993) found that by nine months of age infants in English‐speaking environments preferred to listen to English words, whereas infants in Dutch‐speaking environments preferred to listen to Dutch words. This finding suggests that very early in life children are aware of which sounds are found in their native language and which sounds are not, even when the languages to be distinguished are quite similar. Not only can nine‐month‐old infants discriminate which sounds are legal in their native language and which are not but, perhaps even more amazingly, infants at the same age are sensitive to more fine‐grained information about their native language. That is, by nine months of age infants prefer to listen to nonwords containing high‐ rather than low‐probability segments and sequences of segments, demonstrating that infants are sensitive to how often certain sounds occur within their native language (Jusczyk, Luce, & Charles‐Luce, 1994). As will be described, being sensitive to differences in phonotactic patterns may provide language learners with an initial entry point into the native language.


Word segmentation and word learning Infants may use information about the frequency with which segments and sequences of segments occur in words to help them segment words from fluent speech. Saffran, Newport, and Aslin (1996) presented eight‐month‐old infants with streams of continuous nonsense speech for 2.5 minutes. Within the streams of nonsense speech were recurring patterns of consonants and vowels that varied in their transitional probabilities between syllables. Even with a short exposure to a stream of speech, the infants showed a preference for the high‐probability sequences of segments, which suggests that they may use this information to segment words from fluent speech and bootstrap into language. Evidence from Mersad and Nazzi (2011) suggests that phonotactic information continues to be used into adulthood (along with other cues) to segment words from fluent speech. In addition to helping the listener segment words from the stream of fluent speech, phonotactic information also plays a role in helping the listener learn new words. For example, Storkel (2001) found that children between three and six years of age learned consonant–vowel–consonant nonwords paired with novel referents more rapidly when the nonwords were composed of segments and sequences of segments that had high phonotactic probability rather than low phonotactic probability. The learning benefit for novel words with high phonotactic probability was still observed when the children were tested a week later (and has also been observed for nonword‐verbs; Storkel,  2003), suggesting that phonotactic information may not only be useful in segmenting words from fluent speech, but also aid in learning those words once extracted from the speech stream. Storkel, Armbruster, and Hogan (2006) examined the role of phonotactic probability in word learning in adults, and further distinguished the influences of phonotactic probability from those of a closely related variable  –  neighborhood density – on word learning. (For a review of the influence of neighborhood density on the perception and production of spoken words see Vitevitch and Luce, 2016.) Neighborhood density refers to the number of words that sound similar to a given word. A word with many similar‐sounding words, or phonological neighbors, is said to have a dense phonological neighborhood, whereas a word with few phonological neighbors is said to have a sparse phonological neighborhood. Vitevitch et  al. (1999) found that words with dense phonological neighborhoods tend to have high phonotactic probability, and that words with sparse phonological neighborhoods tend to have low phonotactic probability. Said another way, words with many phonological neighbors tend to be composed of common segments and sequences of segments, whereas words with few phonological neighbors tend to be composed of rare segments and sequences of segments. (For suggestions of how to deal with other variables that are confounded with phonotactic probability and neighborhood density see Storkel, 2004.) Although phonotactic probability and neighborhood density are correlated, as observed by Vitevitch et al. (1999), Storkel, Armbruster, and Hogan (2006) were able to demonstrate that these two variables had different influences on word learning in adults by creating novel word stimuli for a word‐learning task that had

high phonotactic probability and sparse neighborhoods (e.g. nεp), high phonotactic probability and dense neighborhoods (e.g. pim), low phonotactic probability and sparse neighborhoods (e.g. mug), or low phonotactic probability and dense neighborhoods (e.g. hif). The adults in the word-learning task learned a lower proportion of high-probability novel words than low-probability novel words, and learned a higher proportion of novel words with dense phonological neighborhoods than novel words with sparse phonological neighborhoods. They suggested that novel words with low phonotactic probability are unique and trigger the word-learning process, whereas novel words with high phonotactic probability resemble many already known words and therefore fail to trigger the word-learning process, and that this led to the novel words with low phonotactic probability being learned better than the novel words with high phonotactic probability. Once the word-learning process has been effectively triggered by a novel word with low phonotactic probability, neighborhood density acts on the fragile and nascent representation of the word form to be learned to try to integrate the representation of the newly learned word into the extant lexicon. Newly learned words that are similar to many known words (i.e. they have a dense phonological neighborhood) are learned better than newly learned words that are similar to few known words, or that have a sparse phonological neighborhood. To understand the advantage that a dense phonological neighborhood affords in the case of word-learning, consider the analogy of a rock climber clinging to the face of a cliff. The climber is more likely to stay on the cliff face if she is holding onto it with all of her fingers, and uses both feet and a rope to give her many points of attachment to the cliff. In this analogy the rock climber is the nascent representation of the novel word to be learned. The word will more likely stay attached to the cliff face of long-term memory if there are many points of contact with memory (i.e. the novel word is similar to many known words already stored in memory). Said another way, despite being a newly formed representation of a word form, a novel word with a dense phonological neighborhood can benefit from parts of the novel word overlapping with many known words in what is known as a "conspiracy effect" (Rumelhart, McClelland, & PDP Research Group, 1986), whereas a novel word with a sparse phonological neighborhood has less overlap with known words, and therefore does not benefit as much from the overlap that exists between novel and known words. For computational evidence of such conspiracy effects in word learning resulting from differences in neighborhood density, see Vitevitch & Storkel (2013) and Takac, Knott, & Stokes (2017). While it may seem that the resemblance of novel words with high phonotactic probability to many already known words should also benefit the learning of such words via a conspiracy effect (or provide many points of attachment to the memory cliff), the resemblance of such words to many known words is what leads to the word-learning process failing to trigger in the first place. Storkel, Armbruster, and Hogan (2006) discuss a computational model known as adaptive resonance theory (ART; Carpenter & Grossberg, 1987), which has mechanisms that are similar to the word-learning model proposed by Storkel et al. In ART a mismatch between the

input in the environment and representations stored in long-term memory serves to identify the input as novel and triggers the creation of a new representation in long-term memory for that novel input. The resemblance of words with high phonotactic probability to many already known words therefore acts against the learning of such words because the word-learning process is not triggered in the first place. Thus, even though phonotactic probability and neighborhood density are two variables that are related to each other, the work of Storkel, Armbruster, and Hogan (2006) shows that they serve different purposes, with phonotactic probability triggering the word-learning process and neighborhood density helping to consolidate the representation in memory. As we shall see, these variables also influence other language-related processes and in different ways.
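Because the contrast between high-probability/dense and low-probability/sparse items recurs throughout the rest of this chapter, it may help to make the usual operationalization of neighborhood density concrete: two words are commonly counted as neighbors if one can be converted into the other by substituting, adding, or deleting a single phoneme. The sketch below assumes phonemically transcribed words represented as tuples of symbols; the mini-lexicon and function names are hypothetical.

```python
def is_neighbor(w1, w2):
    """True if w1 and w2 differ by one phoneme substitution, addition, or deletion."""
    if w1 == w2:
        return False
    if len(w1) == len(w2):                      # substitution
        return sum(a != b for a, b in zip(w1, w2)) == 1
    if abs(len(w1) - len(w2)) == 1:             # addition/deletion
        longer, shorter = (w1, w2) if len(w1) > len(w2) else (w2, w1)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False

def neighborhood_density(word, lexicon):
    """Number of lexicon entries that are one-phoneme neighbors of `word`."""
    return sum(is_neighbor(word, entry) for entry in lexicon)

# Hypothetical mini-lexicon: cat, cap, at, scat, mug.
lexicon = [('k', 'ae', 't'), ('k', 'ae', 'p'), ('ae', 't'),
           ('s', 'k', 'ae', 't'), ('m', 'ah', 'g')]
print(neighborhood_density(('k', 'ae', 't'), lexicon))   # 3: cap, at, scat
```

A word such as cat, which belongs to a dense neighborhood in English, would also tend to be composed of common segments and biphones, which is the correlation between density and phonotactic probability noted above.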

Spoken‐word recognition in adults A number of studies have also shown that phonotactic probability can influence the extent to which a nonword will sound more or less like a real word in a given language. One early study (Vitevitch et al., 1997) created two‐syllable nonwords that varied in phonotactic probability, and found that listeners rated the nonwords with higher phonotactic probability as being more like an English word than nonwords with lower phonotactic probability. Subsequent work by Bailey and Hahn (2001) and by Frisch, Large, and Pisoni (2000), using different measures of phonotactic probability than those used in Vitevitch et al. (1997), found additional evidence that ratings of wordlikeness are influenced in part by phonotactic probability (as well as by the number of syllables in the word and the number of words in the lexicon that sound like the nonword), further demonstrating that adults are sensitive to phonotactic information. Although ratings of wordlikeness suggest that listeners possess some knowledge of the phonotactic probabilities of their language, this measure does not tell us anything about how listeners might actually use this knowledge to recognize spoken words. As a first step to examine how phonotactic probability might affect processing, Vitevitch and Luce (1998) used an auditory naming task in which participants hear a word or nonword and simply repeat it back as quickly and as accurately as possible (both reaction time and accuracy rate serve as dependent variables). Vitevitch and Luce predicted that sublexical representations would be most useful for nonwords, whereas lexical representations would be most useful for real English words. In the auditory naming task, participants may employ whichever representation best predicts the input. Vitevitch and Luce found that nonwords with high phonotactic probability were named more quickly and accurately than nonwords with low phonotactic probability, suggesting that sublexical representations were being used to process the nonwords. In the case of real English words, however, words with high phonotactic probability (and also dense phonological neighborhoods, as noted) were named more slowly and less accurately than words with low phonotactic probability (and

sparse phonological neighborhoods). This contradictory finding provided further evidence that there might be at least two types of representations involved in the recognition of spoken words: phonotactic information, which is stored in sublexical representations, and information about phonological neighbors, which is stored among lexical representations. Additional evidence for the existence of lexical and sublexical representations being involved in spoken-word recognition came from Vitevitch and Luce (1999) and from Vitevitch (2003). Vitevitch and Luce (1999) found in a same–different task, where listeners hear two stimuli and must decide as quickly and as accurately as possible if the two items are the same or different, that monosyllabic words with low phonotactic probability/sparse phonological neighborhoods were responded to more quickly than monosyllabic words with high phonotactic probability/dense phonological neighborhoods, suggesting that lexical representations were being used to recognize the stimuli and perform the task. As predicted by the neighborhood activation model (Luce & Pisoni, 1998), words with few competitors (i.e. a sparse neighborhood) were recognized more quickly than words with many competitors (i.e. a dense neighborhood). In the case of monosyllabic nonwords, however, in the same–different task – a task that, like the auditory naming task used in Vitevitch and Luce (1998), does not require one to access the mental lexicon in order to perform it – nonwords with low phonotactic probability/sparse phonological neighborhoods were, in contrast to the real words, responded to more slowly than monosyllabic nonwords with high phonotactic probability/dense phonological neighborhoods. As in Vitevitch and Luce (1998), this finding suggested that sublexical representations were being used to process the nonsense words. In Experiment 3 of Vitevitch and Luce (1999), a lexical decision task was used where listeners hear a stimulus and must decide as quickly and as accurately as possible if they heard a real word in English or a made-up nonsense word. Note that in this task one must access the mental lexicon in order to perform the task (i.e. to decide if the stimulus matches a representation in the mental lexicon and is therefore a word, or if, after exhaustive search, the stimulus fails to match a representation in the mental lexicon and is therefore not a word). Vitevitch and Luce found that monosyllabic words with low phonotactic probability/sparse phonological neighborhoods were responded to more quickly than monosyllabic words with high phonotactic probability/dense phonological neighborhoods, as in the previous experiments by Vitevitch and Luce, again indicating that lexical representations were being used to process the words. For the nonwords in the lexical decision task, which now required participants to use lexical representations to process all stimuli, Vitevitch and Luce now found that monosyllabic nonwords with low phonotactic probability/sparse phonological neighborhoods were responded to more quickly than monosyllabic nonwords with high phonotactic probability/dense phonological neighborhoods. This pattern resembled the way in which the real English words in this and in previous experiments were processed, and suggested that lexical representations were being used to process the nonwords.

Further evidence suggesting that sublexical and lexical representations were involved in the recognition of spoken words came from Vitevitch (2003), where the same–different matching task was used again, but this time the proportion of nonword pairs varied across participants. Recall that in the same–different matching task the listener can use either sublexical or lexical representations to process the input, depending on which type of representation is most predictive of the input. To encourage one group of participants to use sublexical representations in this task, more nonword pairs were presented to participants; this would result in sublexical representations being used to process the input even when presented with pairs of words. To encourage another group of participants to use lexical representations in this task, more word pairs were presented to participants; this would result in lexical representations being used to process the input even when presented with pairs of nonwords. As predicted, when more word pairs were present in the same–different task, Vitevitch (2003) found that a set of real word pairs were responded to such that the words with low phonotactic probability/sparse phonological neighborhoods were responded to more quickly than words with high phonotactic probability/dense phonological neighborhoods, indicating that lexical representations were used to process the stimuli. Crucially, however, when more nonword pairs were present in the same–different task, the same set of real word pairs were responded to such that the words with low phonotactic probability/sparse phonological neighborhoods were now responded to more slowly than words with high phonotactic probability/dense phonological neighborhoods, indicating that sublexical representations were used to process the stimuli, and providing additional evidence that sublexical and lexical representations may play a role in spoken-word recognition in adults. Given the increasing amount of evidence that sublexical and lexical representations play a role in spoken-word recognition, we turn to models of spoken-word recognition to consider how they may be able to accommodate these findings.

Representing phonotactic information in models of language processing Several previous models of spoken-word recognition – for example, the neighborhood activation model (NAM; Luce & Pisoni, 1998) and cohort theory (Marslen-Wilson & Welsh, 1978) – suggested that spoken-word recognition occurred by acoustic-phonetic input being mapped directly onto lexical word forms (depending on the extent to which the word form and spoken input matched). However, the findings by Vitevitch and Luce (1998) suggested that other – sublexical – representations might be involved in spoken-word recognition as well. Although other (computational) models of spoken-word recognition, such as TRACE (McClelland & Elman, 1986; see also Strauss, Harris, & Magnuson, 2007) and Shortlist (Norris, 1994; see also Norris & McQueen, 2008), did have phonological segments (or other sublexical representations) included in the model, these sublexical

representations were included to account for various speech-perception findings rather than as an important component of spoken-word recognition, as Vitevitch and Luce (1998) suggested. In addition, the frequency with which the segments occurred in the language (i.e. phonotactic probability) wasn't actually encoded in those representations, so even though those models had sublexical representations they didn't directly model phonotactic knowledge. Note, however, that McClelland and Elman (1986) showed that some phonotactic knowledge could emerge as a result of a conspiracy effect among lexical representations. That is, the TRACE model would "know" that /tl/ was not a legal sequence of segments in English, but that /tr/ was a legal sequence of segments in English simply because there were several lexical representations that started with /tr-/, but none that started with /tl-/. Vitevitch and Luce (1998, 1999) appealed to the principles found in adaptive resonance theory (Grossberg & Stone, 1986), but relied on verbal reasoning (and previous simulations by Grossberg & Stone, 1986, in related domains) to make their case. Subsequent simulations by Pitt, Myung, and Altteri (2007) demonstrated that indeed the principles found in ART could account for the findings reported in Vitevitch and Luce (1998, 1999) and other studies including those in postlingually deafened adults who used cochlear implants (Vitevitch et al., 2002). In the ART framework acoustic-phonetic input activates clusters of features in memory, referred to as items, which then activate larger groupings of items (e.g. phonological segments, sequences of segments, syllables, words), referred to as list chunks (Grossberg & Stone, 1986). Large list chunks (corresponding to words) compete with other large list chunks (i.e. other words), and inhibit (or mask) smaller chunks that correspond to syllables, sequences of segments, phonological segments, and so on. The list chunk that best matches the activated items establishes a resonance loop, thereby bringing to conscious awareness the information associated with the winning list chunk. For language tasks, such as the lexical decision task, that require access to lexical information, the larger, word-sized list chunks are most predictive of the input, and establish resonance with the incoming list information. The use of these larger word-sized list chunks during processing results in the effects described earlier: stimuli with low phonotactic probability/sparse phonological neighborhoods are responded to more quickly and accurately than monosyllabic stimuli with high phonotactic probability/dense phonological neighborhoods. For language tasks – such as the same–different task, or auditory naming task – that do not require access to lexical information, or when there is no lexical information to be retrieved as when nonwords are used as stimuli, the smaller list chunks corresponding to segments or sequences of segments prove to be most predictive of the input, and establish resonance with the incoming list information. The use of these smaller segment-sized list chunks during processing results in the effects described earlier: stimuli with high phonotactic probability/dense phonological neighborhoods are responded to more quickly and accurately than monosyllabic stimuli with low phonotactic probability/sparse phonological neighborhoods.
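The point noted above, that a system storing only word forms can behave as if it "knows" that /tl-/ is illegal and /tr-/ is legal word-initially, can be illustrated with a deliberately simple sketch. It is not an implementation of TRACE or ART; it merely shows how a legality judgment can fall out of a lexicon of stored forms, using a hypothetical word list and function name.

```python
def onset_attested(cluster, lexicon):
    """A lexicon of phoneme-tuple word forms implicitly encodes which onsets
    are legal: treat a cluster as legal if at least one stored word form
    begins with it (a conspiracy-style account, not stored phonotactics)."""
    n = len(cluster)
    return any(entry[:n] == cluster for entry in lexicon)

# Hypothetical stored word forms: tree, trip, truck, lap, tab.
lexicon = [('t', 'r', 'iy'), ('t', 'r', 'ih', 'p'), ('t', 'r', 'ah', 'k'),
           ('l', 'ae', 'p'), ('t', 'ae', 'b')]

print(onset_attested(('t', 'r'), lexicon))   # True: several /tr-/ words conspire
print(onset_attested(('t', 'l'), lexicon))   # False: no stored word begins /tl-/
```

Extending such a count from mere attestation to relative frequency is one way to think about how graded phonotactic probability could likewise emerge from lexical representations, although, as noted above, the conventional models did not encode that frequency information directly.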
For more information regarding how conventional models of spoken‐word recognition accounted for the results of Vitevitch and Luce (1998, 1999) see Auer and

Luce (2005). Luce, Goldinger, and Vitevitch (2000) highlighted how the ART approach differs from TRACE and Shortlist by having lexical and sublexical representations, but not organizing them in an ordered, hierarchical manner, as is found in those other models of spoken-word recognition. This arrangement provides an alternative way of thinking about "feedback" between lexical and sublexical representations, thereby sidestepping some of the issues associated with modular versus interactive models of spoken-word recognition. In addition, the ART framework provides a set of principles that have been applied to other cognitive domains, including learning, visual perception, memory, and attention, enabling speech perception and word recognition to be connected to other aspects of cognition rather than be separate, isolated systems, as is found in most conventional models of speech perception and spoken-word recognition.

Network science: An alternative way to model phonotactic probability A more recent and novel approach to spoken‐word recognition using the computational tools of network science (Vitevitch,  2008) offers an alternative way to account for the influences of phonotactic probability and neighborhood density on spoken‐word recognition. In the network science approach to spoken‐word recognition, words are represented by nodes, and connections are placed between nodes if the words are phonologically related, forming a web‐like network of the mental lexicon that reflects the phonological similarity of words (for a network of the mental lexicon that reflects the semantic similarity of words see Steyvers & Tenenbaum, 2005). In contrast to the more conventional psycholinguistic models described, which focused on the processes involved in spoken‐word recognition, the network science approach focuses on the overall structure that is formed by phonologically similar words in the network. One of the fundamental assumptions of network science is that the structure of a network influences the dynamics of that system (Watts & Strogatz, 1998). That is, a process may operate very efficiently on a network that is structured in one way, but in a network with the same number of nodes and the same number of connections arranged in a slightly different way the same process may now be very inefficient (Kleinberg, 2000). The focus on the overall structure of the phonological lexicon in the network science approach differs further from the psycholinguistic models described earlier, which tended to focus on the characteristics of an individual word (e.g. word frequency, word length, phonotactic probability, neighborhood density, etc.) for an explanation of why some words are responded to differently from others. In the network science approach the structure of the phonological lexicon influences how quickly and accurately a word can be retrieved from the lexicon. A computer simulation by Vitevitch, Ercal, and Adagarla (2011) implemented a simple process (i.e. activation diffusing across a network) on phonological networks to demonstrate that different network structures can influence how quickly and accurately a word

is retrieved from the phonological network (see also Siew, 2019). Differences in the speed and accuracy of retrieving words from the phonological network were observed even though the characteristics of individual words that are typically examined in psycholinguistics (e.g. word frequency, neighborhood density, etc.) were controlled, further emphasizing the influence that the structure of the network has on processing. The structure observed in the phonological network can be examined at a variety of levels (Vitevitch & Castro, 2015). At the micro level one can examine the characteristics of an individual word (and perhaps the words immediately around it), much like conventional psycholinguistics does. At the macro level one can measure characteristics of the whole network, and examine how those global structures influence various processes. At the meso level one can examine groups of nodes – something smaller than the whole network, but more than an individual node – to determine how those groups might influence processing in some way. For work examining the micro level of the phonological network see Chan and Vitevitch (2009, 2010), who examined whether words in a phonological neighborhood that were also neighbors of each other influenced word recognition and word production (this measure is known as the clustering coefficient). For research looking at the macro level of the phonological network see Vitevitch and Castro (2015), who examined whether words that were or were not part of the largest group of connected words, known as the giant component, in the network influenced their retrieval. More relevant to the present discussion of phonotactic probability is work by Siew (2013), who used the network approach to examine the phonological network at the meso level by looking for communities of words in the phonological network. Communities (also called modules) are subgroups of nodes that tend to be more connected to each other than to nodes in other communities. A commonly used community detection algorithm is the method developed by Girvan and Newman (2002; see Yang, Algesheimer, & Tessone, 2016, for a comparison of other community detection algorithms). Siew found a number of communities in the giant component. Closer examination of the words in each community revealed that each community tended to be populated by words that had a common set of segments and sequences of segments. Siew suggested that processing in the phonological network that occurred at the micro level might result in the lexical effects typically observed in other studies. Recall that the micro level of a network refers to the characteristics of an individual word and the words immediately around it. Thus, effects related to the number of words that sound like a given word, known in psycholinguistics as neighborhood density (and called degree in the phonological network), would emerge if processing in the network was focused on the micro level. Perhaps counterintuitively, if processing in the network were broadened out to the meso level, one might observe the sublexical effects reported in Vitevitch and Luce (1998, 1999). Recall that at the meso level groups of nodes – something smaller than the whole network, but larger than an individual node – might influence processing. The subgroups of nodes, referred to as communities, that Siew (2013) observed in the phonological network shared common segments and sequences of

segments. Thus, effects of phonotactic probability might emerge if processing in the network was not focused on an individual word node (i.e. the micro level) but on the subgroup of nodes (i.e. a community) that shared certain segments and sequences of segments. In the phonological network approach there is not a separate set of representations for segments and sequences of segments as suggested by Vitevitch and Luce (1998, 1999). Rather, phonotactic information emerges from the lexical nodes that reside in the same community. The idea of phonotactic information emerging from a network of phonological word forms appears similar to very early models of spoken-word recognition that proposed only word forms as the relevant representation used in processing, instead of there being lexical and sublexical representations as proposed by Vitevitch and Luce (1998, 1999). The network approach also resembles the way that the TRACE model extracts phonotactic information from the groups of words that are activated by input like /tr-/, and the failure to activate any words that start with /tl-/ instead of directly storing that information in a separate set of sublexical representations. Although network science has been used in a number of disciplines and to examine other aspects of language than the phonological lexicon – including certain aspects of word learning (Hills et al., 2009; Beckage, Smith, & Hills, 2011) – additional work needs to be done to determine the extent to which this approach will prove fruitful for understanding how phonotactic probability is represented and influences processing. In what follows we highlight several other topics related to phonotactic probability that would also benefit from continued research.
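A rough sketch of the kind of network described in this section is given below. It assumes the networkx library, uses the one-phoneme neighbor rule from the earlier sketch to place edges, and shows where micro-level measures (degree, corresponding to neighborhood density, and the clustering coefficient) and a Girvan-Newman-style meso-level community partition would be computed. The mini-lexicon is hypothetical and far too small to yield meaningful community structure; it only illustrates the mechanics.

```python
import networkx as nx
from itertools import combinations
from networkx.algorithms.community import girvan_newman

def is_neighbor(w1, w2):
    """One-phoneme substitution, addition, or deletion (as in the earlier sketch)."""
    if len(w1) == len(w2):
        return sum(a != b for a, b in zip(w1, w2)) == 1
    if abs(len(w1) - len(w2)) == 1:
        longer, shorter = (w1, w2) if len(w1) > len(w2) else (w2, w1)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))
    return False

def build_phonological_network(lexicon):
    """Nodes are word forms; an edge links two forms that are phonological neighbors."""
    G = nx.Graph()
    G.add_nodes_from(lexicon)
    for w1, w2 in combinations(lexicon, 2):
        if is_neighbor(w1, w2):
            G.add_edge(w1, w2)
    return G

# Hypothetical mini-lexicon of phoneme tuples.
lexicon = [('k', 'ae', 't'), ('k', 'ae', 'p'), ('ae', 't'), ('b', 'ae', 't'),
           ('s', 'k', 'ae', 't'), ('m', 'ah', 'g'), ('m', 'ah', 'd')]
G = build_phonological_network(lexicon)

word = ('k', 'ae', 't')
print(G.degree[word])           # degree = neighborhood density (micro level)
print(nx.clustering(G, word))   # clustering coefficient (micro level)

# Meso level: first Girvan-Newman partition into communities.
print(next(girvan_newman(G)))
```

In a full-scale lexicon, words falling into the same community would tend to share segments and biphones, which is how phonotactic-probability-like effects could emerge at the meso level without a separate store of sublexical representations.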

Languages other than English Just as the rat and pigeon became model organisms for the development and exploration of general laws of behavior, the English language could be viewed as a model language in the field of psycholinguistics, and used to develop and explore the general processing principles that are used to comprehend and produce speech (Vitevitch, Chan, & Goldstein, 2014). Indeed, there are many resources and databases available for speech and language researchers to use to assess many different aspects of English phonemes, words, sentences, and so on, but fewer such resources in other languages. There are many differences between human languages – especially in their phonology and morphology – that may limit the extent to which English can indeed be viewed as a model language in psycholinguistics, making it necessary for researchers to explore these characteristics in a variety of other languages in order to better understand how phonological, morphological, and semantic systems work together. Some research in Dutch suggests that phonotactic probability may be used the same way across languages. For example, Zamuner (2013) found that two-year-old Dutch children perceived segmental contrasts more accurately when presented in high phonotactic probability environments than when the same contrasts were presented in low phonotactic probability environments. Similarly, Verhagen et al.

(2016) found that two-year-old Dutch-speaking children showed relationships between phonotactic probability and nonword recognition similar to those typically found in native English-speaking children, which suggests that phonotactic probabilities may be used in the same way across languages. Finally, van der Kleij, Rispens, and Scheper (2016) examined the effect of phonotactic probability on the learning of nonwords in Dutch-speaking children, and found results similar to those found in English (e.g. Hoover, Storkel, & Hogan, 2010), namely nonwords with low phonotactic probability/sparse phonological neighborhoods triggered word learning better than other combinations of phonotactic probability and neighborhood density, and dense phonological neighborhoods seemed to facilitate configuration and engagement of nascent lexical representations in the mental lexicon. Such results increase our confidence that phonotactic probability (and neighborhood density) influence processing the same way across languages (however, see Vitevitch & Rodríguez, 2005, for differences in how neighborhood density influences spoken-word recognition in Spanish). Note, however, that Dutch comes from the same language family as English, so a stronger test of the assumption that phonotactic probability influences processing the same way across languages would be to examine a language that differs in significant ways from English. One such language is Arabic. Unlike English, Arabic is a Semitic language whose morphology is based on mapping fixed vowel patterns onto discontinuous root consonants (Holes, 1995). The consonant roots carry semantic information (Wright, 1995); for example, the root {ktb} refers to the notion of "writing." By mapping different vowel patterns to the consonant root, different words related to writing can be created. For example, kataba means "to write," kaataba means "he corresponded," kitaab means "book," kaatib means "writer," maktab means "office," kutub means "books," and maktuub is the past participle "written." Because of this interesting combination of root consonants and pattern vowels in Arabic morphology, studying phonotactic probability effects in such a language may provide insights into not only how phonotactic probability is represented but also how phonological, morphological, and semantic information interact during processing (see also Calderone, Celata, & Laks, 2014). Crucial to such work in other languages is the development of and broad accessibility to various databases for calculating different language characteristics such as phonotactic probability. Although a web-based calculator of phonotactic probability for English segments and sequences of segments (Vitevitch & Luce, 2004) has been available (and widely used, as evidenced by over 465 citations on Google Scholar as of May 19, 2020) for over a decade, and a similar web-based calculator has been available for English segments and sequences of segments based on a child's lexicon (e.g. Storkel & Hoover, 2010; Storkel, 2013), there have been comparatively fewer resources available in other languages. However, a web-based version of a phonotactic probability calculator has recently been made available in Spanish (Vitevitch, Stamer, & Kieweg, 2012) and in Arabic (Aljasser & Vitevitch, 2018), which will hopefully lead to an increase in research on languages such as Arabic whose morphological systems differ from English.
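The root-and-pattern morphology just described can be illustrated with a short sketch that interdigitates the triconsonantal root {ktb} with a few vocalic templates drawn from the examples above. The C1/C2/C3 slot notation and the function name are illustrative conventions only, not a claim about how Arabic morphology is mentally represented or processed.

```python
def interdigitate(root, template):
    """Fill the consonant slots C1, C2, C3 of a template with a triconsonantal root."""
    out = template
    for i, consonant in enumerate(root, start=1):
        out = out.replace(f"C{i}", consonant)
    return out

root = ("k", "t", "b")          # the 'writing' root {ktb}
templates = {
    "C1aC2aC3a": "to write",                    # kataba
    "C1aaC2iC3": "writer",                      # kaatib
    "C1iC2aaC3": "book",                        # kitaab
    "maC1C2uuC3": "written (past participle)",  # maktuub
}
for template, gloss in templates.items():
    print(interdigitate(root, template), "-", gloss)
```

Because the consonants of the root are interleaved with, rather than adjacent to, the vowels of the pattern, the segment sequences whose probabilities a calculator counts cut across morphological units in Arabic, which is one reason such a language provides a strong test of phonotactic probability effects.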
To confirm the generalizability of various models of language processing, it would be important

to see if phonotactic probability influences processing in these languages in the same way as it influences processing in English (cf. Vitevitch & Rodríguez, 2005). If differences across languages are found, then language scientists may have a new set of questions to answer to increase our understanding of how language may be produced, perceived, and learned.

Phonotactic information in bilingual speakers A line of research related to the work just described on the influence of phonotactic probability (and neighborhood density) in other languages examines how phonotactic information is represented in and used by individuals who know more than one language. Early work with Lithuanian–English bilinguals (Anisfeld, Anisfeld, & Semogas, 1969) and English–German bilinguals (Altenberg & Cairns, 1983) demonstrated that representations of the phonotactic information of both languages affect one another. In both studies, bilingual and monolingual participants gave wordlikeness ratings for nonword stimuli. The stimuli were constructed such that the initial consonant clusters were (1) legal in one language but not in the other, (2) legal in both languages, or (3) legal in neither language. When asked to rate how good an English word an item would be, one would expect equivalent ratings by bilinguals and monolinguals if the phonotactic information of the two languages were independent of one another. However, the results showed that the bilingual participants gave higher ratings than the monolinguals to the nonwords that had sequences that were legal in both languages. This suggests that the phonotactic information for each language is not stored independently of the other. Bilinguals appear to have access to phonotactic information from both languages simultaneously, rather than having to switch between the languages. Furthermore, evidence suggests that L2 speakers cannot suppress their L1 phonotactic knowledge when processing words in the L2 (Weber & Cutler, 2006). Using the word-spotting paradigm, Weber and Cutler showed that, like native English speakers, German speakers learning English as a foreign language effectively used English phonotactic constraints in the lexical segmentation of English. However, Weber and Cutler also observed that L1 German speakers used their language-specific phonotactic constraints even when those phonotactic constraints were not helpful to the task at hand. An early study on phonotactics by Anisfeld and Gordon (1971) suggests that very little exposure to a second language can influence wordlikeness ratings in both languages. Anisfeld and Gordon asked students who took one college course in German to rate the wordlikeness of nonwords that were legal in English but illegal in German, or nonwords that were legal in both English and German. For the nonwords that were legal in English but illegal in German, the wordlikeness ratings were the same for students who took one college course in German and for students who knew no German. However, the nonwords that were legal in both English and German were rated as being more wordlike by the students who took

one college course in German than by the students who knew no German, suggesting that very little exposure to another language can influence the lexicon. Some research has started to investigate the effect of phonotactic knowledge on word learning in bilingual children, but more research in this area is still needed. For example, using a fast mapping task that manipulated phonotactic probability in English, Alt, Meyers, and Figueroa (2013) compared the performance of school-aged Spanish–English bilingual children (aged seven to eight years) and preschoolers (aged four to five years) to English monolingual children. They assessed fast mapping by using both name identification and naming tasks. Alt, Meyers, and Figueroa found that fast mapping for both age groups in both tasks was facilitated by high phonotactic probability. However, whereas school-aged children showed equivalent performance in both tasks, bilingual preschoolers, despite their identical pattern of performance, were less accurate than monolinguals in the naming task. Although we've discussed several studies that have examined how phonotactic information influences the acquisition of a second language, much work remains to be done on how phonotactic information in one language influences processing in other languages that a person knows (e.g. the work of Weber & Cutler, 2006, described earlier; see also Carlson et al., 2016). Increasing our understanding of such influences may lead to enhanced methods for teaching second languages, for altering speech patterns that may contribute to a foreign accent, or for other clinical applications as will be described.

Implications for speech, language, and hearing disorders Increasing our understanding of how phonotactic information influences various language processes in typically developing children and adults may also provide clinicians with insights about the mechanisms that may be involved in atypical development; new tools for the diagnosis of various speech, language, and hearing disorders; and evidence that can inform the treatment of those disorders. Here we describe a few of the diagnostic advances that have been made by considering phonotactic information, and also give an example of how phonotactic information could inform treatment. Clinicians in the speech, language, and hearing sciences have a wide array of tests and word lists at their disposal for the assessment of various disorders. Given the prevalent influences of phonotactic probability on various aspects of language processing, the construction of future lists and tests should take into account item-level characteristics of words such as phonotactic probability, as suggested by Edwards and Beckman (2008; see also Kirk, Pisoni, & Miyamoto, 1997). Given that certain magnetoencephalography and electroencephalography components are sensitive to differences in phonotactic probability (Bonte et al., 2005), clinicians may also consider using these approaches as a means to detect certain language disorders or to provide converging evidence for a possible disorder in cases that may seem subtle or are borderline according to behavioral tests or word lists.
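As one concrete, intentionally simplified illustration of taking item-level characteristics into account when assembling test lists, the sketch below median-splits a set of candidate items on a phonotactic probability score, such as one obtained from the calculators cited earlier in this chapter. The candidate items and scores here are made up, and real test construction would also control word frequency, neighborhood density, length, and familiarity.

```python
from statistics import median

def split_by_probability(items):
    """Median-split candidate items into low- and high-probability sets.

    `items` maps each candidate (e.g. a nonword transcription) to a
    phonotactic probability score; the values below are hypothetical.
    """
    cutoff = median(items.values())
    low = [w for w, p in items.items() if p < cutoff]
    high = [w for w, p in items.items() if p >= cutoff]
    return low, high

candidates = {"pim": 0.19, "nep": 0.17, "hif": 0.06, "thezh": 0.04,
              "tar": 0.21, "dawb": 0.05}
low, high = split_by_probability(candidates)
print("low probability:", low)
print("high probability:", high)
```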

Phonotactic probability has been shown to be diagnostically relevant in a number of speech, language, and hearing disorders. For example, Stiles, McGregor, and Bentler (2013) found subtle differences in children with hearing loss compared to children with normal hearing in learning novel words that varied in phonotactic probability even though both groups of children identified high-wordlike novel words more accurately than low-wordlike novel words. Similarly, in adults who use cochlear implants, Vitevitch et al. (2002) found differences in responses to words and nonwords that varied in phonotactic probability. Finally, Han et al. (2016) found under noisy conditions (such as those that typically prove challenging for listeners with subtle hearing loss) that word learning in adults benefited from a convergence of phonotactic probability and neighborhood density variables, which differs from how these variables affect word learning under quiet listening conditions. Phonotactic probability has also been found to be diagnostically useful in children with specific language impairment (SLI), now referred to as developmental language disorder (DLD). For example, Munson, Kurtz, and Windsor (2005) found that measures of phonotactic probability were better predictors of repetition accuracy than judgments of wordlikeness, further highlighting the use of such measures in the construction of diagnostic word lists. The diagnostic sensitivity of phonotactic probability for certain disorders may depend on the nature of the task being used to test the speaker. For example, instead of looking at repetition accuracy, as is often done, Burke and Coady (2015) examined the errors that children actually made in a nonword-repetition task. They found that children with SLI/DLD and their peers with typical language development both substituted less frequent phonemes with more frequent ones, and less probabilistic syllables with higher probability ones. That is, there was no difference between the two groups of speakers in the types of errors they made. If a more challenging task were to be employed, however, one might be able to observe group differences that would be diagnostically useful. Leonard, Davis, and Deevy (2007) had children with SLI/DLD, typically developing children matched for age, and children matched for mean length of utterance inflect novel verbs (varying in phonotactic probability) with the past tense -ed. Given that children with SLI/DLD are known to have problems with morphology (i.e. conjugating verbs), this is indeed a challenging task for children with SLI/DLD. Leonard, Davis, and Deevy found that children with SLI/DLD were less likely than the other two groups to produce the novel verbs with -ed, especially when the novel verbs had low phonotactic probability. The other two groups of children did not show this difference, suggesting that an appropriately challenging task with stimuli that vary in phonotactic probability may prove to be diagnostically useful. In addition to being useful in detecting language-recognition disorders, phonotactic probability also seems to be a sensitive/diagnostic measure of disorders of speech production. Lallini and Miller (2011) found that some native speakers of English who acquired a speech-output impairment after stroke repeated nonwords with high phonotactic probability more accurately than nonwords with low phonotactic probability. Thus, phonotactic probability may provide a more refined

diagnostic tool for the differential diagnosis of speech output disorders as well as other language and hearing disorders as described earlier. Turning briefly now to insights into treatment discovered by studying phonotactics, Plante et al. (2011) demonstrated that long-term exposure or extensive articulatory practice may not be needed to modify the representation of word forms in children. Instead, brief and simple manipulations of the input may have beneficial treatment effects. Plante et al. presented typically developing children and children with SLI/DLD with nonwords in a nonword-repetition task. The nonwords varied in phonotactic probability and in the frequency with which they were presented within the experiment. Phonotactic probability had an influence on how quickly and accurately the nonwords were repeated by the typically developing children and by the children with SLI/DLD. Both groups of children also showed effects of the short-term manipulations of presentation frequency. Because it was previously unknown if children with SLI/DLD were sensitive to short-term manipulations of presentation frequency, the study by Plante et al. demonstrated that even children with SLI/DLD could benefit from short-term manipulation of presentation frequency in a single session, suggesting that brief, short-term therapeutic sessions may hold much benefit for children with SLI/DLD. Not only has the research on phonotactic probability increased our understanding of the basic language sciences and been applied to the diagnosis and treatment of language-related disorders, but it has also been applied in other contexts.

Phonotactics in other contexts Although much has been learned about language processing from carefully controlled experiments carried out in laboratory settings, relying on only such findings runs the risk of “controlling away” interesting effects that naturally occur in the real world (Vitevitch, 2002). Other methods, such as naturalistic observation and the mining of existing sources of big data from mobile devices and other ubiquitous technologies, should also be explored as test beds of theories and hypotheses about language processing. Indeed, Griffiths (2015) has called for a new kind of cognitive revolution that employs computational methods to analyze the large‐scale datasets generated from our interactions with and via computers to test theories of cognition so as to show that our findings have an impact in the real world and not just in the pages of our scientific journals. For example, Lambert et al. (2010) showed that the neighborhood density of drug names influenced the number and type of errors clinicians and patients made in recalling the names. Making an error in a drug name may, in an extreme case, prove to be fatal. Thus, this work demonstrates that the theories of word recognition and memory developed in the laboratory can have significant real‐world impact. The creation of easily shared databases of naturally occurring speech errors (Vitevitch et al., 2015) provides one opportunity for examining how phonotactic probability influences language processing. For example, one could examine

302  Perception of Linguistic Properties whether the findings of Burke and Coady (2015) described earlier – that children with SLI substituted less frequent phonemes with more frequent ones – also hold true for naturally occurring speech errors such as slips of the tongue (i.e. production error) or slips of the ear (i.e. perceptual error) in individuals without SLI. Large‐scale “mega‐studies” (e.g. Brysbaert et al., 2016) combined with sophisticated statistical techniques for analysis also make it easier to examine the accuracy and reaction time of responses to thousands of words to see how phonotactic probability interacts with numerous other lexical characteristics in ways that would be impossible to do in the traditional, conventional laboratory task using small samples of carefully controlled words and a small sample of (often times homogeneous) participants. In addition to conducting research on phonotactic probability outside of the laboratory, future research should also consider how phonotactic probability influences other cognitive processes. Perhaps not surprisingly, phonotactic probability influences other language‐related processes. For example, Apel, Wolter, and Masterson (2006) found that the spelling accuracy of five‐year‐old preschool children who were just beginning to learn how to read and write was influenced independently by phonotactic and orthotactic probability (the frequency of letters and sequences of letters in a word). But it’s not just language‐related processes that are influenced by phonotactic probability. Early work demonstrated that nonwords with high phonotactic probability were recalled more accurately than nonwords with low phonotactic probability in short‐term memory tasks (Gathercole et  al.,  1999). More recent work shows that the growth in verbal short‐term memory is in part a result of increased knowledge of phonotactic information in long‐term memory (Messer et al., 2015). Other higher‐level cognitive processes may also be influenced by phonotactic probability. For example, Song and Schwarz (2009) investigated the effect of the difficulty of pronouncing brand names that represented food additives or amusement rides on their perceived risk. Rather than use objective measures of phonotactic probability of the novel names they created, ease of pronunciation was subjectively assessed by using difficulty of pronunciation ratings from another set of subjects. (However, given the correlation between phonotactic probability and wordlikeness ratings [e.g. Vitevitch et al., 1997], it is likely that the difficult‐to‐pronounce names were low in phonotactic probability and the easier‐to‐pronounce names were high in phonotactic probability.) They found that names that were difficult to pronounce were rated as more harmful food additives or riskier amusement rides than names that were easier to pronounce. Based on these results, Song and Schwarz (2009) proposed that intentionally designing product names that were difficult to pronounce could draw consumer’s attention to potentially hazardous products. A more recent study investigated the effect of phonotactic probability on another aspect of decision making. Vitevitch and Donoso (2012) examined the influence of objectively measured phonotactic probability of potential brand names on participant’s attitudes as expressed in the likelihood of their buying the product. Vitevitch and Donoso found that brand names with high phonotactic probability were more likely to be purchased than brand names with low phonotactic

Phonotactics in Spoken-Word Recognition  303 probability. Taken together, these two studies show that the phonological characteristics of words as expressed in the subjectively rated difficulty of pronunciation (Song & Schwarz,  2009) or the objectively measured phonotactic probability (Vitevitch & Donoso,  2012) can influence important higher‐level cognitive processes like risk perception and decision making in consumers. Keeping information like phonotactic probability in mind when constructing brand names could draw attention to potentially hazardous products as Song and Schwarz (2009) suggest, or could result in brand names that are more appealing, as well as easy to discriminate from other products, as Vitevitch and Donoso (2012) suggest. As globally oriented companies branch out into international markets, studies of phonotactic probability in other languages may prove to be even more important.
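The neighborhood density variable that recurs in this section, for example in the drug-name errors studied by Lambert et al. (2010) and in the word-learning work reviewed earlier, is typically operationalized as the number of real words differing from a target by a single phoneme substitution, addition, or deletion. The sketch below is a minimal illustration under that definition; the phonemic transcriptions and function names are ours, and a real analysis would draw on a full pronouncing dictionary.

```python
def one_phoneme_apart(a, b):
    """True if phoneme sequences a and b differ by one substitution, addition, or deletion."""
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) == 1:
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        # Deleting exactly one phoneme from the longer form must yield the shorter form.
        return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))
    return False

def neighborhood_density(word, lexicon):
    """Number of real words one phoneme away from `word` (the word itself excluded)."""
    return sum(one_phoneme_apart(word, other) for other in lexicon if other != word)

# Hypothetical phonemic transcriptions; a real analysis would use a pronouncing dictionary.
lexicon = [("k", "ae", "t"), ("b", "ae", "t"), ("k", "ae", "p"),
           ("k", "ae"), ("s", "k", "ae", "t"), ("d", "ao", "g")]
print(neighborhood_density(("k", "ae", "t"), lexicon))  # 4 neighbors in this toy lexicon
```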

Conclusion Since the publication of Auer and Luce (2005), which summarized the work on phonotactic probability to that point, much additional research on phonotactic probability has been carried out, only some of which has been summarized here (e.g. see Goldrick & Larson,  2008, for influences of phonotactic probability on speech production). Research with infants shows that sensitivity to phonotactic knowledge begins very early in life. This knowledge of phonotactic probabilities provides useful cues for segmenting words from the continuous stream of speech, and aids in adding new words to one’s lexicon. Phonotactic information also influences how quickly and accurately adults recognize spoken words, and has recently been shown to influence other cognitive processes and decisions as well. There is still much work to be done in the areas highlighted in the chapter, especially in other languages and with speakers who know more than one language. New developments in the methods used in network science are likely to prove useful in the cognitive and language sciences (see Vitevitch,  2019) by providing alternative ways to view how phonotactic information might be represented and might influence various language processes. Methods related to analyzing big data are also likely to provide insight into how phonotactic information is used in the real world rather than in laboratory‐based experiments, and may assist in developing interventions for various speech, language, or hearing disorders. We look forward to seeing how the field progresses and where it will be in the next decade, especially as it moves in new directions that we have not anticipated here.
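For readers unfamiliar with the network-science methods mentioned above, a phonological network is commonly built by treating words as nodes and linking words that differ by one phoneme; graph measures such as the clustering coefficient can then be related to recognition and production (e.g. Chan & Vitevitch, 2009, 2010; Vitevitch, 2019). The sketch below is an illustrative toy example only: the mini-lexicon, the substitution-only edge criterion, and the use of the networkx library are our assumptions, not an analysis from the work cited.

```python
import networkx as nx

# Hypothetical mini-lexicon of phonemic forms; real phonological networks are built from
# full dictionaries, with edges linking words that differ by one phoneme.
words = {"cat": ("k", "ae", "t"), "bat": ("b", "ae", "t"), "hat": ("h", "ae", "t"),
         "cap": ("k", "ae", "p"), "bit": ("b", "ih", "t"), "dog": ("d", "ao", "g")}

def one_substitution_apart(a, b):
    """Edge criterion used in this toy example: equal length, exactly one differing phoneme."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

G = nx.Graph()
G.add_nodes_from(words)
for w1 in words:
    for w2 in words:
        if w1 < w2 and one_substitution_apart(words[w1], words[w2]):
            G.add_edge(w1, w2)

# The clustering coefficient indexes how densely a word's neighbors are themselves
# interconnected, a network property linked to recognition and production in the work cited.
print(nx.clustering(G, "cat"), nx.average_clustering(G))
```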

REFERENCES Aljasser, F., & Vitevitch, M. S. (2018). A web‐based interface to calculate phonotactic probability for words and

nonwords in Modern Standard Arabic. Behavior Research Methods, 50, 313–322.

304  Perception of Linguistic Properties Alt, M., Meyers, C., & Figueroa, C. (2013). Factors that influence fast mapping in children exposed to Spanish and English. Journal of Speech, Language, and Hearing Research, 55, 1237–1248. Anisfeld, M., Anisfeld, E., & Semogas, R. (1969). Cross‐influences between the phonological systems of Lithuanian– English bilinguals. Journal of Verbal Learning and Verbal Behavior, 8, 257–261. Anisfeld, M., & Gordon, M. (1971). An effect of one German language course on English. Language and Speech, 14, 289–292. Apel, K., Wolter, J. A., & Masterson, J. J. (2006). Effects of phonotactic and orthotactic probabilities during fast mapping on 5‐year‐olds’ learning to spell. Developmental Neuropsychology, 29, 21–42. Attenberg, E. P., & Cairns, H. S. (1983). The effects of phonotactic constraints on lexical processing in bilingual and monolingual subjects. Journal of Verbal Learning and Verbal Behavior, 22, 174–188. Auer, E. T., & Luce, P. A. (2005). Probabilistic phonotactics in spoken word recognition. In D. B. Pisoni & R. E. Remez (Eds), The handbook of speech perception (pp. 610–630). Oxford: Blackwell. Bailey, T. M., & Hahn, U. (2001). Determinants of wordlikeness: Phonotactics or lexical neighborhoods? Journal of Memory and Language, 44, 568–591. Beckage, N., Smith, L., & Hills, T. (2011). Small worlds and semantic network growth in typical and late talkers. PLOS One, 6, e19348. Bonte, M. L., Mitterer, H., Zellagui, N., et al. (2005). Auditory cortical tuning to statistical regularities in phonology. Clinical Neurophysiology, 116, 2765–2774. Brysbaert, M., Stevens, M., Mandera, P., & Keuleers, E. (2016). How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Frontiers in Psychology, 7, 1116.

Burke, H. L., & Coady, J. A. (2015). Nonword repetition errors of children with and without specific language impairments (SLI). International Journal of Language & Communication Disorders, 50, 337–346. Calderone, B., Celata, C., & Laks, B. (2014). Theoretical and empirical approaches to phonotactics and morphonotactics. Language Sciences, 46, 1–5. Carlson, M. T., Goldrick, M., Blasingame, M., & Fink, A. (2016). Navigating conflicting phonotactic constraints in bilingual speech perception. Bilingualism: Language and Cognition, 19, 939–954. Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self‐organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115. Chan, K. Y., & Vitevitch, M. S. (2009). The influence of the phonological neighborhood clustering‐coefficient on spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 35, 1934–1949. Chan, K. Y., & Vitevitch, M. S. (2010). Network structure influences speech production. Cognitive Science, 34, 685–697. Crystal, D. (Ed.). (1980) A first dictionary of linguistics and phonetics. London: Andre Deutsch. Dupoux, E., Parlato, E., Frota, S., et al. (2011). Where do illusory vowels come from? Journal of Memory and Language, 64, 199–210. Edwards, J., & Beckman, M. E. (2008). Methodological questions in studying consonant acquisition. Clinical Linguistics & Phonetics, 22, 937–956. Frisch, S. A., Large, N. R., & Pisoni, D. B. (2000). Perception of wordlikeness: Effects of segment probability and length on the processing of nonwords. Journal of Memory and Language, 42, 481–496. Gathercole, S. E., Frankish, C. R., Pickering, S. J., & Peaker, S. (1999). Phonotactic influences on short‐term memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 25, 84–95.

Phonotactics in Spoken-Word Recognition  305 Girvan, M., & Newman, M. E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99, 7821–7826. Goldrick, M., & Larson, M. (2008). Phonotactic probability influences speech production. Cognition, 107, 1155–1164. Griffiths, T. L. (2015). Manifesto for a new (computational) cognitive revolution. Cognition, 135, 21–23. Grossberg, S., & Stone, G. O. (1986). Neural dynamics of word recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46–74. Han, M. K., Storkel, H. L., Lee, J., & Cox, C. (2016). The effects of phonotactic probability and neighborhood density on adults’ word learning in noisy conditions. American Journal of Speech‐ Language Pathology, 25, 547–560. Hills, T. T., Maouene, M., Maouene, J., et al. (2009). Longitudinal analysis of early semantic networks: Preferential attachment or preferential acquisition? Psychological Science, 20, 729–739. Holes, C. (1995). Modern Arabic: Structure, functions and varieties. London: Longman. Hoover, J. R., Storkel, H. L., & Hogan, T. P. (2010). A cross‐sectional comparison of the effects of phonotactic probability and neighborhood density on word learning by preschool children. Journal of Memory and Language, 63, 100–116. Hunter, C. R. (2013). Early effects of neighborhood density and phonotactic probability of spoken words on event‐ related potentials. Brain and Language, 127, 462–474. Jusczyk, P. W., Frederici, A. D., Wessels, J., et al. (1993). Infants’ sensitivity to sound patterns of native language words. Journal of Memory and Language, 32, 402–420. Jusczyk, P. W., Luce, P. A., & Charles‐Luce, J. (1994). Infants’ sensitivity to phonotactic patterns in the native language. Journal of Memory and Language, 33, 630–645.

Kirk, K. I., Pisoni, D. B., & Miyamoto, R. C. (1997). Effects of stimulus variability on speech perception in listeners with hearing impairment. Journal of Speech, Language, and Hearing Research, 40, 1395–1405. Kleinberg, J. M. (2000). Navigation in a small world. Nature, 406, 845. Lallini, N., & Miller, N. (2011). Do phonological neighbourhood density and phonotactic probability influence speech output accuracy in acquired speech impairment? Aphasiology, 25, 176–190. Lambert, B. L., Dickey, L. W., Fisher, W. M., et al. (2010). Listen carefully: The risk of error in spoken medication orders. Social Science and Medicine, 70, 1599–1608. Landauer, T. K., & Streeter, L. A. (1973). Structural differences between common and rare words: Failure of equivalence assumptions for theories of word recognition. Journal of Verbal Learning and Verbal Behavior, 12, 119–131. Leonard, L. B., Davis, J., & Deevy, P. (2007). Phonotactic probability and past tense use by children with specific language impairment and their typically developing peers. Clinical Linguistics and Phonetics, 21, 747–758. Luce, P. A., Goldinger, S. D., & Vitevitch, M. S. (2000). It’s good . . . but is it ART? Behavioral and Brain Sciences, 23, 336. Luce, P. A., & Pisoni, D. B. (1998). Recognizing spoken words: The neighborhood activation model. Ear and Hearing, 19, 1–36. Majerus, S., Collette, F., Van der Linden, M., et al. (2002). A PET investigation of lexicality and phonotactic frequency in oral language processing. Cognitive Neuropsychology, 19, 343–361. Marslen‐Wilson, W. D., & Welsh, A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology, 10, 29–63. McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1–86. Mersad, K., & Nazzi, T. (2011). Transitional probabilities and positional frequency

306  Perception of Linguistic Properties phonotactics in a hierarchical model of speech segmentation. Memory and Cognition, 39, 1085–1093. Messer, M. H., Verhagen, J., Boom, J., et al. (2015). Growth of verbal short‐term memory of nonwords varying in phonotactic probability: A longitudinal study with monolingual and bilingual children. Journal of Memory and Language, 84, 24–36. Munson, B., Kurtz, B. A., & Windsor, J. (2005). The influence of vocabulary size, phonotactic probability, and wordlikeness on nonword repetitions of children with and without specific language impairment. Journal of Speech, Language, and Hearing Research, 48, 1033–1047. Norris, D. (1994). Shortlist: A connectionist model of continuous speech recognition. Cognition, 52, 189–234. Norris, D., & McQueen, J. M. (2008). Shortlist B: A Bayesian model of continuous speech recognition. Psychological Review, 115, 357–395. Pitt, M. A., Myung, J. I., & Altteri, N. (2007). Modeling the word recognition data of Vitevitch and Luce (1998): Is it ARTful? Psychonomic Bulletin & Review, 14, 442–448. Plante, E., Bahl, M., Vance, R., & Gerken, L. (2011). Beyond phonotactic frequency: Presentation frequency effects word productions in specific language impairment. Journal of Communication Disorders, 44, 91–102. Pylkkänen, L., Stringfellow, A., & Marantz, A. (2002). Neuromagnetic evidence for the timing of lexical activation: An MEG component sensitive to phonotactic probability but not to neighborhood density. Brain and Language, 81, 666–678. Rumelhart, D. E., McClelland, J. L, & PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Vol. 1. Foundations. Cambridge, MA: MIT Press. Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.

Siew, C. S. Q. (2013). Community structure in the phonological network. Frontiers in Psychology, 4, 553. Siew, C. S. Q. (2019). Spreadr: An R package to simulate spreading activation in a network. Behavior Research Methods, 51, 910–929. Song, H., & Schwarz, N. (2009). If it’s difficult to pronounce, it must be risky: Fluency, familiarity, and risk perception. Psychological Science, 20, 135–138. Steyvers, M., & Tenenbaum, J. (2005). The large‐scale structure of semantic networks: Statistical analyses and a model of semantic growth. Cognitive Science, 29, 41–78. Stiles, D. J., McGregor, K. K., & Bentler, R. A. (2013). Wordlikeness and word learning in children with hearing loss. International Journal of Communication Disorders, 48, 200–206. Storkel, H. L. (2001). Learning new words: Phonotactic probability in language development. Journal of Speech, Language, and Hearing Research, 44, 1321–1337. Storkel, H. L. (2003). Learning new words II: Phonotactic probability in verb learning. Journal of Speech, Language, and Hearing Research, 46, 1312–1323. Storkel, H. L. (2004). Methods for minimizing the confounding effects of word length in the analysis of phonotactic probability and neighborhood density. Journal of Speech, Language, and Hearing Research, 47, 1454–1468. Storkel, H. L. (2013). A corpus of consonant– vowel–consonant real words and nonwords: Comparison of phonotactic probability, neighborhood density, and consonant age of acquisition. Behavior Research Methods, 45, 1159–1167. Storkel, H. L., Armbruster, J., & Hogan, T. P. (2006). Differentiating phonotactic probability and neighborhood density in adult word learning. Journal of Speech, Language, and Hearing Research, 49, 1175–1192. Storkel, H. L., & Hoover, J. R. (2010). An online calculator to compute phonotactic

Phonotactics in Spoken-Word Recognition  307 probability and neighborhood density on the basis of child corpora of spoken American English. Behavior Research Methods, 42, 497–506. Strauss, T. J., Harris, H. D., & Magnuson, J. S. (2007). jTRACE: A reimplementation and extension of the TRACE model of speech perception and spoken word recognition. Behavioral Research Methods, 39, 19–30. Takac, M., Knott, A., & Stokes, S. (2017). What can neighbourhood density effects tell us about word learning? Insights from a connectionist model of vocabulary development. Journal of Child Language, 44, 346–379. van der Kleij, S. W., Rispens, J. E., & Scheper, A. R. (2016). The effect of phonotactic probability and neighbourhood density on pseudoword learning in 6‐ and 7‐year‐old children. First Language, 36, 93–108. Verhagen, J., Bree, E., Mulder, H., & Leseman, P. (2016). Effects of vocabulary and phonotactic probability on 2‐year‐ olds’ nonword repetition. Journal of Psycholinguistic Research, 46(3), 507–524. Vitevitch, M. S. (2002). Naturalistic and experimental analyses of word frequency and neighborhood density effects in slips of the ear. Language and Speech, 45, 407–434. Vitevitch, M. S. (2003). The influence of sublexical and lexical representations on the processing of spoken words in English. Clinical Linguistics and Phonetics, 17, 487–499. Vitevitch, M. S. (2008). What can graph theory tell us about word learning and lexical retrieval? Journal of Speech, Language, and Hearing Research, 51, 408–422. Vitevitch, M. S. (2019). Network science in cognitive psychology. New York: Taylor & Francis. Vitevitch, M. S., & Castro, N. (2015). Using network science in the language sciences and clinic. International Journal of Speech‐Language Pathology, 17, 13–25.

Vitevitch, M. S., Chan, K. Y., & Goldstein, R. (2014). Using English as a “model language” to understand language processing. In N. Miller & A. Lowit (Eds), Motor speech disorders: A cross‐language perspective. Buffalo, NY: Multilingual Matters. Vitevitch, M. S. & Donoso, A. (2012). Phonotactic probability of brand names: I’d buy that! Psychological Research, 76, 693–698. Vitevitch, M. S., Ercal, G., & Adagarla, B. (2011). Simulating retrieval from a highly clustered network: Implications for spoken word recognition. Frontiers in Language Sciences, 2, 369. Vitevitch, M. S., & Luce, P. A. (1998). When words compete: Levels of processing in spoken word perception. Psychological Science, 9, 325–329. Vitevitch, M. S., & Luce, P. A. (1999). Probabilistic phonotactics and neighborhood activation in spoken word recognition. Journal of Memory and Language, 40, 374–408. Vitevitch, M. S., & Luce, P. A. (2004). A web‐based interface to calculate phonotactic probability for words and nonwords in English. Behavior Research Methods, Instruments, & Computers, 36, 481–487. Vitevitch, M. S., & Luce, P. (2016). Phonological neighborhood effects in spoken word perception and production. Annual Review of Linguistics, 2, 75–94. Vitevitch, M. S., Luce, P. A., Charles‐Luce, J., & Kemmerer, D. (1997). Phonotactics and syllable stress: Implications for the processing of spoken nonsense words. Language and Speech, 40, 47–62. Vitevitch, M. S., Luce, P. A., Pisoni, D. B., & Auer, E. T. (1999). Phonotactics, neighborhood activation and lexical access for spoken words. Brain and Language, 68, 306–311. Vitevitch, M. S., Pisoni, D. B., Kirk, K. I., et al. (2002). Effects of phonotactic probabilities on the processing of spoken words and nonwords by postlingually

308  Perception of Linguistic Properties deafened adults with cochlear implants. Volta Review, 102, 283–302. Vitevitch, M. S., & Rodríguez, E. (2005). Neighborhood density effects in spoken word recognition in Spanish. Journal of Multilingual Communication Disorders, 3, 64–73. Vitevitch, M. S., Siew, C. S. Q., Castro, N., et al. (2015). Speech error and tip of the tongue diary for mobile devices. Frontiers in Psychology, 6, 1190. Vitevitch, M. S., Stamer, M. K., & Kieweg, D. (2012). The Beginning Spanish Lexicon: A web‐based interface to calculate phonological similarity among Spanish words in adults learning Spanish as a foreign language. Second Language Research, 28, 103–112. Vitevitch, M. S., & Storkel, H. L. (2013). Examining the acquisition of phonological word forms with

computational experiments. Language and Speech, 56, 491–527. Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of “small‐world” networks. Nature, 393, 409–410. Weber, A. & Cutler, A. (2006). First‐ language phonotactics in second‐ language listening. Journal of the Acoustical Society of America, 119, 597–607. Wright, W. (1995). A grammar of the Arabic language. Cambridge: Cambridge University Press. Yang, Z., Algesheimer, R., & Tessone, C. J. (2016). A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6, 30750. Zamuner, T. S. (2013). Perceptual evidence for young children’s developing knowledge of phonotactic probabilities. Language Acquisition, 20, 241–253.

12 Perception of Formulaic Speech: Structural and Prosodic Characteristics of Formulaic Expressions

DIANA VAN LANCKER SIDTIS1,2 AND SEUNG YUN YANG3

1 New York University, United States
2 Nathan Kline Institute for Psychiatric Research, United States
3 Brooklyn College, United States

Background Since the advent of generative linguistics in the 1950s (e.g. Chomsky, 1957, 1972), formulaic language has traveled through the three stages of a new idea. The first stage, in the context of generative linguistics, was ridicule: the topic was dismissed as peripheral, referencing the “anomaly,” “puzzle,” or “problems” emanating from linguistic approaches (Chafe,  1968; Pawley & Syder,  1983; Weinreich,  1969). This was followed by the second stage, vigorous argument, much of which involved debating the structural status of idioms (Cacciari & Tabossi,  1988; Cutting & Bock, 1997; Titone & Connine, 1999) and the classification of formulaic expressions (Van Lancker, 1988; Wray, 2002). It appears now that formulaic language has entered the third, “we always knew it was true” stage: general acceptance of its important role in language use and earnest study of its properties (e.g. Moon, 1998a, 1998b; Nordmann & Jambozova,  2017; Nunberg, Sag, & Wasow, 1994; Schmitt & Carter, 2004; Sinclair, 1991; Tannen, 1989; Sprenger, 2003; Wood, 2006). Formulaic expressions (FEs) serve many functions in communication (Wray & Perkins, 2000) and they seem to be present in all languages.1


310  Perception of Linguistic Properties Formulaic language includes a large range of utterance types that appear to be acquired and used as unitary, fixed expressions and stored in memory, each with its unique phonetic, semantic, and usage characteristics. The various subsets within formulaic expressions (FEs) differ notably from each other in form, meaning characteristics, and use. Most FEs carry strong attitudinal or affective connota­ tions. Other less nuanced FEs are irreversible binomials (salt and pepper), lexical bundles (at this point in time), and verb particle constructions (catch up). These are also pronounced with a signature phonetic and intonational shape. Expletives, idioms, conversational speech formulas, proverbs, and lexical bundles, while dif­ fering in several parameters, have in common essentially that they are known to native speakers. Given this variety of types, it is challenging to seek generaliza­ tions about formulaic prosody across FEs. This chapter provides an overview of formulaic language (FL), reviewing function, mental representation, and evidence for its unique status in a model of language, followed by a review of the perceptual characteristics of FL. The few studies revealing listeners’ abilities, offering evidence that FEs can be perceptually distinguished given the auditory‐acoustic signal, are supported by indirect ­evidence of significant acoustic differences between FEs and novel expressions matched in different ways. Types of FEs studied for listeners’ perceptions and acoustic differences include idioms and proverbs. Evidence for perceptual and acoustic features in two important types of formulaic intonation, sarcasm and irony, is described. These observations, as well as a brief overview of relevant neu­ rological findings, lead to a dual process model, whereby FEs and novel expres­ sions constitute essentially different modes of language, engaging disparate cerebral processes.

Formulaic language in contemporary studies At least three recent advances and shifts in related fields provide a welcome con­ text for the serious study of formulaic language. In speech perception, retention by listeners of auditory‐acoustic characteristics in spoken sentences suggests that acoustic‐phonetic detail, including the articulatory features of rate and voice quality, are processed along with the abstract linguistic meaning (Goldinger, 1996, 1998; Levi & Schwartz, 2013; Pisoni, 1993). This acknowledgment of the role of episodic memory in speech perception is congruent with earlier notions of reten­ tion of prior, entire perceptual experiences (Jacoby, 1983), and is consistent with the acquisition of a very large repertory of holistically stored utterance types. Modern approaches to FL are also compatible with recent advances in construction grammar, which proposes that structural shapes of utterances paired with their meanings are retained in memory (Bod, 2006; Goldberg, 2006; Gurevich, Johnson, & Goldberg, 2010; Luka & Choi, 2012). In fact, the idiom has been labeled the ulti­ mate example of construction grammar, as both the words and the structure are known to the native speaker. Third, while linguistic approaches in earlier days pro­ moted a central, abstract grammar as the main contributor to language competence

Perception of Formulaic Speech: Structural and Prosodic Characteristics   311 housed in a specialized cognitive module, theories of emergentism integrate gen­ eral cognitive processes into descriptions of language ability (Bates & Goodman, 1999; MacWhinney,  1999; Snow,  1999; Van Lancker Sidtis,  2015). Something is emergent if it arises from a set of properties or elements but is different in impor­ tant ways from the aggregate of those properties. This perspective allowed for integrated descriptions of language, cognition, and consciousness. For FEs, it is clear that assembling a certain string of words with standard lexical meanings, such as “she has him eating out of her hand,” yields a product that is different in meaning attributes from that predicted by those words in that order, and that memory is crucially involved. Emergentism provides a theoretical context for appreciating the value and specialized status of formulaic expressions. This approach permits investigation into differing modes of acquisition and use to model language ability. For FL, some of these processes include the acquisition, storage, and retrieval of holistic structures, each with unique access to cognitive, emotive, attitudinal, and social characteristics.

Functions of formulaic expressions FEs perform multiple functions in conversation (Coulmas,  1979; Kuiper,  2004). They assist in maintaining fluency (Fillmore, 1979; Kuiper, 2000), and foster social qualities such as bonding, affirmation, humor, and affective innuendo, as well as claiming group identity, manipulating others, and asserting personal identity (Bell & Healey, 1992; Kuiper, 2009). In conversation, FEs utilize repetition to signal affir­ mation, camaraderie, and shared aims (Wolf, Van Lancker Sidtis, & Sidtis, 2014). In healthy speakers, depending on topic and interlocutor, about 24 percent of spontaneous speech consists of FEs (Sidtis, Canterucci, & Katsnelson, 2009; Van Lancker Sidtis & Rallon,  2004). Significantly more than this proportion, as in Alzheimer’s (Bridges & Van Lancker Sidtis, 2013) and left‐hemisphere damage, or significantly less, as in right‐hemisphere damage and Parkinson’s disease (Van Lancker Sidtis et  al.,  2016; Bridges, Van Lancker Sidtis, & Sidtis, 2013), or other subcortical disturbance (Sidtis, Canterucci, & Katsnelson, 2009; Speedie et al., 1993), yields a deviant communicative product (Van Lancker Sidtis & Postman, 2006; see below and reviews in Van Lancker Sidtis, 2004, 2010, 2012a, 2012b). In conversation, deviant communication arises from too few FEs  –  failing to naturally express expressions of affirmation and bonding  –  or too many FEs, thereby neglecting propositional language, which carries information.

Incidence of FEs in spoken language: Mental representation The number of FEs, including conversational speech formulas, expletives, pause fillers, idioms, and proverbs known to a speaker of a language, may approach hundreds of thousands (Jackendoff, 1995). A field study of occurrences of proverbs

312  Perception of Linguistic Properties for three years in a small German town revealed frequent and daily usage of these selected kinds of FEs (Hain,  1951). When conventional expressions or “lexical ­bundles” (in the meantime), irreversible binomials (salt and pepper), sentence initials (I guess, I’d like you to meet) and verb plus particle (catch up) constructions are included (Biber,  2009; Biber & Barbieri,  2007), as well as titles, lyrics of songs, rhymes, and literary quotes, estimates increase, for which an upper limit has not been established. FEs can be single words, as in expletives (darn!: Van Lancker & Cummings, 1999) and conversational speech formulas (Right! Okay!), 2–3 words (Good boy! It’s a wrap), 4–7 or 8 words, as in most idioms and proverbs; or even many words, such as fixed, known expressions with literary or other cultural ori­ gins (Hoblitzelle, 2008). These FEs exist as formulemes in mental representation in coherent form carrying conventionalized meanings, with specific words in a certain order, canonical phonetic instantiation, and signature prosody forming traces of a complete auditory‐semantic gestalt in episodic memory. This is to say that the utterance is stored along with several characteristics: specific words in a certain order, connotative meanings, appropriate social use, and, most importantly for this review, a stereotyped phonetic‐prosodic contour (Calhoun & Schweitzer, 2012). This rather complex holistic storage constitutes a mode of processing that differs from the lexical unit and rule‐governed model that accounts for novel, grammatical language (Erman, 2007; Erman & Warren, 2000; Jiang & Nekrasova, 2007; Kuiper et al., 2007; Van Lancker Sidtis, 2012b).

Acquisition of FEs While text frequency undoubtedly plays a role in establishing many FEs, espe­ cially lexical bundles such as on the other hand and or something like that, this per­ spective is not profitable in considering idioms and proverbs, which, in surveys conducted with native speakers, are endorsed as personally familiar but not recently used or heard (Hallin & Van Lancker Sidtis, 2017; Rammell, Van Lancker Sidtis, & Pisoni, 2017). While FEs all have in common that they are known, along with their pertinent characteristics, to native speakers, as mentioned above, these features differ between subtypes of FEs. Lexical bundles are frequent, especially in written language, and are relatively literal and bereft of connotative meaning. Idioms and proverbs, on the other hand, occur much less frequently, and carry strong connotative nuances along with their nonliteral meanings (Siyanova‐ Chanturia & Van Lancker Sidtis, 2019). It has been proposed that these character­ istics, engaging attentional and affective systems in the brain, may assist in their acquisition from very few exposures. A study by Reuterskiöld and Van Lancker Sidtis (2013) tested this assumption by introducing idioms along with matched novel utterances to young children in a naturalistic setting. Selected formulaic utterances, such as cross swords with someone and go through the mill, matched with novel expressions such as give one a

Perception of Formulaic Speech: Structural and Prosodic Characteristics   313 ball and shop at different stores, were produced with their characteristic prosody in the course of working with children on craft activities. Participants were six girls in each of two age groups, 8–9 and 12–14 years, native speakers of American English, recruited from a public school in Manhattan, in New York City, and two adult teachers. In subsequent testing, the children in both age groups recognized significantly more target idioms than novel expressions, indicating that the idioms had been successfully uploaded into memory from one‐time exposure. Participants also scored higher on comprehension of target idioms than other idioms not used in the naturalistic session. These results suggest that idioms, given their characteristics of nonliteral and nuanced meanings, are apt to engage attention and arousal mechanisms, leading to retention into an FE repertory, even from one‐time exposure. It can be speculated, from the idiom identification studies reviewed in this chapter, that the proper prosodic material inherent in each idiom is also stored.

Phonetics of FEs: Stereotyped patterns Evidence for the view that FEs, with their auditory‐acoustic characteristics, are known (stored in memory) arises from several sources. In a language community, it can be observed that unique pronunciation and vocal features are embedded in known expressions, which are reproduced by persons alluding to expressions such as fuggetaboud‐it (forget about it; New York dialect); “Ex‐cy‐use me!” (signa­ ture phrase of Steve Martin); Martin’s comedy routine “What the hell is that?”, which utilizes vocal‐phonetic cues for a befuddled state; “We don’t care; we don’t have to,” telephone operator Lily Tomlin’s remark in an articulatorily lip rounded and fronted, “snooty” tone; Arnold Schwarzenegger’s Terminator’s “I’ll be back,” with his emphatic Austrian prosody. When referring to previous interactions, people may mock the typically intoned FEs of individuals, reproducing (and exaggerating) the full force of their idiosyncratic pronunciation. Ordinary FEs work best when pronounced with their “received” stereotyped prosodic contour – for example, You’ve got to be kidding! and get out! with equal syllabic stress and a glottal stop between the two words; better you than me with relative high then lower stress on the two key words. Other examples known to a native speaker are I wouldn’t want to be in his shoes; you’ve got a nerve; Gimme a break! no man is an island. FEs have stereotyped prosodic characteristics (Hallin & Van Lancker Sidtis, 2017; Lin, 2010, 2012). To pronounce these utterances with a dif­ ferent prosodic contour sounds humorous or foreign, or it may be an attempt to signal a literal interpretation. Prosodic errors on FEs can be heard from nonna­ tive speakers, even those advanced enough to have acquired a repertory of FEs in their second language; for example, a professor of German (and native speaker) at a university said, “I wouldn’t want to be in his shoes like that,” placing the sentence accent on shoes and thereby deviating from the canonical form ­mentioned earlier.


Studies of comprehension and perception of FEs There is information from written material, using such approaches as eye tracking and latencies in phrase classification tasks, that FEs and novel expressions are processed (recognized or comprehended) in ways that differ from processing of novel (newly created) expressions (Siyanova‐Chanturia, Conklin, & Schmitt, 2011; Underwood, Schmitt, & Galpin, 2004). Swinney and Cutler (1979) showed differ­ ences in recognition of FEs as compared to matched literal expressions, using reac­ tion‐time measures in a noun‐phrase recognition task. In other research, subjects recognized visually presented FEs more quickly than matched novel expressions (Gibbs & Gonzales, 1985; Tabossi, Fanari, & Wolf, 2009). For reading, studies using various research designs have reported faster reading times for formulaic expres­ sions compared to matched control strings (Conklin & Schmitt,  2008; Katz & Ferretti, 2001; Siyanova‐Chanturia, Conklin, & Schmitt, 2011; Turner & Katz, 1997; Underwood, Schmitt, & Galpin, 2004). However, reaction time for the detection of constituent elements in familiar phrases was longer with respect to speech and accuracy than in unfamiliar phrases, suggesting that these elements were not ini­ tially processed but that the phrase was apprehended as a unit. Similarly, subjects remembered and recalled the constituent words of novel expressions better than those of matched FEs (Horowitz & Manelis, 1973; Osgood & Housain, 1974). These and related studies lead to the proposal that formulaic expressions are comprehended holistically as a coherent unit, in that the whole is greater than the sum of the parts, or, said differently, a new whole emerges from the assemblage of constituents (Lounsbury, 1963; Sinclair, 1991). Other evidence of speaker knowledge of FEs is seen when subjects are able to accurately label written FEs and novels as formulaic (idiomatic) or novel (Van Lancker Sidtis & Rallon, 2004) and to appro­ priately write predicted words in blanks (in slot‐filler tasks) for FEs, while a large variety of words appear in slots provided in matched novel expressions (Van Lancker Sidtis, et al., 2015). Fewer studies have utilized spoken stimuli, the focus of this review, and even fewer have probed perception directly. An early demonstration using perception of spoken language was performed by Lieberman (1963), who suspected that words in familiar adages were pronounced differently from the same word types in novel utterances. He presented, for example, the utterances a stitch in time saves nine and the number can be nine, with the last word (nine) excised, to listeners, who were less able to perceive the excerpted words in FEs (described at the time as “high redundancy”) than those in novel utterances. Since then, some direct per­ ceptual and much indirect evidence has emerged to indicate that in the spoken mode FEs are produced and perceived differently from comparable or matched novel expressions. The first study to directly examine listeners’ abilities to distinguish between FEs and novel expressions from the auditory‐acoustic signal utilized utterances that are ambiguous regarding formulaic (idiomatic) or literal meanings, such as It broke the ice, called ditropic sentences (Van Lancker & Canter,  1981). Listeners heard these utterances excised from disambiguating paragraphs that had been read

Perception of Formulaic Speech: Structural and Prosodic Characteristics   315 aloud and were asked to indicate, on an answer sheet, whether each utterance was literal or idiomatic. Listeners were unable to correctly identify the intended meanings from the auditory signal, likely due to a bias toward labeling items as idiomatic, the status of the speech as read aloud, and the overriding influences of continuous speech on prosodic detail. In a second experiment, two native speakers produced each sentence with either an intended idiomatic or a literal meaning. The randomized utterances were tape recorded and played singly, and in a second condition in pairs, without any other context, to a group of listeners, who circled I for idiomatic or L for literal on an answer sheet. The intended meanings were iden­ tified in both conditions well above chance. The same (randomized) test materials, single and paired ditropic utterances, were later administered to four groups of listeners: native American English speakers, non‐American English speakers, highly competent nonnative speakers of English, and students in an ESL (English as a second language) class (Van Lancker‐Sidtis, 2003). Again, the native speakers recognized the intended mean­ ings well above chance, while the two second‐language groups performed at chance or below. It was concluded that the prosodic material making up formulaic, in contrast to literal, utterances (made up of the same words) is acquired early in native‐language acquisition. The finding that listeners could distinguish ditropic utterance was replicated for the Korean language (Yang, Ahn, & Van Lancker Sidtis, 2015) using utterances produced by four healthy, native speakers. Sentences with this particular kind of ambiguity, having either a literal or idiomatic meaning, were obtained for Korean, recorded, and presented auditorily to healthy listeners, who were highly successful at identifying the intended meanings. An acoustic analysis (reviewed later) revealed some of the auditory cues that may have been used in making the correct discriminations. Another study of perception of Korean ditropic utterances by listeners was per­ formed by Yang, Sidtis, and Yang (2016), using stimuli produced by healthy speakers and by persons with unilateral brain damage. Healthy listeners identi­ fied intended meanings (idiomatic vs. literal) of ambiguous idiomatic sentences in the Korean language produced by individuals with brain damage (left‐hemisphere damage [LHD], right‐hemisphere damage [RHD]) and by healthy controls (HCs). These stimulus sets (literal and idiomatic) were made up of the same words, spoken for the purposes of the experiment with either a literal or an idiomatic meaning. Native listeners were able to discriminate the intended meanings spoken by healthy speakers at a high level (82.62%). However, listeners had difficulties identifying the meanings of idiomatic utterances produced by individuals with brain damage, especially ones produced following RHD (63.81% for LHD, 38.69% following RHD). The results from HC listeners further support the notion, using a non‐Indo‐European language, that the ability to discriminate the subtle prosodic differences between idiomatic and literal utterances belongs to native listeners’ language competence. Native listeners’ deficient performance on utterances pro­ duced by persons with left or right brain damage indicates that signature prosodic cues are necessary for the successful perception of FEs. 
Acoustic analyses of these stimuli (Yang & Van Lancker Sidtis, 2016), reviewed later, cast some light on why

316  Perception of Linguistic Properties the utterances were not well recognized, and why the site of brain damage pro­ duced a difference in perceptual competence. Finally, a recent report of performance by listeners utilized two sets of spoken stimuli, 70 FEs and 70 novel (newly created) expressions matched for word and syllable count and grammatical form (but not identical in wording), which were treated with auditory noise (Rammell, Van Lancker Sidtis, & Pisoni, 2017). The stimuli were produced by a native speaker of American English. Listeners were asked to transcribe each utterance. As predicted, perceptual competence for FEs was dramatically higher than for novel expressions. This result can be attributed to at least two features: the stereotyped phonetic and prosodic shapes of FEs, which confer recognizability, and a presence in memory as a stored trace of the entire known expression, which can be readily and more efficiently accessed from less complete acoustic material. These studies converge to suggest that FEs are stored and processed as a unified configuration (formulemes), that prosodic and phonetic detail is part of the formuleme, and that these auditory‐acoustic details are utilized in perception.

Prosodic material differentiating FEs from novel expressions: Indirect measures When listeners are able to discriminate between spoken FEs and novel utterances, or to better recognize or distinguish formulaic compared to novel utterances from the auditory signal, a direct measure is being applied. This has been called a high‐ inference variable, as the dependent variable, listeners’ performance, has a direct relationship to the independent variable (perceptibility of FEs; Johnston & Pennypacker, 1993). Perceptual competence can reasonably be inferred from this kind of experiment. While so‐called indirect evidence is lower in inference power, the accumulation of data over the past decade suggests that phonetic and prosodic material, as measured from FEs and related novel expressions, play a significant role in use and perception of FEs. To support the listener results, specific auditory‐acoustic cues were measured, comparing identical utterances that are capable of conveying either an idiomatic or a propositional expression –known as ditropic expressions. In examining the pairing between the formulaic and the novel renditions of ambiguous (ditropic) expressions (e.g. It broke the ice), idioms had significantly shorter durations and fewer pauses than their literal counterparts (Van Lancker, Canter, & Terbeek, 1981). Mean fundamental frequency (F0) did not significantly differentiate the two types of utterances, nor did the other two measures of F0 (standard deviation and range). However, an increased number of local pitch contours, present on individual words, were found in the literal exemplars. The increased pausing and local pitch contours in the contrasting literal exemplars supported the notion that FEs, pro­ nounced on a single intonational contour and with no pauses, represent a kind of phonological coherence (Lin, 2010). For the literal interpretations, acoustic anal­ ysis revealed that speakers placed focus on individual words in the sentence

Perception of Formulaic Speech: Structural and Prosodic Characteristics   317 (Ashby, 2006) to distinguish the literal from the canonical prosodic shape of the idiomatic version. Van Lancker, Canter, and Terbeek (1981) speculated that these acoustic cues (fewer pitch accents and pauses and shorter duration, i.e. faster rate in the idioms) iconically reflected the nonliteral versions as unitary utterances, an interpretation that is in line with the phonological coherence proposal mentioned early. Longer durations on individual words and pausing in the literal versions serve to convey the standard, literal meanings of the words. Alert radio listeners will hear this practice in talk shows, whereby a radio host indicates their intention to communicate the literal meaning of a known formulaic expression by breaking up the unitary prosodic contour, introducing pauses between words, and high­ lighting individual lexical items. An acoustic analysis was also performed on the Korean ditropic utterances pro­ duced by healthy native speakers that were successfully discriminated by listening. The idiomatic exemplars showed greater mean intensity, intensity variation, and greater variation in syllable duration than the matched literal counterparts (Yang, Ahn, & Van Lancker Sidtis, 2015; Figure 12.1). Idiomatic utterances were produced with longer sentence durations and greater percentage of pause durations com­ pared to literal counterparts. Also, idiomatic utterances differed from literal utter­ ances with respect to overall F0 and F0 variation, with greater variation seen in the idiomatic exemplars (Figure 12.2). These cues derived from Korean exemplars differ from those reported for English, and both languages differ again in measures from a dialect of French. (No

Figure 12.1  Means and standard errors for variability of syllable duration (ms) for utterances produced by three groups on the elicitation task: left hemisphere damage (LHD); right hemisphere damage (RHD); healthy control (HC). Bars compare idiomatic and literal productions within each group.

Figure 12.2  Means and standard errors for F0 variation (coefficient of variation) for utterances produced by three groups on the elicitation task: left hemisphere damage (LHD); right hemisphere damage (RHD); healthy control (HC). Bars compare idiomatic and literal productions within each group.
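The group comparisons summarized in Figures 12.1 and 12.2 (and in Table 12.1 below) rest on per-utterance summaries of timing and F0: means, standard deviations, ranges, and coefficients of variation. The sketch below shows, schematically, how such summaries might be computed once an F0 track and syllable durations have been obtained from a pitch tracker and a segmentation tool such as Praat; the arrays, values, and variable names are invented for illustration and are not data from the studies described here.

```python
import numpy as np

# Assumed inputs for one utterance: an F0 track (Hz) with unvoiced frames coded as 0, and
# syllable durations (s). In practice these would come from a pitch tracker and a
# segmentation tool such as Praat; the numbers below are invented for illustration.
f0_track = np.array([0, 0, 112.4, 115.1, 118.0, 121.3, 0, 109.8, 107.2, 104.5, 0, 0])
syllable_durations = np.array([0.14, 0.22, 0.18, 0.31])

voiced = f0_track[f0_track > 0]                      # drop unvoiced frames before summarizing
summary = {
    "mean_f0": voiced.mean(),                        # cf. the Mean F0 column of Table 12.1
    "f0_sd": voiced.std(ddof=1),                     # F0 standard deviation
    "f0_range": voiced.max() - voiced.min(),         # F0 range
    "f0_cv": voiced.std(ddof=1) / voiced.mean(),     # F0 coefficient of variation (Figure 12.2)
    "syllable_duration_sd_ms": syllable_durations.std(ddof=1) * 1000,  # cf. Figure 12.1
    "utterance_duration_s": syllable_durations.sum(),  # crude duration proxy (ignores pauses)
}
print(summary)
```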

listening data are available for these utterances.) In Parisian French, in contrast to English and Korean, native speakers produced significantly longer sentence dura­ tions for idiomatic than for literal expressions (Abdelli‐Beruh et al., 2007). Further, differences in F0 were found. Parisian French used significantly higher F0, larger F0 standard deviations, higher maximum F0 measures, and wider F0 range for idi­ omatic expressions. The American English utterances tended to follow an opposite direction, with significant differences for F0 standard deviations, maximum F0, and F0 range, which were lower for idiomatic than literal meanings, across sen­ tence measures. A comparison of American English and Parisian French shows that speakers in each language used the same cues to distinguish literal from idio­ matic meanings of ambiguous sentences, but in an opposite manner. In English, longer durations and more variations in F0 are seen for literal utterances; in con­ trast, in French idiom literal pairs, longer durations and greater F0 variation were measured for the idiomatic than the literal versions of the sentences. The last two words in the literal or idiomatic utterances appear to provide cues to the sentence meanings in the three languages studied. Parisian French used sig­ nificantly higher F0 for the last words in the idiomatic expressions, compared to the literal counterparts, whereas American English showed a significantly lower F0. In Korean, terminal F0 measures consistently showed falling F0 in the last two sylla­ bles of literal sentences, while idiomatic sentences had rising ones. In these three different languages, one Germanic, one Romance, and a third non‐­Indo‐European,

Table 12.1  Acoustic measures for American English idiomatic and literal utterances matched for length and structure.

             Duration (s)    Mean F0 (Hz)    F0 standard deviation (Hz)    F0 range (Hz)
Idiomatic    1.384           113.889         29.197                        103.206
Literal      1.965           113.317         33.100                        116.017

acoustic parameters significantly distinguished the two sentence types, but with different profiles for each language. It can reasonably be inferred that these cues assist in the perception of idiomatic meanings in ambiguous sentences. An acoustic analysis of Swedish proverbs produced by adult speakers revealed a consistent tonal pattern that appeared to represent a proverbial stereotyped contour. This pattern differentiated spoken proverbs from patterns expected in novel sen­ tences (Hallin & Van Lancker Sidtis, 2017). Here, too, it can be inferred from these indirect measures that speakers “know” these prosodic contours, and that the pro­ sodic material contributes to successful perception of utterances as proverbs. A similar acoustic analysis was performed on the 140 utterances presented in noise to listeners for the previously mentioned transcription task of American FEs and matched literal utterances (Rammell, Van Lancker Sidtis, & Pisoni, 2017). As  reported earlier, listeners were significantly more successful in transcribing the FEs than the novel stimuli. Phrase and syllable durations, mean F0, and F0 ranges were compared between the two types of expressions. There were no significant differences between utterance types for syllable durations or the two F0 measures. However, overall phrase durations did differ significantly (t = 4.2041, df = 69, p