The Oxford Handbook of
THE MENTAL LEXICON
OXFORD HANDBOOKS IN LINGUISTICS
RECENTLY PUBLISHED
THE OXFORD HANDBOOK OF LYING Edited by Jörg Meibauer
THE OXFORD HANDBOOK OF TABOO WORDS AND LANGUAGE Edited by Keith Allan
THE OXFORD HANDBOOK OF MORPHOLOGICAL THEORY Edited by Jenny Audring and Francesca Masini
THE OXFORD HANDBOOK OF REFERENCE Edited by Jeanette Gundel and Barbara Abbott
THE OXFORD HANDBOOK OF EXPERIMENTAL SEMANTICS AND PRAGMATICS Edited by Chris Cummins and Napoleon Katsos
THE OXFORD HANDBOOK OF EVENT STRUCTURE Edited by Robert Truswell
THE OXFORD HANDBOOK OF LANGUAGE ATTRITION Edited by Monika S. Schmid and Barbara Köpke
THE OXFORD HANDBOOK OF NEUROLINGUISTICS Edited by Greig I. de Zubicaray and Niels O. Schiller
THE OXFORD HANDBOOK OF ENGLISH GRAMMAR Edited by Bas Aarts, Jill Bowie, and Gergana Popova
THE OXFORD HANDBOOK OF AFRICAN LANGUAGES Edited by Rainer Vossen and Gerrit J. Dimmendaal
THE OXFORD HANDBOOK OF NEGATION Edited by Viviane Déprez and M. Teresa Espinal
THE OXFORD HANDBOOK OF LANGUAGE CONTACT Edited by Anthony P. Grant
THE OXFORD HANDBOOK OF LANGUAGE PROSODY Edited by Carlos Gussenhoven and Aoju Chen
THE OXFORD HANDBOOK OF LANGUAGES OF THE CAUCASUS Edited by Maria Polinsky
THE OXFORD HANDBOOK OF GRAMMATICAL NUMBER Edited by Patricia Cabredo Hofherr and Jenny Doetjes
THE OXFORD HANDBOOK OF COMPUTATIONAL LINGUISTICS Second Edition Edited by Ruslan Mitkov
THE OXFORD HANDBOOK OF THE MENTAL LEXICON Edited by Anna Papafragou, John C. Trueswell, and Lila R. Gleitman
For a complete list of Oxford Handbooks in Linguistics, please see pp. 783–786
The Oxford Handbook of
THE MENTAL LEXICON
Edited by
ANNA PAPAFRAGOU, JOHN C. TRUESWELL, and LILA R. GLEITMAN
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries.
© editorial matter and organization Anna Papafragou, John C. Trueswell, and Lila R. Gleitman 2021
© the chapters their several authors 2021
Chapter 18 © John Wiley and Sons 2020
The moral rights of the authors have been asserted.
First Edition published in 2021
Impression: 1
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above.
You must not circulate this work in any other form and you must impose this same condition on any acquirer.
Published in the United States of America by Oxford University Press, 198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2021940130
ISBN 978–0–19–884500–3
DOI: 10.1093/oxfordhb/9780198845003.001.0001
Printed and bound in the UK by TJ Books Limited
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
In loving memory of Lila
Contents
List of Abbreviations
Contributors
1. Introduction Anna Papafragou, John C. Trueswell, and Lila R. Gleitman
PART I REPRESENTING THE MENTAL LEXICON
PART IA FORM
2. Phonological abstraction in the mental lexicon Eric Baković, Jeffrey Heinz, and Jonathan Rawski
3. Phonological variation and lexical form Ruaridh Purse, Meredith Tamminga, and Yosiane White
4. Neural encoding of speech and word forms David Poeppel and Yue Sun
PART IB MEANING
5. Morphology and the mental lexicon: Three questions about decomposition David Embick, Ava Creemers, and Amy J. Goodwin Davies
6. Syntax and the lexicon Artemis Alexiadou
7. Lexical semantics Ray Jackendoff
8. Logic and the lexicon: Insights from modality Valentine Hacquard
PART IC INTERFACES AND BOUNDARIES
9. Pragmatics and the lexicon Florian Schwarz and Jérémy Zehr
10. Efficient communication and the organization of the lexicon Kyle Mahowald, Isabelle Dautriche, Mika Braginsky, and Edward Gibson
11. Compositionality of concepts Gala Stojnic and Ernie Lepore
12. Language and thought: The lexicon and beyond Barbara Landau
PART II ACQUIRING THE MENTAL LEXICON
PART IIA FORM
13. Infants’ learning of speech sounds and word-forms Daniel Swingley
14. Learning words amidst speech sound variability Sarah C. Creel
PART IIB MEANING
15. How learners move from sound to morphology Katherine Demuth
16. Systematicity and arbitrariness in language: Saussurean rhapsody Charles Yang
17. Children’s use of syntax in word learning Jeffrey Lidz
18. Easy words: Reference resolution in a malevolent referent world Lila R. Gleitman and John C. Trueswell
19. Early logic and language Stephen Crain
PART IIC INTERFACES AND BOUNDARIES
20. Contributions of pragmatics to word learning and interpretation Myrto Grigoroglou and Anna Papafragou
21. Differences in vocabulary growth across groups and individuals Christine E. Potter and Casey Lew-Williams
PART III ACCESSING THE MENTAL LEXICON
PART IIIA VIA FORM
22. Spoken word recognition James S. Magnuson and Anne Marie Crinnion
23. Word-meaning access: The one-to-many mapping from form to meaning Jennifer M. Rodd
24. Learning and using written word forms Rebecca Treiman and Brett Kessler
PART IIIB VIA MEANING
25. The dynamics of word production Oriana Kilbourn-Ceron and Matthew Goldrick
26. The neural basis of word production Nazbanou Nozari
PART IIIC INTERFACES AND BOUNDARIES
27. The structure of the lexical item and sentence meaning composition Maria Mercedes Piñango
28. On the dynamics of lexical access in two or more languages Judith F. Kroll, Kinsey Bice, Mona Roxana Botezatu, and Megan Zirnstein
29. Lexical representation and access in sign languages Rachel I. Mayberry and Beatrijs Wille
30. Disorders of lexical access and production Daniel Mirman and Erica L. Middleton
References
Index
List of Abbreviations
±1  First person feature
±2  Second person feature
±N  Nominal feature
±past  Past tense feature
±perf  Perfect aspect feature
±pl  Plural number feature
A.pass  adjectival passive
A1  primary auditory cortex
ACC  anterior cingulate cortex
Acc  accusative
Act  active
AG  angular gyrus
ALE  activation likelihood estimate
ASD  autism spectrum disorders
ASL  American Sign Language
ATL  anterior temporal lobe
AxS  analysis by synthesis
BSL  British Sign Language
CC  compositionality constraint
DAS  deletion, addition, or substitution
DGS  Deutsche Gebärdensprache (German Sign Language)
DIVA  directions into velocities of articulators
DN  double negation
DS  Down syndrome
DS  default to stereotype
EARSHOT  Emulation of Auditory Recognition of Speech by Humans Over Time
ECoG  electrocorticography
EEG  electroencephalography
ERF  event-related field
ERP  event-related potential
F0  fundamental frequency
FC  composition function
FI  interpretation function
fMRI  functional magnetic resonance imaging
FWNP  frequency-weighted neighborhood probability
GODIVA  gradient order directions into velocities of articulators
HG  Heschl’s gyrus
IAC  interactive activation and competition
IFG  inferior frontal gyrus
IFS  inferior frontal sulcus
Intr  intransitive
IPA  International Phonetic Alphabet
IPC  inferior parietal cortex
ITG  inferior temporal gyrus
JSL  Japanese Sign Language
L1  first language
L2  second language
LCfC  lexically mediated compensation for coarticulation
LSC  Llengua de signes catalana (Catalan Sign Language)
LSE  Lengua de Signos Española (Spanish Sign Language)
LSF  Langue des Signes Française (French Sign Language)
LSTM  long short-term memory
MEG  magnetoencephalography
MMN  mismatch negativity
MSS  metrical segmentation strategy
MTG  middle temporal gyrus
Nact  non active
NAM  Neighborhood Activation Model
NC  negative concord
NegP  negation phrase
Nom  nominative
Pass  passive
PC  predictive coding
Perf.  perfective
PET  positron emission tomography
POA  place of articulation
PRED  predicate phrase
PreSMA  pre-supplementary motor areas
PWA  people with aphasia
SD  semantic dementia
SES  socioeconomic status
SG  singular
SMA  supplementary motor area
SMG  supramarginal gyrus
SMP  simplified mapping perspective
SNR  signal-to-noise ratio
SpT  Sylvian parietal temporal (parietal temporal junction)
SSM  selection modification model
STG  superior temporal gyrus
STS  superior temporal sulcus
SUB  subject phrase
svPPA  semantic variant of primary progressive aphasia
SWR  spoken word recognition
TİD  Türk İşaret Dili (Turkish Sign Language)
TMS  transcranial magnetic stimulation
TR  transitive
V.pass  verbal passive
VLPFC  ventrolateral prefrontal cortex
vMC  ventral primary motor cortex
VOT  voice onset time
vPMC  ventral premotor cortex
WS  Williams syndrome
Contributors
Artemis Alexiadou is Professor of English Linguistics at the Humboldt University in Berlin and Vice Director of Leibniz-Centre General Linguistics (ZAS) in Berlin. Her research is concerned with the syntax and morphology of noun phrases and argument alternations, on which she has published several articles and books. Alexiadou currently serves on the editorial board of Glossa, English Language and Linguistics, Linguistic Analysis, Linguistic Inquiry, Natural Language and Linguistic Theory, and Syntax. She is a winner of the 2014 Gottfried-Wilhelm Leibniz Prize, awarded by the German Research Foundation. She has served as Chairperson of GLOW. Eric Baković is Professor and Chair of the Linguistics Department at UC San Diego. His research is in phonological theory, with a focus on comparing and assessing how different theoretical frameworks address foundational questions of phonological representation and computation. Kinsey Bice, formerly a postdoctoral fellow at the University of Washington’s Institute for Learning and Brain Sciences and the Department of Psychology, now works as a language data researcher on the Alexa AI team at Amazon. Her research focuses on individual differences in adult language learning, their neural underpinnings, and how to improve learning for adults using neurotechnology. She is a member of the Cognitive Science Society and the Cognitive Neuroscience Society. Mona Roxana Botezatu is Assistant Professor in the Department of Speech, Language and Hearing Sciences at the University of Missouri. Her research focuses on understanding the dynamics of bilingual lexical access, the cognitive mechanisms that support L2 acquisition, and the consequences of bilingualism on native language performance. Mika Braginsky, a graduate student in Brain and Cognitive Sciences at MIT, works in Ted Gibson’s language lab and is interested in language development, psycholinguistics, and developing computational tools in those areas. Stephen Crain is Emeritus Professor of Linguistics at Macquarie University. He was an Australian Research Council (ARC) Federation Fellow, and Director of the ARC Centre of Excellence in Cognition and its Disorders. He is a fellow of the Academy of Social Sciences in Australia. He previously served as head of the Linguistics Department at the University of Connecticut and the University of Maryland. His research focuses on children’s acquisition of logical expressions, with an emphasis on the logical structures of two typologically distant languages, English and Mandarin Chinese.
xvi contributors Sarah S. Creel is Professor of Cognitive Science at UC San Diego where she directs the Language Acquisition and Sound Recognition Lab. Her research centers around how people form memory representations of sound patterns, including speech sounds, words, voices, accents, and music, and how sound pattern representations change from childhood to adulthood. She uses a combination of eye-tracking and behavioral measures to address these research avenues. Creel has served as Associate Editor at the Journal of Experimental Psychology: Learning, Memory, and Cognition and organized the 2017 conference of the Society for Music Perception and Cognition. Ava Creemers is a postdoctoral researcher in the Psychology of Language Department at the Max Planck Institute for Psycholinguistics. She received her PhD in linguistics from the University of Pennsylvania. Her dissertation research focused on lexical processing and, in particular, the processing of multi-morphemic words that are semantically opaque. She is interested more broadly in language comprehension, including language processing above the word-level. Anne Marie Crinnion is a graduate student in the Department of Psychological Sciences at the University of Connecticut, where she is a Jorgenson Fellow and a member of the Cognitive Neuroscience of Communication and Science of Learning & Art of Communication interdisciplinary training programs. Her research focuses on how acoustic, lexical, and semantic levels of processing interact in the time course of speech processing, from behavioral, neural, and computational perspectives. Isabelle Dautriche is a cognitive scientist in the Centre National de la Recherche Scientifique (CNRS) at Aix-Marseille University (France). Her research focuses on language acquisition and its link to the structure of languages and uses a range of computational and experimental methods with human infants and animals. Katherine Demuth is Distinguished Professor of Linguistics at the Macquarie University where she is Director of both the Child Language Lab and the Centre of Language Sciences (CLaS). Her research focuses on how children acquire morphosyntax, showing how their growing competence with phonological/prosodic structure influences their comprehension, processing, and production of grammatical morphology. She is an ARC Laureate Fellow, a fellow of the Academy of Social Sciences in Australia (FASSA), and a fellow of the Royal Society of NSW (FRSN). David Embick is Professor and Chair in the Linguistics Department at the University of Pennsylvania. His research in theoretical linguistics concentrates on syntactic approaches to morphology and their connections with other parts of grammar, including phonology and argument/event structure. Other research interests include questions concerning the architecture of the language faculty, experimental approaches to lexical access and representation, and employing neuroimaging and experimental techniques to examine language impairments in different clinical child populations.
contributors xvii Ted Gibson is Professor of Cognitive Science at the Department of Brain and Cognitive Sciences at MIT. He competed for Canada at the 1984 Olympics in rowing, coming seventh in the coxless 4. His research examines how language is processed and how language processing constraints affect language structure. Lila R. Gleitman is Professor Emerita of Psychology and (by secondary appointment) Linguistics at the University of Pennsylvania. She has conducted foundational work on language acquisition, and the relationship between language and other cognitive systems. She is an elected member of the National Academy of Sciences, an elected fellow of the American Association for the Advancement of Science, a Fyssen Foundation laureate, and a winner of the Rumelhart Prize in Theoretical Foundations of Human Cognition. She has served as past President of the Linguistic Society of America and the Society for Philosophy and Psychology. Matthew Goldrick is Professor in the Department of Linguistics and (by courtesy) Psychology at Northwestern University. His research draws on behavioral experiments, acoustic analysis, and computational modeling to develop theories of the cognitive and neural mechanisms underlying the production, perception, and acquisition of sound structure in mono-and multilingual speakers. He currently serves on the editorial boards of Cognitive Neuropsychology, Journal of Memory and Language, Laboratory Phonology, and Phonology. Amy J. Goodwin Davies’ research in linguistics focuses on the mental representations involved in morphological processing, integrating approaches from psycholinguistics and theoretical morphology. She completed her PhD at the University of Pennsylvania in 2018. Goodwin Davies applies the research and data analysis skills that she developed in her graduate work to her current role as a data scientist at the Children’s Hospital of Philadelphia. Myrto Grigoroglou is Assistant Professor in the Department of Linguistics and the Cognitive Science Program at the University of Toronto. She obtained her PhD in Linguistics and Cognitive Science at the University of Delaware and was a postdoctoral fellow in the Department of Applied Psychology and Human Development at the University of Toronto (OISE). Her research focuses on understanding how children acquire meaning in their native language, and how children and adults produce and comprehend language in conversation. Valentine Hacquard is Professor in the Department of Linguistics at the University of Maryland, College Park. Her research focuses on formal semantics and pragmatics, with a focus on modality and attitude reports, and their acquisition. Hacquard is currently an Associate Editor at the Journal of Semantics, and on the editorial board of Semantics and Pragmatics. Jeffrey Heinz is Professor of Linguistics at Stony Brook University with a joint appointment in the Institute for Advanced Computational Science. He conducts research in theoretical phonology and several related areas including theoretical and
xviii contributors mathematical linguistics, theoretical computer science, computational learning theory, cognitive science, and artificial intelligence. Ray Jackendoff is Seth Merrin Professor Emeritus of Philosophy and former Co- director of the Center for Cognitive Studies at Tufts University, and Research Affiliate in the Department of Brain and Cognitive Sciences at MIT. He has written extensively on syntax, semantics, morphology, music cognition, social cognition, consciousness, and the architecture of the language faculty and its place in the mind. He is a recipient of the Jean Nicod Prize in Cognitive Philosophy and of the Rumelhart Prize in Theoretical Foundations of Human Cognition, and he has served as President of both the Linguistic Society of America and the Society for Philosophy and Psychology. Brett Kessler is Associate Professor Emeritus at Washington University in St. Louis, having taught in the Linguistics and the Philosophy- Neuroscience- Psychology programs. Currently he serves as Senior Research Scientist in Psychological and Brain Sciences, where he studies the acquisition of reading and writing systems, with emphasis on the implicit learning of statistical patterns. Oriana Kilbourn-Ceron is a postdoctoral fellow in the Department of Linguistics at Northwestern University. Her research focuses on phonetic and phonological variation, and how pronunciation is shaped by the process of speech planning. She has published quantitative analyses of spontaneous speech corpora, behavioral experiments on speech production, and theoretical linguistic analyses. Her 2017 dissertation was recognized with McGill’s Arts Insight Award for excellence in doctoral research in the social sciences. Judith F. Kroll is Distinguished Professor in the School of Education at the University of California, Irvine and former Director of the Center for Language Science at Pennsylvania State University. Her research uses the tools of cognitive neuroscience to examine bilingualism. She is a fellow of the American Association for the Advancement of Science, the American Psychological Association, the Association for Psychological Science, the Psychonomic Society, the Society of Experimental Psychologists, and the American Academy of Arts and Sciences. She was a founding editor of Bilingualism: Language and Cognition (Cambridge University Press), and one of the founding organizers of Women in Cognitive Science, a group developed to promote the advancement of women in the cognitive sciences. Barbara Landau is the Dick and Lydia Todd Professor of Cognitive Science at Johns Hopkins University (secondary appointments in Psychological and Brain Sciences and Neuroscience). Her research focuses on spatial representation, language, and the relationship between these two systems of knowledge in typically and atypically developing individuals. She is an elected member of the National Academy of Sciences, an elected fellow of the American Academy of Arts and Sciences, the American Association for the Advancement of Science, and the Cognitive Science Society. She was named a
contributors xix Guggenheim Fellow and received the William James Fellow Award from the Association for Psychological Science. Ernie Lepore is Board of Governors Professor of Philosophy at Rutgers University. He is the author of numerous books and papers in the philosophy of language, philosophical logic, metaphysics, and philosophy of mind, including Imagination and Convention (with Matthew Stone), Meaning, Mind and Matter: Philosophical Essays (with Barry Loewer), Liberating Content and Language Turned on Itself, Insensitive Semantics (both with Herman Cappelen), Donald Davidson: Meaning, Truth, Language and Reality, and Donald Davidson’s Truth-Theoretic Semantics (both with Kirk Ludwig), and Meaning and Argument. He has co-authored, with Jerry Fodor, Holism: A Shopper’s Guide and The Compositionality Papers; and with Sarah-Jane Leslie What Every Student Should Know. Casey Lew-Williams is Professor in the Department of Psychology at Princeton University. He works with postdocs, graduate students, and undergraduates in the Princeton Baby Lab to study how young children learn, and his research focuses in particular on language learning. He is a Chief Editor of Frontiers for Young Minds and a Co- founder of ManyBabies. Jeffrey Lidz is a Distinguished Scholar-Teacher and Professor of Linguistics at the University of Maryland. His research explores language acquisition from the perspective of comparative syntax and semantics, focusing on the relative contributions of experience, extralinguistic cognition, and domain-specific knowledge in learners’ discovery of linguistic structure. Lidz was the Co-editor of the Oxford Handbook of Developmental Linguistics (2016) and was the Editor in Chief of Language Acquisition from 2012 to 2020. He currently serves on the editorial boards of Syntax, Journal of Semantics, Semantics & Pragmatics, Frontiers in Language Science, and the Journal of South Asian Linguistics. James S. Magnuson is Professor in the Department of Psychological Sciences at the University of Connecticut and the Director of UConn’s interdisciplinary program in Science of Learning & Art of Communication. He also leads the Computational Neuroscience group at the Basque Center on Cognition, Brain, and Language in Donostia-San Sebastián, Spain, and is a Basque Foundation for Science (Bilbao, Spain) Ikerbasque Research Professor. His research focuses on theoretical, computational, and empirical aspects of spoken and written language. Kyle Mahowald is Assistant Professor in Linguistics at the University of Texas at Austin. His research interests are in computational psycholinguistics, the relationship between NLP and linguistics, and experimental methods in the behavioral sciences. Rachel I. Mayberry is Professor in the Department of Linguistics at the University of California San Diego and directs the Laboratory for Multimodal Language Development funded by NIDCS and NSF. She is also affiliated with the Centers for Research on Language, Research and Teaching in Anthropogeny, Human Development Science, and the Joint Doctoral Program in Language and Communication Disorders.
xx contributors Her primary research investigates the psycho-and neurolinguistic correlates of the critical period for language using sign language; other research investigates language, gesture, and reading development. Mayberry serves on the editorial board of the Journal of Child Language and received the Distinguished Alumni Award for Research Leadership from the SCSD of McGill University. Erica L. Middleton is Institute Scientist and Director of the Language and Learning Lab at Moss Rehabilitation Research Institute. Her research focuses on how words are mentally represented and produced, both in healthy speakers and in people with language impairment due to stroke (aphasia). A major emphasis in her work is to delineate how treatments for language impairment can be designed in accord with fundamental principles of human learning to maximize and sustain recovery. Dr. Middleton’s research is supported by the National Institutes of Health and the Einstein Society (Einstein Healthcare Network, Philadelphia, PA). Daniel Mirman is Senior Lecturer in Psychology of Language at the University of Edinburgh. His research focuses on the neuroanatomy of spoken language processing, language deficits in post-stroke aphasia, and the organization of semantic knowledge. Mirman currently serves on the editorial boards of Cognitive Science and PLOS ONE. Nazbanou Nozari is Associate Professor in the Department of Psychology at Carnegie Mellon University. Her research focuses on the cognitive and neural basis of language production and monitoring. She is currently Associate Editor for Psychonomic Bulletin & Review, and the 2020 winner of the American Psychological Association’s APA Distinguished Scientific Awards for an Early Career Contribution to Psychology in Cognition and Human Learning. Anna Papafragou is Professor in the Department of Linguistics at the University of Pennsylvania and the Director of the University’s interdisciplinary graduate program in Language and Communication Sciences. Her research focuses on how children acquire meaning in language, how language is used and understood, and how language interfaces with human perception and cognition. Papafragou is currently Associate Editor of Language Learning and Development and serves on the editorial board of Language Acquisition, Semantics and Pragmatics, and the Yearbook of Linguistic Variation. She is an elected fellow of the Association for Psychological Science and serves on the Governing Board of the Cognitive Science Society. Maria Mercedes Piñango is Associate Professor in the Department of Linguistics and the Interdepartmental Neuroscience Program at Yale University. She directs Yale’s Language and Brain Lab. Her research focuses on the structure of linguistic meaning and its neurocognitive embedding through the study of real-time processing, intra and cross-dialectal variation, brain functional organization, and meaning change dynamics. Piñango currently serves as Associate Editor for Language and Computation, a component of Frontiers in Artificial Intelligence.
contributors xxi David Poeppel is Professor in the Departments of Psychology and Neural Science at NYU and the Director of the Neuroscience Department at the Max Planck Institute for Empirical Aesthetics in Frankfurt, Germany. He is Co-director of the Max-Planck- NYU Center for Language, Music, and Emotion. His research focuses on the brain basis of hearing, speech, and language. He is an elected fellow of the American Association for the Advancement of Science. Christine E. Potter is Assistant Professor in the Department of Psychology at The University of Texas at El Paso. Her research explores the role of experience in language learning, with an emphasis on learning across different communities and environments. Ruaridh Purse is a PhD candidate in the Department of Linguistics at the University of Pennsylvania. His research explores the structural and representational properties of sociolinguistic variables with a focus on the interface between phonetics and phonology. In particular, his work involves using instrumental techniques from articulatory phonetics to investigate the details of phonetic implementation. Jonathan Rawski is Assistant Professor of Linguistics at San José State University. His research concerns the mathematics of language and learning, applied to both human cognition and artificial intelligence systems. He received his PhD from Stony Brook University, where he was also a member of the Institute for Advanced Computational Science, and the Center for Neural Circuit Dynamics. Jennifer M. Rodd is Professor of Cognitive Psychology at University College London (UK). She has worked at UCL since 2003. Her research aims to characterize the cognitive and neural processes that support human communication—specifically, to understand the complex computations involved in decoding the meanings of spoken and written words. Her lab has primarily been funded by the UK Economic and Social Research Council with additional funding from the Leverhulme Trust and the Experimental Psychology Society. Prior to joining UCL, she was a research fellow at the University of Cambridge (UK). Florian Schwarz is Associate Professor and Undergraduate Chair of Linguistics at the University of Pennsylvania, where he also serves as the Associate Director for Education of MindCORE, Penn’s hub for the integrative study of the mind. His research is concerned with the study of meaning, combining formal tools from semantics and pragmatics in theoretical linguistics with experimental methods from psycholinguistics to study phenomena such as presuppositions, implicatures, and definite descriptions. He was co-Editor of the Pragmatics and Semantics section of the Language and Linguistics Compass from 2017–2020, and is an Associate Editor of the open access journal Glossa Psycholinguistics founded in 2021. He serves on the editorial boards of Semantics and Pragmatics as well as Natural Language Semantics. Together with Anna Papafragou, he initiated the interdisciplinary conference Experiments in Linguistic Meaning. With Jérémy Zehr, he maintains PCIbex, a free and open access online experiment platform.
xxii contributors Gala Stojnic is a developmental cognitive researcher, interested in how young children and infants represent the surrounding world in the course of their development. Her research focuses on the nature and development of core cognitive systems that allow for efficient learning about agents, objects, spaces, and other domains of knowledge. Gala is currently a postdoctoral research scientist at New York University. She earned her MS and PhD from Rutgers University, where she studied the intersection between social cognition and word learning. Yue Sun is a researcher in the Neuroscience Department of the Max Planck Institute for Empirical Aesthetics in Frankfurt. He obtained his PhD in cognitive neuroscience at the University of Paris VI. His current research focuses on linguistic regularities in the lexicon and speech of various languages and how these regularities interface with computational properties of speech operations as well as language learning. His work combines quantitative analyses of linguistic corpora, experimental psychology, and neuroimaging. Daniel Swingley is Professor and presently Chair of the Department of Psychology at the University of Pennsylvania. His research concerns the beginning of language learning in infancy, particularly in the domains of sound learning and word learning. Swingley is Associate Editor of the journal Language Learning and Development, and Treasurer of the Society for Language Development. He serves on the editorial boards of several journals and is a winner of the Lindback Award, Penn’s highest award for teaching. Meredith Tamminga is Associate Professor in the Department of Linguistics at the University of Pennsylvania. Her research investigates the psycholinguistic mechanisms involved in sociolinguistic production and perception across different levels of grammar as well as the role of such mechanisms in language change. She directs the Language Variation and Cognition Lab at Penn and is currently the Associate Director of Outreach for MindCORE, Penn’s hub for the integrative study of the mind. Rebecca Treiman is the Burke and Elizabeth High Baker Professor in the Department of Psychological and Brain Sciences at Washington University in St. Louis. She conducts research on language processing and language development, with a special focus on reading and spelling and their acquisition. She is the past Editor in Chief of the Journal of Memory and Language, an elected fellow of the Association for Psychological Science, and a winner of the Distinguished Scientific Contributions Award from the Society for the Scientific Study of Reading. John C. Trueswell is Professor in the Department of Psychology at the University of Pennsylvania and the Co-director of the University’s MindCORE initiative in Integrative Language Science and Technology. His research focuses on understanding how children develop the ability to process language in real-time, and how this ability interacts with the acquisition of language. Trueswell is currently Associate Editor of Language Learning and Development and serves on the editorial board of Cognition. He is an
contributors xxiii elected fellow of the Association for Psychological Science, the American Association for the Advancement of Science, and the Cognitive Science Society. Yosiane White is a PhD candidate in the Department of Linguistics at the University of Pennsylvania. Her research investigates the mental representation of morphological and phonological variables. She uses experimental auditory priming paradigms to tease apart aspects of shared representation of multi-morphemic words containing variable suffixes. Beatrijs Wille is a postdoctoral researcher in linguistics at Ghent University (Belgium). She has a background in speech, language, and hearing sciences, and in linguistics. During her interdisciplinary doctorate, she conducted foundational work on Flemish Sign Language (VGT) acquisition by focusing on the early development of VGT, and the visual communication of parents with deaf children. Later she joined the Mayberry Laboratory for Multimodal Language Development, and worked as a visiting researcher at other highly regarded research centers such as the VL2 lab at Gallaudet University (USA) and the Language and Communication across Modalities Laboratory (Italy). Her postdoctoral research focuses on early literacy of deaf children in Flanders. Charles Yang teaches linguistics and directs the Program in Cognitive Science at the University of Pennsylvania. His honors include a Guggenheim fellowship and the Leonard Bloomfield Award from the Linguistic Society of America for his most recent book The Price of Linguistic Productivity: How Children Learn to Break the Rules of Language (MIT Press 2016). Jérémy Zehr is an application developer for the Linguistic Data Consortium at the University of Pennsylvania. His research investigates how the various contexts in which people use words and sentences influence their short-and long-term understanding of those words and sentences. Zehr is also the designer of PCIbex, a platform designed to develop and run online experiments. Megan Zirnstein is Assistant Professor in the Department of Linguistics and Cognitive Science at Pomona College. Her research focuses on the interactions between cognitive control and reading comprehension ability across the lifespan and across monolingual and bilingual contexts. Her interdisciplinary work utilizes behavioral, psychophysiological, and neurophysiological techniques to understand and develop theoretical explanations for the impact that bilingual experience and second language learning have on cognition. Zirnstein was previously a National Science Foundation PIRE and SBE postdoctoral research fellow at the Pennsylvania State University, and research scientist at the University of California, Riverside.
Chapter 1
Introduction
Anna Papafragou, John C. Trueswell, and Lila R. Gleitman
And Adam gave names to all cattle, and to the fowl of the air, and to every beast of the field.
Genesis 2:20

What’s in a name? That which we call a rose by any other name would smell as sweet.
W. Shakespeare, Romeo and Juliet

Don’t gobblefunk around with words.
R. Dahl, The BFG
Throughout history, human beings have been fascinated, enlightened, misled, and moved by words. From parents marveling at their child’s first sounds, to writers, politicians, and anyone else who has had to engage in a conversation, words are a central part of the human experience. The word, properly understood, lies at the heart of the human social and cognitive mentality. Furthermore, the capacity to invent new words (graphically honored in William Blake’s depiction of Adam naming the animals on the cover of this volume) is one of the essential evolutionary gifts to our species. This Handbook is a comprehensive survey of the state of the art on the “mental lexicon,” the representation of language in the mind/brain at the level of individual words and sub-word meaningful units (morphemes). Despite the centrality of words to our mental life, in the past—even the relatively recent past—the topic of this Handbook would have been considered too particularistic to be of much interest outside the traditional realms of lexicography and etymology. A long line of scholars beginning at least with Saussure (1916) has considered the lexicon to be a repository of exceptions and idiosyncrasies, very much unlike the rule-governed nature of the grammar. In recent years, however, the study of words as mental objects has grown rapidly across several fields, including linguistics, psychology, philosophy, neuroscience, education, and computational cognitive science. As a result, the
mental lexicon is seen in most quarters today as far more structured and integrated with the mental grammar. Relatedly, it is now recognized that many features of language development and psycholinguistic processing have to be couched at the lexical level as well. Despite many present-day theoretical differences within and across individual disciplines, then, by general consensus the lexicon itself has become the nexus of much significant empirical research and theorizing within modern cognitive science. The goal of this Handbook was to connect and synthesize these recent theoretical and empirical advances.

The key issues surrounding the mental lexicon can be articulated in terms of three framing questions: What exactly do you know when you know the lexicon in your native language? How did you come to know it? And how do you put that knowledge to use? Viewed in this way, the mental lexicon epitomizes the very issues that have been posed for the study of human language as a whole (see Chomsky, 1986). The present Handbook is divided into three parts corresponding to the three framing questions above about the representation, acquisition, and processing of lexical knowledge. Each part follows a similar organizational structure: it begins with chapters that discuss issues of form (mostly, phonology); it then moves on to contributions pertaining to meaning (drawn from the domains of morphology, syntax, and semantics); and it concludes with chapters addressing the interface of the lexicon with other linguistic or non-linguistic (cognitive or communicative) systems in ways that typically straddle disciplinary boundaries. Chapters are written by leading authors from a variety of backgrounds who present their own perspective on an aspect of the lexicon within an up-to-date summary of the state of the art.

Part I (‘Representing the Mental Lexicon’) introduces modern linguistic and cognitive theories of how the mind/brain represents words and sub-word units at the phonological, morphological, syntactic, and semantic levels. This part also discusses broad topics at the interface of the lexicon and other linguistic and non-linguistic systems, such as the contribution of pragmatics to lexical interpretation, the role of compositionality, and the relation between words and concepts. Individual chapters comment on highly interrelated questions, including: How are speech and word forms encoded in the mind/brain? How do we map continuously varying acoustic input to discrete mental representations that form the basis for stored words? How abstract are phonological representations of words in the lexicon? How should we capture the fact that words are produced differently by different individuals and by a single individual on different occasions? Do words break down into smaller meaningful units? What is the best way of accounting for the relation between the lexicon and syntax? How does the mind organize our vast knowledge of word meanings, and how does such knowledge relate to the human conceptualization of the world? How is logical vocabulary represented? What is the appropriate division of labor between aspects of meaning that are semantically encoded in words vs. inferred through pragmatic processes in context? What constraints does the process of combining word meanings to form phrasal or sentential meanings place on theories of lexical semantics?
What do lexical universalities and diversities say about informational and communicative constraints, and generally about human cognition? By adopting distinct but often complementary perspectives, this first
set of chapters outlines core generalizations about the contents of the (adult) mental lexicon across linguistic domains and illustrates the rich connections between the mental lexicon and other linguistic and extra-linguistic systems within mental architecture. This series of chapters also sets the stage for the next sections that focus on how the contents of the mental lexicon are learned and accessed during language processing.

Part II (‘Acquiring the Mental Lexicon’) turns to the process through which children learn the phonological, morphological, syntactic, and semantic properties of words in their native language. This part also discusses contributions of pragmatics to word learning and interpretation as well as the variability observed in lexical learning. Individual chapters address questions such as: How do children acquire the phonological shape of words in their native language, and how do they handle the variability in how words sound across speakers and contexts? How do children begin to learn the internal structure of words and the rules that govern such structure? What is the role of syntax in word learning? What sources of evidence contribute to the acquisition of word meanings through infancy into adulthood? What meanings do child language learners assign to logical expressions, and how do these meanings correspond to classical logic? How do considerations of speaker intention affect how words are learned and used? And what accounts for the fact that vocabulary learning differs both across individual children and across groups of learners? Together, these chapters show how our understanding of the nature and contents of the mental lexicon in adults informs theories of lexical development in children (and, inversely, how developmental data might provide constraints on linguistic and cognitive theories of the mature-state lexicon). The questions at the core of these chapters bear directly on the prior knowledge that the learner does (and does not) bring to the task of language acquisition, and on the relative contribution of learner-driven and environment-driven factors to different aspects of lexical learning. Despite their differences in topics and angles, the authors in this collection of chapters aim to understand how building a mental lexicon requires the coordination of prior knowledge, information processing capacity, and extra-linguistic cognitive and communicative resources; relies on proper access to and understanding of information available in the linguistic input; and is modulated by the child’s computational and cognitive capacities and limitations at different stages of development.

Finally, Part III (‘Accessing the Mental Lexicon’) examines how the mental lexicon contributes to language use during listening, speaking, and conversation, and includes perspectives from bilingualism, sign languages, and disorders of lexical access and production. The chapters in this part ask foundational questions such as: How are spoken words recognized from the speech signal? How are written words recognized from their orthographic form? What are the processes that are involved in reading, and how do children learn to read and spell? How are words accessed in the mind/brain as people prepare to speak? Does lexical access change when speakers plan and produce multiple words in connected speech? How does stored lexical meaning interact with context as people understand language during conversation? How does bilingualism alter the mental lexicon?
Do lexical access and use differ in people learning sign languages? How should we characterize difficulties of lexical access in aphasia and other neurogenic and developmental disorders, as well as in typical aging? Together, these chapters
emphasize that mobilizing the mental lexicon during listening and speaking is best understood as a dynamic interaction of information from multiple levels of representation that engages stored linguistic knowledge, information processing capacity, and extra-linguistic cognitive resources. These themes, present throughout the volume, take on special significance here because psycholinguistic studies combine evidence from a broad array of tools, methods, and populations to investigate the link between the rapidly unfolding processes of lexical comprehension or production and the underlying mechanisms that guide these processes. As a result, such studies provide particularly promising sources of evidence in adjudicating among competing theories of lexical organization.

As is clear from this brief summary, far from being separate, the three framing questions that organize this Handbook are highly intertwined and the corresponding research programs that have been designed to address lexical representation, acquisition, and access can mutually inform and constrain one another. Within and across the three parts of this volume, for instance, specific proposals about lexical knowledge in mature (adult) minds often rely on evidence from vocabulary acquisition or processing and, inversely, theories of lexical acquisition and access are informed by linguistic and cognitive theories of the contents of the lexicon. Furthermore, even though individual chapters were designed to offer a stand-alone overview of the empirical and theoretical state of the art within a particular domain, throughout the entire book similar issues are addressed in considerable depth across multiple chapters. Recurring themes include, among others, the nature of abstractness in representations of lexical form and meaning (especially in the face of rampant surface variability); the role of experience in lexical representation, acquisition, and processing; the boundary between shared and specific properties of the lexicons in languages of the world (both spoken and sign); the relation between stored lexical knowledge and computations performed over such knowledge; the tension between productivity and arbitrariness; the relation between the lexicon and other levels of representation within the language faculty as well as to non-linguistic cognitive and communicative mechanisms; and the nature and extent of individual and population-level differences in lexical knowledge, acquisition, and processing. The present tri-partite organization of each part of the Handbook in terms of form, meaning, and interfaces or boundaries highlights these common themes and makes it easy to track them throughout the book.

The present collection is unique in offering both a comprehensive overview of the set of phenomena that the study of the mental lexicon is responsible for, and a definitive compendium of the kinds of theories at the representational, learning, and processing level that can adequately describe and explain these phenomena. The picture of the lexicon that emerges from these pages reveals a rapidly changing and dynamic field that spans multiple topics, theories, and methods.
It covers a variety of different languages and populations of language users, and moves along multiple levels of analysis from the neural to the social underpinnings of words, and multiple timescales ranging from years (as vocabulary grows from infancy to adulthood) to milliseconds (as lexical information is processed on a moment-to-moment basis). Furthermore, it is clear that the study of
the mental lexicon is defined by complex questions about how the lexicon is psychologically represented that lie at the core of the modern cognitive science of language but defy simplifying assumptions or neat disciplinary labels. To navigate this rich intellectual territory, the present volume brings together theoretical and empirical traditions that until recently were independent of one another. This strongly interdisciplinary approach highlights cutting-edge theoretical and methodological advances but, just as importantly, also identifies areas of disagreement, debate, and inconsistencies or gaps across the field that create opportunities for scientific progress. It is our hope that this book will inspire the current and next generation of researchers in ways that lead to new empirical discoveries and theoretical breakthroughs for the study of words in the mind.

***
Postscript

As this volume was going to press, Lila Gleitman passed away at the age of 91. The loss of our close friend and collaborator was profound. Lila was an inspiration to us and a guiding light to many, both personally and intellectually. A giant in our field, Lila was legendary for her passion for words (and ideas!)—but also for people and friendships. She worked on this book tirelessly through all its many stages, even choosing a beloved painting for the cover. We will miss her beyond words.

Anna Papafragou and John C. Trueswell
Lila Gleitman (1929–2021)
PART I
REPRESENTING THE MENTAL LEXICON

Part IA
FORM
Chapter 2
Phonological Abstraction in the Mental Lexicon
Eric Baković, Jeffrey Heinz, and Jonathan Rawski
2.1 Talking and listening

When we try to understand how we speak and how we listen, an unavoidable fact is that we use the same vocabulary items again and again. So, not only are these items stored in memory but also they are learned on a language-specific basis. In this way, we directly confront the issue of mental representations of speech, that is, something about how we store, retrieve, and remember units of speech. What are these units? What are these mental representations? These questions are central to generative phonology, a fact reflected in the title of a collection of important papers by Morris Halle, a founder of the field: From Memory to Speech and Back Again (Halle, 2003). In this chapter, we review some of the most basic—and most fascinating—conclusions and open questions in phonology regarding how abstract these mental representations of speech are.

In an era of big data and data mining, the prevailing attitude in science and scientific observation seems to be to store every detail, never to assume any detail is irrelevant no matter how small because you never know when it may in fact make a difference somewhere. This attitude is also present in the psychology of language, where it has been shown that people are sensitive to astounding details of aspects of speech reflecting people’s age, gender, social status, neighborhood, and health (Coleman, 2002; Pierrehumbert, 2002; Johnson, 2006). Pierrehumbert (2016) reviews this literature and highlights some theoretical accounts (see also Purse, Tamminga, and White, this volume).
In the context of this prevailing attitude, the theory of generative phonology makes some surprising claims. First, it claims it is necessary to abstract away from much of this detail to explain systematic aspects of the pronunciations of the morphemes, words, and phrases we utter. In other words, while a person is sensitive to subtle phonetic details that are informative about various aspects of a speaker’s condition, the theory of generative phonology claims those very same details are irrelevant to the very same person’s mental representations of the pronunciations of morphemes and words. As such, the mental representations of these pronunciations are particular abstractions. We review the arguments for this position in Section 2.3. Section 2.4 shows how these ideas lead to the phoneme and to distinctive features, necessarily abstract categories that are said to be the fundamental units out of which the pronunciation of vocabulary items is built.

We then turn to a question posed by Paul Kiparsky: how abstract is phonology? (Kiparsky, 1968)—or, more specifically, how abstract are the mental representations of speech? A vigorous back-and-forth debate over this question followed, but now, over fifty years later, there is still no consensus regarding the answer. A basic problem is that some languages present good evidence that morphemes are specified with phonological content that is never realized as such in any surface manifestation of those morphemes. At issue is precisely what kinds of evidence justify such abstract phonological content. In our minds, this question, and others related to it, are particularly important and remain among the most interesting and understudied questions in phonology today. But first, we begin this chapter with some thoughts on the nature of abstraction and idealization.
2.2 What is abstraction?

The first thing to point out about abstraction is not how odd it is but how common it is. Orthographic letters are abstractions. The capital letter “A” and the lowercase letter “a” are quite different and yet at some level of abstraction they are referring to the same thing. Money is an abstraction. Whole numbers and fractions are also abstractions. For example, there is no such thing as “three.” There are only examples of collections of three items. Such abstractions are not only taken for granted but they are also valued: the earlier that toddlers and young children learn such abstractions the more we marvel at their intelligence.

A very common approach to problem solving one learns in grade school is shown in the diagram in Figure 2.1. For example, in an elementary-level math class, students are often given a problem expressed in plain language. Their job is to extract the relevant information, organize it into an equation, solve the equation, and report the answer. Figure 2.2 shows a fourth grade math exercise.
Figure 2.1 Abstraction, problem-solving, and realization: a problem arising in a complicated, messy system is lifted by abstraction to a level at which it can be solved, and the solution is then realized back in the concrete system.

Figure 2.2 A fourth grade math exercise illustrating the process in Figure 2.1: the question “A farmer has two dozen pigs. Seven more pigs are born, and two pass away. How many pigs are there now?” is abstracted to 24 + 7 − 2, solved to give 29, and realized as the answer “29 pigs.”
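The same pipeline can be sketched in a few lines of code. The snippet below is our own illustration rather than anything from the chapter, and all function and variable names are invented for exposition; the point is simply that the solving step operates only on the abstraction (the numbers), never on the pigs themselves.

```python
# A minimal sketch (our own, not from the chapter) of the process in Figures 2.1
# and 2.2: abstract the concrete problem into numbers, solve at that level,
# then realize the result back in concrete terms. All names are illustrative.

def abstract(problem: dict) -> tuple[int, int, int]:
    """Abstraction: keep only the counts; everything else about the pigs is discarded."""
    return problem["start"], problem["gained"], problem["lost"]

def solve(start: int, gained: int, lost: int) -> int:
    """Solving: pure arithmetic; no reference to pigs (or coins, or anything concrete)."""
    return start + gained - lost  # 24 + 7 - 2

def realize(result: int, unit: str) -> str:
    """Realization: map the abstract answer back onto the concrete situation."""
    return f"{result} {unit}"

farm_problem = {"start": 24, "gained": 7, "lost": 2}  # two dozen pigs; 7 born; 2 pass away
print(realize(solve(*abstract(farm_problem)), "pigs"))  # -> 29 pigs
```

The middle step would serve equally well for coins or merchandise, which anticipates the point about multiple realizability made just below.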
Similarly, when responding to challenges that his law of falling bodies did not correspond to the real world, Galileo used an analogy to accounting:

what happens in the concrete [. . .] happens in the abstract. It would be novel indeed if computations and ratios made in abstract numbers should not thereafter correspond to concrete gold and silver coins and merchandise [. . .] (quoted in Wootton, 2015, pp. 23–24)
Here Galileo mentions an important feature: one can make computations with abstract numbers without directly referencing the particular objects that implement them. Given a proper abstraction, one can make an inference or calculation at the level of numbers with results that then correspond to some specific physical effect or process. This property, that a given abstraction can have many implementations, is known as multiple realizability (Bickle, 2020). An abstract object is multiply-realizable by a number of concrete objects. The concrete objects might differ from each other in various ways, but the ways in which they differ are irrelevant to the abstraction. Orthographic letters and the mental representations of units of speech are abstract in this way, too. The second important point about abstraction is that there are degrees of abstraction. Even the question in Figure 2.2 as written in plain language is abstract, since, for example, it uses numbers to explain the situation and not, for instance, a photograph of
the two dozen pigs and/or a video of seven pigs being born and two others dying.
[Figure 2.3 The Abstract-o-Meter, illustrating degrees of abstraction: “too realistic,” “just right,” and “too abstract” (from https://computersciencewiki.org/index.php/Abstraction).]
Figure 2.3 illustrates the concept of degrees of abstraction. The only change we would make to it would be to add question marks after each of the subcaptions in the figure. The question is always “Is the level of abstraction too realistic, too abstract, or just right?” Answering this question is not easy. It is not at all obvious what the right level of abstraction is, and it is worth deliberately considering whether a particular degree of abstraction is appropriate or not. We are in an age of big data, and the milieu of the age seems to be to err on the side of “too realistic” and to avoid analyses that are “too abstract.” Our own view is that this is shortsighted. Many things were at one time considered “too abstract” but are now recognized as being perfectly reasonable and in fact essential and useful. When one studies the history of mathematics, it is striking how often notions once considered “too abstract” were dismissed as crazy or useless. This includes things like the number zero, real numbers, −1, uncountable infinity, number theory, and so on. When Pythagoras and his group realized that √2 could not be expressed as a fraction of whole numbers, they called it “irrational,” literally “unreasonable.” The term sticks today. Cantor went mad after realizing that infinitely-sized sets had different degrees of cardinality; today this fact underlies the Church-Turing thesis of computability. Complex numbers caused much consternation when they were developed in the 17th and 18th centuries, but they are routinely used today to understand complex physical systems. Developments in number theory in the early part of the 20th century had no purpose other than to satisfy some strange mathematical aesthetic, and these now underlie secure cryptographic communications, protecting business operations, journalists, and other vital communications. In short, we are sympathetic to the view put forth by Cheng (2015, p. 22): “Abstraction can appear to take you further and further away from reality, but really you’re getting closer and closer to the heart of the matter.”1 In this chapter we show that abstractness is a property exploited both by speakers of natural languages and by scientists describing the linguistic knowledge of those speakers.
1 One of the clearest recent expositions on abstractness and its virtues is in Chapter 2 of Cheng’s book, How to Bake π, which we encourage readers of the present chapter to read.
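For readers who find it helpful to see the idea in executable form, the following minimal Python sketch (an illustration of ours, not drawn from any of the works cited above) separates the abstract computation of Figure 2.2 from the concrete situations that realize it; all function and variable names are invented for the example.

```python
# A minimal sketch (ours, purely illustrative) of the abstraction/solving/realization
# loop in Figures 2.1 and 2.2. All names below are invented for the example.

def solve(start, gained, lost):
    """The abstract computation: it operates on numbers, not on pigs or coins."""
    return start + gained - lost

# Two concrete situations that realize the same abstract problem:
pigs = {"start": 24, "gained": 7, "lost": 2}    # the farmer's pigs
coins = {"start": 24, "gained": 7, "lost": 2}   # Galileo's gold coins

# The ways in which pigs and coins differ are irrelevant to the abstraction,
# so a single computation answers both questions (multiple realizability).
print(solve(**pigs))    # -> 29
print(solve(**coins))   # -> 29
```

The point is simply that the ways in which pigs and coins differ play no role in the computation; that is what it means for the abstraction to be multiply realizable.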
2.3 Phonology and the mental lexicon
The pronunciation of a word is a sequence of events in time. From the perspective of speech production, these events are defined articulatorily. From the perspective of speech perception, these events can be defined acoustically and perceptually. The International Phonetic Alphabet (IPA) defines categories of speech sounds articulatorily and provides a symbol for each categorized speech sound. For consonants, these symbols specify three aspects of articulation: the place of articulation, the manner of articulation, and the activity of the vocal folds. Symbols representing vowels primarily specify the degree of jaw aperture, how forward the positioning of the root of the tongue is, and any rounding of the lips. For example, the pronunciation of the word ‘math’ is transcribed in the IPA as [mæθ], as there are three distinct speech sounds in sequence: [m], which is articulated by stopping airflow at the lips but releasing it through the nose, with the vocal folds held together such that they vibrate; [æ], which is an open front unrounded vowel; and [θ], which is articulated by constricting but not stopping airflow with the blade of the tongue between the teeth with the vocal folds spread apart. For most speech acts, these articulations are similar across speakers of the same speech variety, despite individual physiological variation. For a word like ‘tune,’ one transcription of it using the IPA is [tun], often referred to as a broad transcription. In contrast, a narrow transcription of this word using the IPA is [thũːn]. The difference between the broad and narrow transcriptions is the degree of abstraction. Both transcriptions reveal systematic aspects of standard American English speech. However, the broad transcription only includes so-called contrastive information and the narrow transcription includes some non-contrastive information as well, indicated in this case with various diacritic marks: the aspiration on the [t], indicated with [h], and the nasalization [~] and extra duration [ː] of the vowel [u]. Both kinds of transcriptions can be used as abstract representations of the pronunciation of this word stored in memory, and one could ask whether the long-term memory representation of the pronunciation of the word ‘tune’ is more like the broad transcription, the narrow transcription, or something else. One can ask if there is only one long-term memory representation of the pronunciation of the word ‘tune,’ or if there are multiple, possibly partially redundant representations. All of these hypotheses are open to study. In the remainder of this chapter, we use broad IPA transcriptions as we discuss representations of the pronunciations of words. The degree of abstractness chosen is not critical, but we settle on this one to facilitate the discussion that follows. With this in mind, we ask the question: what does modern generative phonological theory say about the mental representations of speech? The central empirical fact that informs this question is that, in many languages of the world, morphemes are pronounced in different ways depending on context. To illustrate with an example, Odden (2014) draws attention to the pattern exhibited by the different verb forms in Kerewe shown in Table 2.1. There is an interesting difference between the first group of verb forms and the second group of verb forms.
Table 2.1 Kerewe verbs, from Odden (2014, pp. 88–89)

Infinitive     1sg habitual   3sg habitual   Imperative   gloss
[kupaamba]     [mpaamba]      [apaamba]      [paamba]     ‘adorn’
[kupaaŋɡa]     [mpaaŋɡa]      [apaaŋɡa]      [paaŋɡa]     ‘line up’
[kupima]       [mpima]        [apima]        [pima]       ‘measure’
[kupuupa]      [mpuupa]       [apuupa]       [puupa]      ‘be light’
[kupekeʧa]     [mpekeʧa]      [apekeʧa]      [pekeʧa]     ‘make fire w/stick’
[kupiinda]     [mpiinda]      [apiinda]      [piinda]     ‘be bent’
[kuhiiɡa]      [mpiiɡa]       [ahiiɡa]       [hiiɡa]      ‘hunt’
[kuheeka]      [mpeeka]       [aheeka]       [heeka]      ‘carry’
[kuhaaŋɡa]     [mpaaŋɡa]      [ahaaŋɡa]      [haaŋɡa]     ‘create’
[kuheeba]      [mpeeba]       [aheeba]       [heeba]      ‘guide’
[kuhiima]      [mpiima]       [ahiima]       [hiima]      ‘gasp’
[kuhuuha]      [mpuuha]       [ahuuha]       [huuha]      ‘breathe into’
In the first group, for example, the pronunciation for the verb stem ‘adorn’ is consistently [paamba] regardless of whether it is in the infinitival form (prefixed with [ku]), the 1sg habitual form (prefixed with [m]), the 3sg habitual form (prefixed with [a]), or the imperative form (not prefixed). It thus makes sense to assume that /paamba/ represents in a Kerewe speaker’s mental lexicon the major features of the pronunciation of the verb stem ‘adorn.’2 However, when the same kind of morphological analysis is applied to the forms of the verb ‘hunt’ in the second group of verb forms, we find that the verb stem’s pronunciation in the 1sg habitual form is [piiɡa] whereas in the other forms it is [hiiɡa]. So the question naturally arises, what long-term memory representation of the pronunciation of the verb stem ‘hunt’ do speakers of Kerewe have? One possibility is that both /piiɡa/ and /hiiɡa/ are stored, along with the knowledge that [piiɡa] is used in the 1sg habitual form and that [hiiɡa] is used otherwise. This is fine as far as it goes, but it is of interest that the other verbs in this second group pattern exactly the same way: they begin with [p] in the 1sg habitual form, and with [h] otherwise. Furthermore, there are no verb stems in Kerewe that begin with [h] in the 1sg habitual form. Taken together, these observations suggest there is something systematic about the variation in the pronunciation of the various forms of this group of verbs in Kerewe, a systematicity that storage of all pronunciations plus information about their distributions does not readily or insightfully capture.3
2 Here and elsewhere in this chapter we follow traditional phonological notation of transcribing the mental representation of speech between slashes and the actual pronunciation within square brackets. When the distinction is immaterial, we use italics.
3 More—and more detailed—arguments against this “morpheme alternant theory” can be found in Kenstowicz and Kisseberth (1979, pp. 180–196).
The methods of modern generative phonology lead analysts to posit that the long-term memory representation of the pronunciation of the verb stem ‘hunt’ that speakers of Kerewe have is /hiiɡa/, and that h representations are transformed to p representations when they immediately follow m representations. Consequently, the 1sg habitual form of ‘hunt’ is [mpiiɡa] because it derives from /m+hiiɡa/ by application of this phonological transformation.4 This phonological analysis explains the systematic variation observed because it predicts that every /h/-initial verb stem ought to be realized as [p]-initial when it follows any prefix ending with m, such as the 1sg habitual. There are two key reasons why the mental representation of ‘hunt’ cannot instead be /piiɡa/, with p transformed to h when it immediately follows not-m. First, it is assumed that phonological transformations cannot make direct reference to the negation of a class of speech sounds, in this case to ‘not-m.’ Second, and more significantly, it has been independently determined that the members of the first group of verb stems in Table 2.1 all begin with /p/, and yet the putative transformation of p to h immediately following not-m does not apply to those verb forms. Positing that members of the first group of verb stems begin with /p/ and that members of the second group begin with /h/ (transformed to p immediately following m) succinctly and systematically distinguishes the two groups. One argument against this position is that there are cases where it does seem like two distinct pronunciations of a morpheme are stored. A well-known example is the nominative suffix in Korean, which has two pronunciations: [ka], which is suffixed onto vowel-final words, and [i], which is suffixed onto consonant-final words. This alternation is phonologically conditioned, but it is nevertheless a stretch of a phonologist’s imagination to concoct a rule that transforms /ka/ to [i], or /i/ to [ka], or some other machination that relates these two forms of the nominative suffix in some phonological way. Furthermore, there are no other places in the Korean lexicon or language that exemplify any k ~ ∅ or a ~ i alternation, where ‘∅’ denotes the absence of a phone. In other words, a single posited underlying form and rules (whatever they are) would only account for the alternation between [ka] and [i] in the nominative suffix. So, in Korean, the best analysis of the nominative suffix alternation appears to be a long-term memory representation of the pronunciation of the suffix as {/i/, /ka/} with the choice of which one to select being based on the phonological properties of the stem the suffix attaches to. Given that such examples exist in the world’s languages, the question is: why don’t we just use the same kind of analysis for Kerewe, for example? The answer is the one given above. The systematicity observed in the realization of Kerewe verb stems points to the straightforward phonological analysis in that case, and the lack of systematicity observed in the realization of the Korean nominative suffix points to the alternant selection analysis in that case. For more discussion of phonologically conditioned allomorphy, see Nevins (2011).
4 The “+” indicates the presumed morphological boundary between the 1sg habitual prefix /m/ and the verb stem /hiiɡa/.
The argument just made is the basic one for the position that morphemes are stored in long-term memory with a single representation, known as the underlying form. The fact that the same morpheme can be pronounced in different ways depending on context is due to phonological transformations of this underlying form into its various surface forms. The fact that these transformations are systematically applied explains the systematicity in the alternations of the morphemes across the vocabulary of the language. The phonological analysis in the generative tradition thus comes with two symbiotic parts. The first posits, where systematicity and its explanation demand it, a single mental representation in long-term memory of the pronunciation of a lexical item—that item’s underlying form. The second is the transformational part of a phonological grammar that defines how underlying forms are mapped to surface forms, which are more concrete representations of the pronunciation of the lexical item in the particular context in which it is realized. For example, in the case of Kerewe, the underlying form for ‘hunt’ is /hiiɡa/; the underlying form for the 1sg habitual form of ‘hunt’ is /m+hiiɡa/. There is a phonological transformation changing h to p immediately following m, and so the surface form in this case is [mpiiɡa]. Once underlying forms and transformations making them distinct from some of their corresponding surface forms are posited, a natural question arises: how distinct can underlying forms be from surface forms? This is the question of abstractness in phonology. We approach this question first from the perspective of two interrelated, fundamental concepts in phonological theory, phonemes and distinctive features, in Section 2.4, setting the stage for discussion of more specific examples of evidence for different types of abstractness in analyses of phonological patterns in Section 2.5.
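To make the two symbiotic parts concrete, here is a minimal Python sketch of the Kerewe case discussed above; the rule (h becomes p immediately following m) and the forms are taken from Table 2.1 and the surrounding discussion, but the string-based representation and the function names are our own simplifications, not a claim about how such knowledge is actually implemented.

```python
# A minimal sketch (ours) of the generative analysis of Kerewe sketched above:
# a lexicon of single underlying forms plus a transformation turning h into p after m.
# Underlying forms are plain strings and '+' marks a morpheme boundary; this is an
# illustrative simplification, not a claim about the actual format of the lexicon.

LEXICON = {"adorn": "paamba", "hunt": "hiiɡa"}           # underlying verb stems
PREFIXES = {"infinitive": "ku", "1sg habitual": "m",
            "3sg habitual": "a", "imperative": ""}        # underlying prefixes

def h_to_p_after_m(form):
    """The phonological transformation: h becomes p immediately following m."""
    return form.replace("m+h", "m+p")

def surface(stem, category):
    underlying = PREFIXES[category] + "+" + LEXICON[stem]   # e.g. /m+hiiɡa/
    return h_to_p_after_m(underlying).replace("+", "")      # erase the boundary

print(surface("hunt", "1sg habitual"))    # -> mpiiɡa
print(surface("hunt", "infinitive"))      # -> kuhiiɡa
print(surface("adorn", "1sg habitual"))   # -> mpaamba
```

On the allomorph-selection analysis of the Korean nominative suffix, by contrast, the lexical entry itself would list both /i/ and /ka/, and a selection function conditioned by the final segment of the stem, rather than a transformation, would choose between them.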
2.4 Phonemes and features
Much of the question of abstractness in phonology concerns the individual units of speech whose concatenation makes up the pronunciation of lexical items. For example, the speech sounds identically transcribed as k in the Kerewe verb stems heeka ‘carry’ and pekeʧa ‘make fire with stick’ are the same at some level of abstraction; they are just being utilized in different contexts in different lexical items. Precise aspects of the articulation of these two instances of k may well exhibit systematic differences, parallel to the way that the (narrowly transcribed) [th] in [thɪk] ‘tick,’ the [t] in [stɪk] ‘stick,’ and the [ɾ] in [ˈæɾɪk] ‘attic’ differ systematically in English but are also the same, and at the same level of abstraction as the k in Kerewe. The idea that people’s mental lexicons make use of a mental alphabet of abstract speech sounds has a very long tradition in linguistics, dating back at least to Pāṇini’s grammar of Sanskrit, likely written in the 6th–5th century bce.5
5 A much more comprehensive historical context can be gained from Anderson (1985); see in particular pp. 270–276.
Details of individual theories aside, the speech sounds of this abstract mental alphabet are referred to as phonemes. Phonemes are obviously abstractions: they represent groups or categories of speech sounds, abstracting away from many phonetic particulars, such as the systematic and relatively easily perceptible differences between [th], [t], and [ɾ] in the English example above. In this section, we trace the evidence for the psychological reality of phonemes and highlight two prominent views of phonemes: one as an indivisible alphabetic symbol, and the other as a local confluence of more basic units or properties known as distinctive features. Readers interested in a more comprehensive review and discussion of the phoneme are directed to Dresher (2011), who distinguishes three views of the phoneme as an entity: the phoneme as physical reality, the phoneme as psychological concept, and the phoneme as theoretical fiction. Dresher (2011, p. 245) concludes, “Once we abandon empiricist assumptions about science and psychology, there is no obstacle to considering the phoneme to be a psychological entity.” This is an important framing, because it means that any linguistic work describing the nature and content of the phoneme is by necessity a statement about mental representation, in much the same way as the concept of underlying forms discussed in Section 2.3. So the question becomes, how much abstraction or idealization is involved in such a psychological entity? Edward Sapir was the first to present evidence for the psychological reality of phonemes. What is perhaps Sapir’s better-known article, “The Psychological Reality of Phonemes” (Sapir, 1933), was preceded by his article in the first issue of the journal Language, “Sound Patterns of Language” (Sapir, 1925). Together, these two articles establish the perspective that (1) the phoneme is a psychological unit of speech, and (2) the character of the phoneme requires taking into account how it functions in the larger phonological context of the language; it cannot be understood solely from investigation of the articulatory or acoustic properties of its surface realizations. Sapir (1925) argued that languages with the same surface inventory of speech sounds could be organized differently at the underlying, phonemic level. Sapir (1933) argued that the same phoneme-level organization can explain the behavior of native speakers of a language, whether it be with respect to errors they make in the perception or production of speech or in terms of the choices they make when devising or using a novel writing system to represent the speech sounds of their language. Nearly a century later, psychological evidence for phonemes continues to be found and presented in the literature. The first issue of Language in 2020, 95 years after the publication of Sapir’s (1925) article in the very first issue of the same journal, includes an article by William Labov arguing that the regularity of a sound change in Philadelphia is understandable to the extent that speakers of this dialect of English have an abstract mental representation of the front unrounded mid vowel (Labov, 2020). Jumping back in time, Bloomfield (1933) and Bloch (1941), among others, observed that English unstressed vowels reduce to schwa [ə] in many contexts, making [ə] an allophone of every English vowel phoneme.
Thus we have [ˈfoɾəˌɡɹæf] ‘photograph,’ with a primary-stressed [o], an unstressed [ə], and a secondary-stressed [æ], alternating with [fəˈthɑɡɹəˌfi] ‘photography,’ with primary-stressed [ɑ] flanked by unstressed [ə]s. It
follows that phonemic representations of English morphemes must include unreduced vowels only, with reduction to [ə] being due to a phonological transformation, and moreover that the underlying form of many morphemes, such as /fotɑɡɹæf/ ‘photograph,’ will not directly correspond to any of their complete surface manifestations due to the nature of stress assignment in English.6 Phonemic analysis is the set of methods by which a language’s underlying inventory of phonemes is induced from the distribution and behavior of its surface speech sounds, which can of course differ from language to language. Borrowing an example from Hayes (2009, pp. 31–34), English and Spanish have a set of surface speech sounds that can be broadly transcribed as [t d ð ɾ],7 but their distribution and behavior in each language is such that:
a. /ɾ/ is a phoneme distinct from /t/ and /d/ in Spanish, but not in English, where [ɾ] is sometimes a surface realization of /t/ (e.g. [bæt] ‘bat,’ [ˈbæɾə˞] ‘batter’) and other times a surface realization of /d/ (e.g. [sæd] ‘sad,’ [ˈsæɾə˞] ‘sadder’), and
b. /ð/ is a phoneme distinct from /t/ and /d/ in English, but not in Spanish, where [ð] is always a surface realization of /d/ (e.g. [unˈdisko] ‘a record,’ [loz ˈðiskos] ‘the records’).
From the same surface inventory of speech sounds [t d ð ɾ], then, we arrive at a different phonemic inventory in each language: /t d ð/ in English and /t d ɾ/ in Spanish, with [ɾ] a conditioned variant of /t d/ in English and [ð] a conditioned variant of /d/ in Spanish.8 These examples illustrate the basis of an important development in 20th-century linguistics: the idea that the phoneme is not the minimal unit after all, and that instead there are subphonemic units—distinctive features—out of which phonemes are built. Generally speaking, distinctive features are used to describe phonologically relevant phonetic distinctions between speech sounds: those phonetic distinctions that are common to similarly behaving sounds, and those that differ between the conditioned variants of phonemes. In Spanish, for example, the complementary distribution of [d] and [ð] is matched by complementary distribution of [b] and [β] and of [ɡ] and [ɣ], and the members of each of these pairs are related to each other in precisely the same way: [b d ɡ] are voiced stops (vibrating vocal folds and stopping of airflow at three different places of articulation), while [β ð ɣ] are voiced continuants (constriction but not stopping of airflow at corresponding places of articulation).
6 Bloch (1941, pp. 281–283) observed that this represents a potential learning challenge, because arriving at the “right” underlying form for a morpheme requires exposure to a sufficient variety of its surface manifestations.
7 Narrower differences include the precise tongue tip position for the articulation of [t d] (alveolar in English, dental in Spanish) and the degree of constriction of [ð] (more closed in English, more open in Spanish).
8 Whether /d/ or /ð/ is the right representation of this phoneme in Spanish is a matter of some debate. Harris (1969) says /d/, Baković (1994) says /ð/, while Lozano (1979) opts for underspecification of the difference between the two speech sounds.
In English, the related speech sounds [t d ɾ] are all coronals (constriction by the tongue tip/blade); the phonemes /t d/ are both coronal stops, and the conditioned variant [ɾ] is a coronal continuant. The distinctive feature idea was explored further by the Prague School of phonologists, notably Roman Jakobson and Nikolai Trubetzkoy, for whom contrast between phonemes was a core theoretical premise. Analyzing the systematicity of such contrasts required viewing phonemes as possessing features distinguishing each from the others. The particular features necessary to distinguish a given phoneme from the others describe the content of that phoneme. In this sense, [a]ny minimal distinction carried by the message confronts the listener with a two-choice situation. Within a given language each of these oppositions has a specific property which differentiates it from all the others. The listener is obliged to choose either between two polar qualities of the same category [ . . . ] or between the presence and absence of a certain quality [ . . . ]. The choice between the two opposites may be termed distinctive feature. The distinctive features are the ultimate distinctive entities of language since no one of them can be broken down into smaller linguistic units. The distinctive features combined into one simultaneous or [ . . . ] concurrent bundle form a phoneme. (Jakobson, Fant, and Halle, 1952, p. 3)
Phonological features were intended to be the cognitive connection between the articulatory and perceptual speech systems. [T]he distinctive features correspond to controls in the central nervous system which are connected in specific ways to the human motor and auditory systems. In speech, perception detectors sensitive to the property detectors [ . . . ] are activated, and appropriate information is provided to centers corresponding to the distinctive feature[s] [ . . . ]. This information is forwarded to higher centers in the nervous system where identification of the utterance takes place. In producing speech, instructions are sent from higher centers in the nervous system to the different feature[s] [ . . . ] about the utterance to be produced. The features then activate muscles that produce the states and configurations of different articulators[.] (Halle, 1983, p. 95)
Over a quarter century later, the same idea informs the neurolinguistics and biolinguistics literatures. The [ . . . ] featurally specified representation constitutes the format that is both the endpoint of perception—but which is also the set of instructions for articulation. (Poeppel and Idsardi, 2011, p. 179) Features serve as the cognitive basis of the bi-directional translation between speech production and perception, and are part of the long-term memory representation
for the phonological content of morphemes, thus forming a memory-action-perception loop [ . . . ] at the lowest conceptual level. (Volenec and Reiss, 2017, p. 270)
Just as a phonemic analysis of a language’s phonological patterns reveals its phonemic inventory, so does a featural analysis reveal its distinctive feature inventory. A phoneme consists of an individual combination of distinctive feature values, but there will be some particular distinctive feature value combinations that do not correspond to phonemes in a given language. For example, the phonemic inventory of Russian includes voiceless oral stops /p t k/ and voiced oral stops /b d ɡ/ at the same three places of articulation (labial, coronal, and dorsal), but nasal stops /m n/ at only two of these (labial and coronal)—meaning that the combination of distinctive feature values describing a dorsal nasal stop /ŋ/ does not correspond to a phoneme in Russian. A phonemic inventory is often presented as a stand-alone entity in the context of a phonological analysis, but phonologists recognize restrictions on the distributions of individual phonemes (see, e.g., Hall, 2013). For example, some argue /ŋ/ is a phoneme of English, but it is only found word-finally (e.g. [sɪŋ] ‘sing’), before word-level suffixes (e.g. [ˈsɪŋə˞] ‘singer,’ [ˈsɪŋɪŋ] ‘singing’), or when followed by /k/ or /ɡ/ ([sɪŋk] ‘sink,’ [fɪŋɡə˞] ‘finger’). Similarly, /ð/ is a phoneme of English, but its distribution is heavily restricted, being found at the beginning of a handful of function words, mostly determiners ([ðɪs, ðæt, ðiz, ðoz, ðɛm, ði, ðaɪ, ðə, ðɛn] ‘this, that, these, those, them, thee, thy, the, then’), in a handful of words ending in [ðə˞] ([ˈʌðə˞, ˈbɑðə˞, ˈwɛðə˞, ˈfɛðə˞, ˈmʌðə˞, ˈbɹʌðə˞, ˈfɑðə˞] ‘other, bother, weather, feather, mother, brother, father’), and at the end of a handful of verbs, mostly denominal ([bɹið, beɪð, ʃið, ɹaɪð] ‘breathe, bathe, sheathe, writhe’). Conditioned variants of phonemes by definition also have restricted distributions. The distinction in Russian between voiced and voiceless obstruents (a class that includes the oral stops noted above) is found only before sonorants, and is otherwise neutralized in agreement with following obstruents (voiced before voiced, voiceless before voiceless) or to voiceless word-finally. These restrictions on the distributions of phonemes and of their conditioned variants, and those on the recombination of distinctive features mentioned earlier, provide fodder for the kinds of abstract analyses that command particular attention in phonology, to which we now turn.
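As a rough illustration of the idea that a phoneme is a combination of distinctive feature values, and that some combinations correspond to no phoneme of a language, the following Python sketch (ours) encodes the Russian stop inventory just described; the feature names are informal shorthand of our own choosing rather than a commitment to any particular feature theory.

```python
# A rough sketch (ours) of phonemes as bundles of distinctive feature values, using
# the Russian stop inventory discussed above. Feature names are informal shorthand.

RUSSIAN_STOPS = {
    "p": {"place": "labial",  "voiced": False, "nasal": False},
    "t": {"place": "coronal", "voiced": False, "nasal": False},
    "k": {"place": "dorsal",  "voiced": False, "nasal": False},
    "b": {"place": "labial",  "voiced": True,  "nasal": False},
    "d": {"place": "coronal", "voiced": True,  "nasal": False},
    "ɡ": {"place": "dorsal",  "voiced": True,  "nasal": False},
    "m": {"place": "labial",  "voiced": True,  "nasal": True},
    "n": {"place": "coronal", "voiced": True,  "nasal": True},
}

# One combination of feature values corresponds to no phoneme of the language:
# the feature bundle of a dorsal nasal stop ([ŋ]) is missing from the inventory.
dorsal_nasal = {"place": "dorsal", "voiced": True, "nasal": True}
print(dorsal_nasal in RUSSIAN_STOPS.values())   # -> False: the /ŋ/ gap

# Feature bundles also pick out natural classes, e.g. the voiced oral stops:
print([s for s, f in RUSSIAN_STOPS.items() if f["voiced"] and not f["nasal"]])
# -> ['b', 'd', 'ɡ']
```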
2.5 How abstract are mental representations?
The question posed in this section title mainly lurks under the surface of modern debates in phonological theory, such as the extent to which phonology can be reduced to physiological principles governing articulation and perception (Ohala, 1981, 1997;
Hale and Reiss, 2000; Hayes, Kirchner, and Steriade, 2004; Blevins, 2004; Heinz and Idsardi, 2013; Reiss, 2018). The question was asked more directly by Kiparsky (1968), as betrayed by this classic paper’s title: How abstract is phonology? The paper addresses concerns about early work in generative phonology, which admitted the possibility of many transformations applying in crucial sequence from underlying to surface representations such that the specifications of a given underlying representation could be quite a bit different from those of its eventual surface representation.9 The concerns were about the possibility of “excessive abstractness,” though different types of abstractness and how one is to measure them with respect to one another were often more matters of opinion than principle. As discussed further below, Kiparsky’s proposal was not so much a line drawn in the sand as it was the curtailment of one particular type of abstractness. His and subsequent theoretical proposals have placed limits on the possible types of differences that may hold between an underlying representation and its various surface manifestations, but they have not somehow placed limits on the distance between these representations. The form of the evidence for a relatively abstract phonological analysis is typically one where what otherwise appears to be the same phoneme exhibits two distinct forms of behavior: one expected (call this one A) and one unexpected (call this one B). The more strands of evidence of this sort that exist in any given analysis, the more compelling the case for it; the analyses of Yokuts (Kisseberth, 1969), Nupe (Hyman, 1970), and Dida (Kaye, 1980) are particularly compelling cases. The unexpected behavior of B can be approached in one of two ways, both being abstract in some sense. One approach is to posit that A and B are indeed phonemically identical, but that there is some abstract non-phonological diacritic marking X, present in forms with B but not in those with A, to which relevant phonological transformations are sensitive. (X may but need not have other functions in the language.) We call this the abstract diacritic approach. Kiparsky refers to the other approach as “the diacritic use of phonological features,” and it is instantiated in one of two basic ways. One instantiation is to directly posit that B is a phoneme distinct from A in terms of some feature such that the combination of distinctive features describing B does not surface, either anywhere in the language or in the phonological contexts or lexical items in question. Relevant phonological transformations sensitive to the A/B distinction apply, and this distinction is subsequently neutralized to A. The other instantiation is to posit that A and B are phonemically identical but that there is some other phoneme C, co-present with B but not with A, to which relevant phonological transformations are sensitive. C is either subsequently deleted or neutralized with some other phoneme. We call both of these instantiations the abstract phoneme approach.
9 Or very similar in some respects, by way of a non-spurious there-and-back-again sequence of transformations known as a Duke of York derivation (see Pullum, 1976; McCarthy, 2003; Baković, 2013 for discussion).
Abstract phonemes can be either absolute or restricted. An absolutely abstract phoneme—or an “imaginary segment” (Crothers, 1971)—is a combination of distinctive feature values posited as a phoneme in some set of morphemes but that is not realized as such in the surface forms of any morphemes in the language. An example appears in an analysis of exceptions to vowel harmony in Hungarian (Vago, 1976); some stems with the neutral front vowels [i iː eː] unexpectedly condition backing of suffix vowels, motivating the postulation of abstract back vowel phonemes /ɯ ɯː ɤː/ in these stems that are eventually and uniformly fronted. A restrictedly abstract phoneme (O’Hara, 2017) is a combination of distinctive feature values posited as a phoneme in a restricted set of contexts but that is not realized as such in any surface contexts corresponding to that set. An example appears in an analysis of exceptions to vowel coalescence in Mushunguli (Hout, 2017): some stems beginning in the high vowels [i u] unexpectedly block otherwise regular coalescence with preceding /a/, motivating the postulation of glide phonemes /j w/ that never surface before [i u], respectively, but that do surface in other contexts. These glides block coalescence and are subsequently deleted just before [i u]. All abstract phoneme analyses crucially rely on the possibility of opaque interactions between phonological transformations (Kiparsky, 1973). In each of the examples sketched above, the transformation that voids the representation of its abstract phoneme crucially must not apply before application of the relevant transformation(s) sensitive to the abstract phoneme.10 In some cases, the abstract phoneme is needed in the representation to prevent the application of an otherwise applicable transformation, as in the case of Mushunguli. These involve the type of opaque interaction known as counterfeeding. In other cases, the abstract phoneme is needed in the representation to ensure the application of an otherwise inapplicable transformation, as in the case of Hungarian. These involve the type of opaque interaction known as counterbleeding. Abstractness is thus intimately intertwined with opacity: to the extent that a theoretical framework (dis)allows opaque interactions, it (dis)allows abstract phoneme analyses. Kiparsky’s (1968) proposed principle was meant to ensure that every phoneme specified as a particular combination of distinctive features in a given morpheme’s underlying form will be realized with that precise set of specifications in at least one surface form of that morpheme.11 This effectively excludes abstract phoneme analyses such as those sketched above—but in each of these cases the alternative is to rely instead on diacritic marking, which is, as already noted, also a form of abstractness. More recently, O’Hara (2017) specifies a MaxEnt learning model that assigns sufficiently high probability to a restrictedly abstract phoneme analysis (based on a case in Klamath), and
10 We write ‘must not apply before’ as opposed to ‘must apply after’ because these opaque interactions can be had with simultaneous as opposed to ordered application of phonological transformations (Anderson, 1974; Kenstowicz and Kisseberth, 1979; Joshi and Kiparsky, 1979, 2006; Kiparsky, 2015). See also Baković and Blumenfeld (2017) for discussion.
11 We put aside the precise formulation of Kiparsky’s principle, as well as of its ensuing revisions (Mascaró, 1976; Kiparsky, 1982, 1993), focusing instead on its intended function.
significantly lower probability to both absolutely abstract phoneme and abstract diacritic alternatives. It is thus reasonably clear that a hypothetical abstract-o-meter (recall Figure 2.3) would place absolutely abstract phonemes closer to the “too abstract” end of the spectrum than restrictedly abstract phonemes, but where abstract diacritics fall on that spectrum is a largely unresolved matter. We must thus conclude here that “a phonological analysis, independently of its ‘degree’ of abstractness, is (only) as adequate as the motivation and evidence that can be produced in favor of it and against substantive alternatives” (Baković, 2009, p. 183). This section and the last focused on the issue of phonological abstractness in the mental lexicon at the level of the phoneme and of the distinctive feature. This is primarily because most relevant research has been focused here, with the possible exception of the syllable (Goldsmith, 2011; Strother-Garcia, 2019). However, it has been argued that other representational levels and systems also play a role in the mental representations of words, notably tonal representations (Yip, 2002; Jardine, 2016) and metrical stress (Hayes, 1995; van der Hulst, 2013). These representations have also been the subject of intense study, including in the psycholinguistic and neurolinguistic literature, to which we turn in the next section.
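To make the role of rule ordering in the counterfeeding and counterbleeding interactions described above concrete, here is a schematic Python sketch (ours) of a counterbleeding interaction in the spirit of the Hungarian-style analysis; the stem and suffix forms are invented toy data, not actual Hungarian, and the two rules are deliberately oversimplified.

```python
# A schematic, toy sketch (ours) of a counterbleeding interaction of the kind invoked
# in the Hungarian-style analysis above. The forms are invented, not real Hungarian;
# vowels are single characters and the two rules are deliberately oversimplified.

BACK, FRONT = set("auoɯ"), set("ie")

def harmony(stem, suffix_alternants):
    """Vowel harmony: the suffix agrees in backness with the last stem vowel."""
    back_form, front_form = suffix_alternants
    last_vowel = [c for c in stem if c in BACK | FRONT][-1]
    return stem + (back_form if last_vowel in BACK else front_form)

def neutralize(form):
    """Absolute neutralization: the abstract back vowel ɯ is fronted to i everywhere."""
    return form.replace("ɯ", "i")

stem = "pɯd"              # hypothetical stem with an abstract back vowel
suffix = ("nak", "nek")   # hypothetical back ~ front suffix alternants

# Counterbleeding order: harmony applies before the neutralization that would bleed it,
# so a stem with a surface front vowel unexpectedly takes the back suffix alternant.
print(neutralize(harmony(stem, suffix)))    # -> pidnak
print(harmony(neutralize(stem), suffix))    # -> pidnek (the transparent order)
```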
2.6 Types of evidence
The preceding sections discussed phonological abstractness from the perspective of structural linguistic arguments drawn from the typology of spoken language. However, it is often claimed that structural arguments, constituting only one type of evidence, are far from definitive proof. This is especially true for the psychological reality of these abstract forms. Ohala (1974) makes this point:
It seems to me that the important question should not be whether phonology is abstract or concrete, but rather what evidence there is for the psychological reality of a particular posited underlying form. If our aim is simply a descriptive one, then of course abstract forms can be posited without any further justification than the fact that they make it easier to state certain regularities. However, if we are interested in reflecting the competence of a native speaker, then we must provide evidence that what we are claiming as part of a speaker’s knowledge is indeed such. (Ohala, 1974, p. 234)
What other types of evidence are there? This section describes various other types of evidence bearing on the question of abstractness. The first type of evidence comes from expanding the typology to consider non-spoken language, that is, signed and tactile language. This evidence is still structural but provides an important window into the lexicon. How dependent is the long-term memory representation of a word on the physical system that externalizes it? At the same time, Ohala and many others take
behavioral and psychological tests as providing additional necessary, if not sufficient, evidence for the reality of these forms. Recent advances in neuroimaging, and its increasing ease of use, allow for intricate looks into the biological underpinnings of phonological representations, which can be combined with computational models and simulations. However, the literature on all of these topics is vast, especially with regard to experimental work. Here we provide a sample of work that bears on the questions described earlier.
2.6.1 Evidence from signed and tactile phonology
An important source of evidence for the existence and nature of phonological abstraction comes from languages without a spoken modality—namely, signed and tactile languages. Sign languages arise spontaneously in Deaf communities, are acquired during childhood through normal exposure without instruction, and exhibit all of the facets and complexity found in spoken languages (see Sandler and Lillo-Martin, 2006 for a groundbreaking overview). However, Sandler (1993) argues that if human language evolved without respect to modality, we should find hearing communities that just happen to use sign language rather than spoken language, and we do not. Sign language, she argues, is thus “an adaptation of existing physical and cognitive systems for the purpose of communication among people for whom the auditory channel is not available.” Sign languages offer, as Sandler (1993) puts it, “a unique natural laboratory for testing theories of linguistic universals and of cognitive organization.” They give insight into the contents of phonological form, and conditions on which aspects of grammar are amodal and which are tied to the modality. They also offer unique opportunities to study the emergence of phonology within a developing grammar (Goldin-Meadow, 2005; Senghas, Kita, and Ozyürek, 2004; Marsaja, 2008; Sandler, Meir, Padden, and Aronoff, 2005). One crucial contribution of non-spoken phonology is that it switches the issue of “how abstract is phonology?” to “where does abstraction lie, and to what extent is it independent of the modality?” There are generally two directions answers can take. To the extent that a given phonological form or constraint is present across modalities, one may make the case that it is truly abstract, in the sense that it exists without regard to the articulatory system which realizes it. On the other hand, to the extent that a given form or constraint differs, one can ascribe that difference to the modality, to the nature of the articulatory system. In this way, non-spoken phonology provides nuance into the relationship between the pressures of abstraction and the pressures of realization in the mental lexicon. Sign languages, despite their rich history, were almost totally dismissed as natural and structured languages until the latter half of the 20th century (see van der Hulst, to appear for the history of sign phonology). Stokoe (1960) divided signs into meaningless chunks, showing that sign languages display both morpho-syntactic and phonological levels, a “duality of patterning” or “double articulation” considered previously as a unique property of spoken languages (Martinet, 1960; Hockett, 1960). Stokoe’s phonological system
specified abstract representations for parts of the sign: the handshape, the movement of the hand, and the location in front of or on the body. van der Hulst notes that Stokoe’s division of signed forms was designed for transcription, but he regarded these symbols as abstract representations of the discrete units characterizing the signs. Much work on the psychological reality of these phonological abstractions in sign came from the Salk Institute (see Klima and Bellugi, 1979 for an accessible overview, and Emmorey, 2001 for more recent developments). Studying production errors, they demonstrated compositionality in both perception and production of sign forms. The Bellugi and Klima group showed that cross-linguistically, signers make acceptability judgments about what they consider well-formed or ill-formed, evidence that they possess intrinsic knowledge of how these smaller units can be combined. As van der Hulst (to appear) notes, Klima and Bellugi’s (1979) accessibility and cognitive scope convinced many linguists and non-linguists of the importance of sign language, and of sign linguistics as a proper field of study, in ways Stokoe’s analysis was unable to. These works galvanized the phonological study of signs, which increasingly focused on sequential aspects of signs and moved away from the simultaneous aspects that Stokoe had emphasized (Liddell and Johnson, 1989; Liddell, 1984; Newkirk, 1981; Supalla and Newport, 1978). Researchers discovered, among other evidence, the salience of the beginning and end points of the movement of signs for inflectional purposes where referents are marked by discrete spatial locations, as well as phonological processes like metathesis, a switch in the beginning and end point of the movement (see van der Hulst and van der Kooij, 2021 for discussion). The notion of both simultaneous and compositional structure in a sign, as well as sequential structure, raises a big question: how modality-dependent are these properties? Languages in both modalities have sequential and simultaneous structure but exhibit differences in the relative centrality of such structure. Spoken languages vary in syllable structure, word length, and stress patterns among syllables. Sign languages appear limited in all these aspects. They are overwhelmingly monosyllabic, have no clusters, and show extremely simple stress patterns, due to few polysyllabic words apart from fully reduplicated forms (see Sandler and Lillo-Martin, 2006 for general discussion, and Wilbur, 2011 for discussion of signed syllables). A further complication arises from the fact that sequential phonological structure in sign language arises mostly from morphosyntactic operations concatenating morphemes and words (e.g. affixation, compounding, and cliticization; Aronoff, Meir, and Sandler, 2005). In general, sequential affixation is rare across sign languages (Aronoff et al., 2005), and sign exhibits a strong tendency to express concatenative morphology through compounding (Meir, 2012). Aronoff et al. (2005) show that affixation usually emerges from the grammaticalization of free words, through a series of diachronic changes affecting both phonological and semantic factors. They cite the relative youth of sign languages as a major factor in their lack of affixes. No known sign languages are over 300 years old, with some like Nicaraguan Sign Language as young as 40 (Woll, Sutton-Spence, and Elton, 2001), and many others in development.
The curious lack of sequential structure in sign languages does not imply structural degeneracy or simplicity, however. Sign languages routinely demonstrate
nonconcatenative morphology (Sandler, 1989; Meier, 2002), incorporating morphological material simultaneously in the phonology alongside the restricted sequential form. Simultaneous phonological structure exists in all languages but differs across modalities in the amount. For example, while the simultaneous “autosegmental” representations for tone or harmony patterns (Goldsmith, 1976) typically consist of one or two features, the autosegmental representation of hand configuration alone in sign language contains around half of the distinctive features comprising a sign organized in an intricate feature geometry (van der Hulst, 1995; Sandler, 1996). Such tradeoffs in sequential and simultaneous centrality have been argued to stem from a computational restriction that may be realized via different representations in different modalities (Rawski, 2017). This converging line of evidence suggests that the phonological grammar may leverage the representational abilities of the particular articulatory/perceptual system. Brentari (2002) and Emmorey (2001) argue that visual perception of signs (even with sequential properties) is more “instantaneous” than auditory speech perception. This leads van der Hulst and van der Kooij (2021) to adapt Goldsmith’s (1976) division of phonology in terms of the notions of “vertical and horizontal slicing of the signal.” They state:
an incoming speech signal is first spliced into vertical slices, which gives rise to a linear sequence of segments. Horizontal slicing then partitions segments into co-temporal feature classes and features. In the perception of sign language, however, the horizontal slicing takes precedence, which gives rise to the simultaneous class nodes that we call handshape, movement, and place. Then, a subsequent vertical slicing of each of these can give rise to a linear organization.
Perhaps the most intriguing evidence for or against abstractness comes from studying the degree to which phonetics and phonology differ across modalities. Lillo-Martin (1997) cites Blakemore’s (1974) result that exposure to vertical and horizontal lines in the environment affects development of feline visual perception, and asks “why shouldn’t exposure to the special acoustic properties of the modality affect perception, especially auditory perception?” Sandler and Lillo-Martin (2006) note, for example, that unlike spoken syllables in many languages, sign syllables prohibit location clusters analogous to consonant clusters, as well as diphthong-like movement clusters, and sign physiology requires movement between locations. They additionally note that sign syllables do not have onset-rhyme asymmetries, which affects syllable structure and stress assignment. Many more such differences have been studied, and further work in this area will bring important issues to bear on the nature of abstract representations in and across articulatory systems. The similarities and differences in phonological abstraction across modalities mean that signed languages continue to play an important role as evidence of abstraction in the lexicon. This holds equally true for language expressed by the DeafBlind through the tactile modality, often called tactile or pro-tactile sign languages (see Edwards, 2014 for a recent phonological analysis).
2.6.2 Psycholinguistic and neurolinguistic evidence
As mentioned, behavioral testing has long been argued to provide necessary evidence for the mental reality of phonological abstraction, in addition to typological evidence. The introduction and improvement of neuroimaging methods enabled correlations between behavioral tasks and the gross neural excitation levels associated with them. In addition, recent simulation tools allow for modeling of phonological representations in an idealized in silico setting. Here we provide an overview of several behavioral and neural results bearing on the organization of the lexicon into abstract phonemes defined by their salient features, as well as work on the temporal abstraction of speech. One salient question concerns experimental evidence for abstract phonemic and featural representations of words. For example, in Russian, the sounds [d] and [t], which featurally differ in voicing, are contrastive members of different phonemes. In Korean, these sounds do not contrast and are members of a single phoneme. Kazanina, Phillips, and Idsardi (2006) used magnetoencephalography (MEG), a neuroimaging method that tracks the time course of gross, large-scale neural activity, to show that Russian and Korean speakers react differently to these sounds. The Russian speakers separated the sounds into two categories corresponding to /t/ and /d/. On the other hand, the Korean speakers did not separate the sounds, again corresponding to the analysis of a single underlying phoneme. From this result, the authors conclude that both phonetic and abstract phonemic analyses necessarily shape the perceptual analysis of speech sounds. There is much evidence supporting a vast neuronal ensemble for phonological representations in speech production and perception (see Eickhoff, Heim, Zilles, and Amunts, 2009; Jueptner and Krukenberg, 2001 for an extensive overview). In particular, various portions of the superior temporal sulcus are suggested to encode the phonological representations discussed in this chapter. Scharinger, Idsardi, and Poe (2011) used a combination of MEG imaging and statistical modeling to map the entire vowel space of a language (Turkish) onto three-dimensional cortical space, organized by lateral-medial, anterior-posterior, and inferior-superior axes. Their statistical model comparisons showed that, while cortical vowel maps do reflect acoustic properties of the speech signal, articulator-based and featural speech sound information “warps the acoustic space toward linguistically relevant categories.” Scharinger, Monahan, and Idsardi (2012) used MEG to localize three vowel feature variables (height, frontness, and roundness) to the superior temporal gyrus. Mesgarani, Cheung, Johnson, and Chang (2014) used cortical electrode placement to show that the superior temporal sulcus encodes a “manner of articulation” parameter of speech sounds. Intriguingly, different electrodes responded selectively to stops, sibilant fricatives, low back vowels, high front vowels and a palatal glide, and nasals, respectively. Bouchard, Mesgarani, Johnson, and Chang (2013) showed similar results, that the superior temporal gyrus encodes a “place of articulation” parameter, confirming labial, coronal, and dorsal place features across various manner classifications. These results match with Hickok and Poeppel’s (2007) hypothesis that “the crucial portion of the STS
that is involved in phonological-level processes is bounded anteriorly by the most anterolateral aspect of Heschl’s gyrus and posteriorly by the posterior-most extent of the Sylvian fissure.” Additionally, recent experimental evidence using aphasic patients even supports the existence of abstract phonological rules in processing. Linguistic analysis posits that the English words pit and spit both contain the segment /p/ in their underlying representations. The surface representation of pit has aspirated [ph], because the /p/ is in word-initial position, while the surface representation of spit has unaspirated [p], because it is preceded by /s/. Buchwald and Miozzo (2011) constructed an experiment using the productions of two aphasic patients who were unable to produce an /s/ in relevant consonant clusters like /sp/ or /st/, and compared them with correctly produced consonants. They wanted to test whether an aphasic would aspirate the /p/ (marking phonological fricative-deletion) or not (the fricative deleted after successful application of the phonological rule). To analyze it, they compared the voice onset time (VOT) of the two patients on the two instances. VOT provides an acoustic measure of the relative aspiration of the consonant by measuring how much the following voicing is delayed. One patient had a long VOT ([ph]) while the other had a short VOT ([p]), confirming the divide between two distinct levels of phonological and phonetic influences in processing. Follow-up work by Buchwald and Miozzo (2012) showed similar results for nasal consonants that deleted in clusters like /sn/ and /sm/. The conclusion drawn from these studies points to abstraction in the units being processed, mentally divorced from their phonetic realization but ultimately driving it. Apart from the biological underpinnings of the atomic representations characterizing speech units, there is also much work focused on the biological underpinnings of temporal abstractions in speech. Specifically, much of this work focuses on the insula, basal ganglia, and cerebellum, where temporal information is speculated to be filtered in a cortical-subcortical loop for the purposes of motor planning (Eickhoff et al., 2009). In this process, motor sequence plans are filtered through the basal ganglia, while the cerebellum converts the sequences into fluent, temporally distributed articulations. Ackermann, Mathiak, and Riecker (2007) support this by describing drastic negative effects on speech production consistent with damage to the cerebellum. While neuroimaging methods brought many advantages, they simultaneously introduced a uniquely thorny issue to phonological abstraction, which Poeppel (2012) divides into the “Maps Problem” and the “Mapping Problem.” The Maps Problem concerns descriptive analysis of the behavioral and neural underpinnings of mental representations, say, the effects and brain areas associated with a particular phonological phenomenon. The Mapping Problem concerns how to take a particular mental representation, perhaps known to correlate with some behavioral entity or brain network, and mechanistically connect it to neuronal function. Neither the Maps Problem nor the Mapping Problem has easy answers (Buzsaki, 2019), and decades of work have led to many open and nuanced questions. Attempting to address the Mapping Problem, some work seeks a somewhat more explanatory approach to the forces underlying the temporal segmentation of the signal
produced and perceived. One emerging insight is that the perceptual system divides an incoming auditory stream into two distinct time windows (Giraud and Poeppel, 2012; Chait, Greenberg, Arai, Simon, and Poeppel, 2015). What phonological entities do they map to? As Poeppel and Idsardi (2011) put it, there are [t]wo critically important windows that appear instantiated in spoken languages: segments and syllables. Temporal coordination of distinctive features overlapping for relatively brief amounts of time (10–80 ms) comprise segments; longer coordinated movements (100–500 ms) constitute syllabic prosodies. (Poeppel and Idsardi, 2011, p. 182)
A more fundamental question concerns the neural mechanism that drives these windows and their coordination. Oscillatory activity within and between neural populations has been posited (Giraud and Poeppel, 2012). Neural populations that comprise a certain type of neuron may show a stable neural oscillation at certain frequency bands, which varies depending on their excitatory and inhibitory properties (see Buzsaki, 2006 for an accessible overview). Evidence suggests that pyramidal interneuron gamma oscillations, as well as theta oscillations, comprise the segmental vs. syllabic time distinction. These two oscillations funnel an incoming speech signal into time windows of different sizes, computationally represented by the waveform of the population activity. In silico modeling work reveals another interesting property. While these oscillations do track the signal, a crucial feature is that rhythms of distinct frequencies show specific coupling properties, termed cross-frequency coupling (Hyafil, Giraud, Fontolan, and Gutkin, 2015b). Briefly, this means that the stable oscillatory populations innervate each other, allowing different timescales to track one another and efficiently parse a temporally complex signal. Specifically for speech perception, Hyafil, Fontolan, Kabdebon, Gutkin, and Giraud (2015a) showed that when a network exhibiting gamma oscillations was coupled to a network showing theta oscillations, it was able to segment a corpus of phonological words much more effectively than a network where this coupling was absent. These results reflect an explosion of work using neurolinguistic and psycholinguistic tests to describe the sorts of representations speakers have. The intersection of experimental results with theory promises many new insights into the mental content of the lexicon (see Poeppel and Sun, this volume). For further discussion of the particular biological substrate underlying phonological abstraction and how it impacts the phonetics-phonology interface, see Volenec and Reiss (2017).
2.7 Conclusion
In this chapter, we have endeavored to motivate and review a central idea of modern generative phonology: that the fundamental representational units of words in languages are abstract and psychologically real. In particular, the systematic patterning
of the pronunciation of morphemes motivates an abstract mental representation of a morpheme's pronunciation. These underlying forms are largely regarded as sequences of phonemes, which are themselves abstractions, and which are organized along featural dimensions. These abstract mental representations find support not only from the patterning in morpho-phonological paradigms but also from language change, from sign language linguistics, and from psycholinguistic and neurolinguistic study. There are many open and fascinating questions regarding the nature of abstract mental representations of words. Many are among the most basic and fundamental. How abstract can they be? How are they learned? How are they realized in the brain? The fact that we simultaneously know both so much and so little about phonological abstractness in the mental lexicon sends a clear message that this will continue to be a fertile and exciting area of research for many years to come. To conclude this chapter, we can do no better than to repeat the concluding sentences of Labov (2020, p. 57, emphasis added): "We have a common ground in our understanding of what it means to know a language. It involves knowing a vast number of particular things. But at bottom it is reaching down to something very deep, very abstract, and very satisfying."
Acknowledgments
We thank Aniello De Santo, Charles Reiss, Veno Volenec, and two anonymous reviewers for valuable feedback on earlier drafts. And we dedicate this chapter to Lila Gleitman, whose pioneering work on linguistic abstraction, especially for language learning, has had a major impact on our thinking.
Chapter 3
Phonological variation and lexical form
Ruaridh Purse, Meredith Tamminga, and Yosiane White
3.1 What is phonological variation? The mental lexicon is where we store our knowledge of the words in our language. A reasonable starting point is to think of entries in the mental lexicon as form–meaning pairs. The lexical entry for the word cat, for example, pairs (a) some semantic information about a household pet that meows (meaning), with (b) some information about the speech sounds used to refer to it (form). As a first pass, we could say that this form information is stored as a string of phonemes: /k æ t/.1 The form side of this pairing is what allows speakers to externalize meaningful messages to the people around them and allows listeners to retrieve the intended meanings in turn. But a string of phonemes is, of course, an abstraction: what comes out of a speaker’s mouth is sound waves shaped by articulatory gestures, and what a listener encounters is a continuous and complex acoustic signal. The physiological demands of this phonetic implementation mean that no two instances of a word in real speech are ever exactly the same, an observation known as the lack of invariance problem. Both speakers and listeners face the challenge of connecting a word’s abstract lexically stored form with the continuous and multidimensional space of the phonetic implementation. Often, however, words surface with multiple forms in ways that cannot be explained by the physiological demands of speech production. In the basic case, accounting for these forms is the domain of phonology. For example, words sometimes appear to change form when they are combined with certain suffixes. Consider the English word confess, which can be combined with the suffix -ion to make confession. The stem confess 1 Phonemes are the distinctive sound units of a language; see Bacović et al. (this volume). These symbols are from the International Phonetic Alphabet, a system of representing speech sounds in writing.
34 Ruaridh Purse, Meredith Tamminga, and Yosiane White shows up in different forms: confession has a palatal [ʃ] where confess has an alveolar [s]. Do English speakers store both forms in their lexicon and know to choose the [ʃ] form with the -ion suffix? Baković et al. (this volume) lay out the standard arguments that many linguists give for saying no, and instead analyzing the [ʃ] as an allophone (a predictable pronunciation alternant) derived by phonological rule rather than stored in the lexicon. A single rule can capture the generalization that [s] becomes [ʃ] in other stems that combine with -ion2 (e.g., express/expression, compress/compression, and so on). On this view, phonological rules intervene between the lexicon and the phonetic implementation, editing the target segments that go on to be articulated in speech. This is not the only available model of the relationship between the lexicon, the phonology, and the phonetics,3 but because it is a widely accepted framework, we will build our discussion around it. This chapter is about phonological variation. But the aforementioned phonological alternation between [ʃ] and [s]is not generally referred to as phonological variation4 because the rule is obligatory whenever the linguistic conditions that trigger it arise. Thus, the [ʃ] allophone is fully predictable from the linguistic environment. Pronouncing confession as [kənfɛsjən] (without having applied the rule) is simply not a well-formed option for English speakers. What, then, is phonological variation? First, while many phonological rules are obligatory, there are also cases where seemingly similar alternations are not fully predictable. To continue our current example, the phrase impress you can be pronounced as either [ɪmprɛsju] or [ɪmprɛʃju] in connected speech. The two forms look suspiciously like the input and output of our [ʃ]-deriving phonological rule, but in this case the speaker has a choice between the options. To give a more intuitively familiar example, words ending in unstressed /ɪŋ/can optionally be pronounced with [ɪn]: morning~mornin’,5 pudding~puddin’, jumping~jumpin’, hypothesizing~hypothesizin’, and so on. The same logic that led us to posit a general rule capturing the obligatory pattern of [s] alternating with [ʃ] might lead us to conclude that this optional variability in [s]~[ʃ] and [ŋ]~[n] is also the product of phonological rules—just not obligatory ones. This intraspeaker variation, where a given speaker may say the same thing in different ways, is one kind of phonological variation that we will cover in this chapter. The choice a speaker has between the different options is called a variable, and the options themselves are called variants. Phonological variables are often influenced by social and situational factors; for example, most English speakers will share the intuition that mornin’ is a more casual way of saying morning. However, quantitative 2
More precisely, a rule that palatalizes coronals before /j/-initial suffixes.
3 A class of prominent alternatives is usage-based approaches to phonology, such as Exemplar Theory, in which episodic traces or "exemplars," prototypically word-level exemplars, are stored in memory and form the basis for the emergence of phonological categories or generalizations. The possibility of "hybrid" abstractionist/episodic frameworks has attracted increasing attention in recent years. See Pierrehumbert (2002), Pisoni and Levi (2007), and Hay (2018) for overviews.
4 Even though colloquially the words "vary" and "alternate" seem to mean approximately the same thing.
5 We use the ~ notation to mean "varies with."
Phonological variation and lexical form 35 sociolinguistic research supports the premise of inherent variability (Weinreich, Labov, and Herzog, 1968): that variant choice is not fully predictable, even with a hypothetically exhaustive understanding of an utterance’s social context. In addition to intraspeaker phonological variation, phonology can also differ across speakers, even when they are nominally speaking the same language. We refer to these differences as interspeaker variation. Speakers can differ from each other in many ways, but we will give special attention to interspeaker variation that involves phonological structure and lexical form. For example, two speakers may have different phonological rules in their grammars, different stored forms in their lexicon, or different phonemic inventories (the set of distinctive sounds in a language). One familiar source of interspeaker variation that we discuss at some length in Section 3.1 is regional dialects. Interspeaker variation may also reflect other aspects of a language user’s background, such as gender, class, or race. From the point of view of language production, one might think that interspeaker variation need not be treated as variation at all: an American English speaker is probably never going to entertain the option of pronouncing the word got with a retroflex consonant, [ɡɑʈ], the way an Indian English speaker might.6 But from the perspective of language comprehension, differences across speakers are a major contributor to the phonological variation in the input that listeners must accommodate. Both intra-and interspeaker phonological variation can create a range of mismatches between real utterances and the stored forms in the lexicon. These mismatches pose a substantial challenge for lexical processing, including processes of word recognition (see Magnuson and Crinnion, this volume), word production (see Kilbourn-Ceron and Goldrick, this volume), and word learning (see Creel, this volume). In addition to its processing consequences, phonological variation raises new questions of representation. We have already pointed out that there are some parallels between obligatory phonological rules and intraspeaker phonological variables, but it does not necessarily follow that a non-obligatory phonological rule is the right analysis for any given variable; in some cases, there might be reason to believe that the options are stored in the lexicon, or arise in the phonetic implementation. Given the complexity of the challenges posed by phonological variation, it is unsurprising that models of the mental lexicon have largely set aside variable phenomena and have instead developed on the basis of invariant forms of words in isolation. However, as many other authors in this volume point out, variation is one of the major hurdles standing in the way of modeling the comprehension and production of words in their real-world context of continuous speech. As psycholinguists set their sights on increasingly realistic and dynamic models of how the mental lexicon is structured and used, it will become correspondingly worthwhile to tackle issues of phonological variation.
6 This excludes speakers who command both varieties and may code-switch between them, as well as perhaps a narrow set of circumstances that we think can reasonably be set aside as marginal, such as deliberately performing a different accent.
3.1.1 The scope and aims of this chapter
We have already touched on a number of important themes, each of which alone could easily fill a chapter on phonological variation: lexicon vs. phonology vs. phonetics; intraspeaker vs. interspeaker variation; comprehension vs. production; processing vs. representation. These dimensions could also be crossed with each other to form a very large space of interacting topics: the processing of intra- vs. interspeaker variation, the mental representation of intra- vs. interspeaker variation, and so on. An exhaustive treatment of this space is clearly beyond what could be accomplished in a single chapter. We therefore pursue a smaller set of more narrowly directed aims. Our overarching goal is to make explicit the relevance of phonological variation to the study of the mental lexicon. To do so, we connect the linguistic properties of phonological variation with their observed or potential consequences for lexical representation and, especially, processing. First, in Section 3.2, we review experimental work showing that phonological variation comes into play during lexical access, when language users "look up" words in their mental dictionaries for use in production or comprehension. This literature, which focuses mostly on spoken word recognition, demonstrates that variation sometimes but not always interferes with word recognition, that listeners are able to rapidly accommodate even unfamiliar variation, and that listeners may in fact use phonological variation as extra information to guide lexical access. While these results show that there is some relationship between phonological variation and lexical access, they leave many questions about the nature of this relationship unanswered. We suggest that the path forward should take the linguistic properties of different types of phonological variation into account. Different variables may be represented differently, and have different structural consequences, as we outline in Section 3.3. In Section 3.4, we elaborate on how such differences have the potential to impact lexical processing and conclude by sharing our optimism about the advantages that incorporating phonological variation may offer for models of the mental lexicon.
3.2 How does phonological variation affect lexical access? Although we have noted that psycholinguistic models of the mental lexicon have largely developed on the basis of isolated citation-form words, there are a number of lines of experimental work about how different pronunciations of words might impact lexical access. The evidence from this body of work provides a number of insights. Phonological variation does not necessarily disrupt lexical access; in fact, listeners are quite tolerant of licit variation in isolated words, and can flexibly adapt to novel accents characterized by many co-occurring phonological variables. Further, there is emerging evidence that
listeners can use phonological variation and their knowledge of a speaker's accent to facilitate spoken word recognition, suggesting that understanding phonological variation may ultimately help us understand lexical access processes.
3.2.1 Variation need not impede word recognition Variation creates mismatches between the form in a listener’s lexicon and the phonetic form they perceive. Canonical forms are careful (or even hyperarticulated) pronunciations perceived as matching the “dictionary” pronunciation, while non- canonical forms diverge from that ideal in some respect.7 A reasonable hypothesis following from this point is that a non-canonical pronunciation might delay how quickly a listener can access a word’s lexical entry if the input diverges from the corresponding stored lexical form, whereas canonical forms might diverge less and thus be easier to access. An early example of a study supporting the hypothesis of a canonicality advantage in processing is Andruski, Blumstein, and Burton (1994). This study manipulated the voice onset time (VOT) of voiceless initial consonants in English, which tends to be relatively long in isolated words compared to connected speech. In a semantic priming task, where prior presentation of a semantically related prime word speeds recognition of a target word, primes with longer (i.e., more canonical) VOTs generated more priming than primes with short VOTs. Interestingly, this advantage only arose when the time between the prime and target was very short, pointing to a role for variation in the early stages of spoken word recognition. LoCasto and Connine (2002) found a similar advantage for words like camera with a canonical pronunciation vs. a non-canonical reduced schwa; Racine and Grosjean (2000, 2005) and Racine, Bürki, and Spinelli (2014) reported an advantage for canonical word forms in French with a schwa in the first syllable (e.g., genou = knee). Tucker and Warner (2007) found a small facilitation effect for words pronounced with a canonical word-medial /d/or /g/in comparison to reduced forms of those consonants. Conversely, a number of studies using similar methods with other phonological variables have not found support for a canonicality advantage, instead finding apparently equivalent facilitation for canonical and non-canonical forms. One such variable is place assimilation, where a word-final consonant adopts the place of articulation of the following segment. For example, a word ending in a coronal stop /d/such as wicked can (but need not) be pronounced with a labial stop [b]when the following word is labial-initial, as in [wɪkɪb pɹæŋk] for wicked prank. Gaskell and Marslen-Wilson (1996) show that non-canonical place-assimilated pronunciations yield just as much priming as the non-assimilated forms, contra the predictions of the canonicality advantage. Similar effects in which non-canonical pronunciations did not impede word
7 Ultimately, a pronunciation's "canonicality" is a social construct; see Section 3.4.1 for additional discussion.
38 Ruaridh Purse, Meredith Tamminga, and Yosiane White recognition have been found for nasal flapping (Ranbom and Connine, 2007; Pitt, Dilley, and Tat, 2011; Sumner, 2013, but cf. Pitt, 2009), voicing assimilation (Snoeren, Segui, and Hallè, 2008), and final /t/allophony (Deelman and Connine, 2001; Sumner and Samuel, 2005). Even studies failing to support the canonicality advantage, though, have generally found that listeners’ tolerance for non-canonical pronunciations is not unbounded. Gaskell and Marslen-Wilson (1996) argued that their place assimilation results did not just reflect listeners’ tolerance for mismatch: when the [wɪkɪb] pronunciation of wicked was followed by game, where [b]could not have been generated through place assimilation, it did inhibit lexical access (cf. Gow, 2001). In a slightly different vein, Sumner and Samuel (2005) found that various /t/allophones produced naturally by speakers were accessed equally by listeners, but there was no priming from a minimally contrastive nonword prime (i.e., [flus] compared to non-canonical forms of flute). These results suggest that successful recognition of non-canonical forms depends on the variants being (a) possible pronunciations that are (b) licensed by context. In other words, the variation needs to represent surface patterns that should be familiar to listeners from their real-world listening experiences. There is some evidence that the effect of context in facilitating lexical access of different pronunciations may be gradient. Tucker (2011), while finding a general advantage for canonical pronunciations, also found that the predictability of a non-canonical pronunciation in context improved its acceptability and speeded response times. This is supported in the domain of production by findings that French speakers produce words with non-canonical schwa omission faster as the relative frequency (i.e., predictability) of this form, compared to the canonical form, goes up (Bürki, Ernestus, and Frauenfelder, 2010). Another aspect of context that appears to modulate the recognition of non-canonical forms is speech style. In a study probing the conflicting results on whether nasal flapping (e.g., [splɪntɚ]~[splɪnɚ] for splinter) exhibits a canonicality advantage, Sumner (2013) observed equivalent priming from naturally produced canonical and non-canonical primes, but no priming from a non-canonical variant spliced into a carefully articulated word frame. In the latter case, the non-canonical variant was not licensed by the context of other acoustic cues to speech style within the word; Sumner proposes that careful and casual speech styles induce different processing modes, which may not be efficient for dealing with variants that are incongruous with the style. However, Bürki, Viebahn, Racine, Mabut, and Spinelli (2018) found that lexical decision latencies for words with and without schwa-reduction were equivalent across careful and casual speech styles. The full set of results in this literature are challenging to reconcile completely, but that is to be expected for an area of such active inquiry. What we can take away at present is that hearing a word pronounced in a non-canonical way does not necessarily disrupt recognition of that word, and that listeners’ ability to reconstruct or predict the variants from the surrounding context may facilitate the processing of phonological variation. 
However, this ability is contingent on the variation being consistent with listeners’ social, stylistic, and linguistic experiences in the real world.
3.2.2 Listeners adapt even to pervasive and unfamiliar variation The mixed support for the canonicality advantage hypothesis suggests that, on the whole, listeners are quite good at coping with variation so long as it is limited and familiar. But we might further inquire how variation influences word recognition processes when listeners encounter many different variable features at once, or when those features are not part of a listener’s own production repertoire. Both of these situations are a consequence of interspeaker variation, particularly between regional varieties of a language. These regional sub-varieties of a language are commonly called dialects, and we can refer to the full set of a dialect’s pronunciation features as an accent. Of course, the term “accent” can also be used to talk about the pronunciation of second-language speakers of a language. The notion of accent variation is intuitively familiar to listeners, who generally think of them in holistic terms; for example, a speaker might be said to have “a Southern accent” or “a French accent” even though the listener is unlikely to identify the cluster of features that give rise to that percept. Naïve listeners have shown in perceptual categorization experiments that they are broadly able to identify where speakers are from, and that they can use specific acoustic- phonetic properties of talkers’ speech to make their judgments (van Bezooijen and Gooskens, 1999; Clopper and Pisoni, 2004, 2006, 2007). This skill is not limited to adult listeners. Children as young as 12 months are sensitive to dialect differences while listening to speech (Schmale, Cristià, Seidl, and Johnson, 2010). In fact, by four years of age listeners are able to group similar talkers together and distinguish them from dissimilar talkers, with further major developmental improvements in classifying talkers by region happening between the ages of 7 to 11 (Jones, Yan, Wagner, and Clopper, 2017; Evans and Lourido, 2019). A number of studies have shown that listeners have highly flexible word recognition processes that allow them to quickly adapt to both regionally accented (Best, Shaw, and Clancy, 2013; Maye, Aslin, and Tanenhaus, 2008, i.a.) and foreign accented speech (Clarke and Garrett, 2004; Bradlow and Bent, 2008; Witteman, Weber, and McQueen, 2013, 2014; Vaughn, 2019; Imai, Flege, and Walley, 2003; Bent and Frush Holt, 2013). Specifically, the evidence suggests that listeners experience a temporary disturbance when encountering a new accent, but this is normalized at an early stage of processing and improves over time (Floccia, Goslin, Girard, and Konopczynski, 2006; Goslin, Duffy, and Floccia, 2012). Adaptation is even possible for laboratory-created accents, so long as listeners are given enough exposure to them (Weatherholtz, 2015). Further, listeners’ expectations about what accent they are going to hear from a novel talker can impact how successfully they adapt (Vaughn, 2019).
3.2.3 Listeners may use phonological variation to guide lexical access
If phonological variants do not dramatically inhibit word recognition and listeners can quickly come to understand different accents (even if we are not sure how they do it),
40 Ruaridh Purse, Meredith Tamminga, and Yosiane White does the study of the mental lexicon really need to deal with phonological variation? The conclusion that real-world, appropriately contextualized variation never derails lexical access is probably premature; in Section 3.4 we will discuss cases where we think such issues are quite plausible. However, the canonicality advantage hypothesis does not capture the only possible way in which phonological variation might be relevant to lexical access. In fact, in this section we turn to the mounting experimental evidence that phonological variation can actually provide information to listeners that may help guide their lexical access processes. Many of the studies in Section 3.2.1 not only fail to show negative consequences for non-canonical pronunciations but also that listeners actually use phonological variants to anticipate upcoming words (Bürki, 2018; Gow, 2001, 2002; Lahiri and Marslen- Wilson, 1991; Tucker, 2011, i.a.). In an eye-tracking study, Mitterer and McQueen (2009) presented Dutch listeners with words in sentential context, and with a deleted word- final /t/to make them ambiguous (e.g., tast = ‘touch’ sounds like tas = ‘bag’). Listeners used probabilistic knowledge of the likelihood of /t/deletion in different following contexts to anticipate which image to look toward. Besides cues from intraspeaker variation in the input, listeners can also use knowledge of a speaker’s background to guide lexical access. Listeners presented with a word like pants, which has a different dominant meaning in the United Kingdom (pants = undergarment) vs. the United States (pants = trousers), used the accent of the speaker to facilitate semantic access to the congruent meaning of the word (Cai, Gilbert, Davis et al., 2017). Accents can also help listeners decode non-canonical pronunciations in isolated words. American listeners hearing /r/-final words pronounced either with consonantal /r/in a General American accent (e.g., slender as [slɛndɚ]), or with vocalized /r/in either a British or New York City accent (e.g., [slɛndə]) were able to use their knowledge of the British English accent to support their understanding of /r/-vocalized words in isolation in a priming task. However, /r/-vocalized words were much harder to recognize in the New York City accent (Sumner and Kataoka, 2013). Sumner, Kim, King, and McGowan (2014) argue that this is because certain language varieties are more salient or idealized than others, which facilitates lexical access.
3.2.4 Dimensions of phonological variation in lexical access The body of work discussed in this section makes a strong start at examining the ways in which phonological variation affects lexical access. We suggest that a useful next step in understanding how variation is represented and processed could be to consider the ways in which phonological variation is not monolithic. One possible reason that the experimental studies discussed in this section have not reached a consensus on these issues is that they manipulate different phonological variables, which in turn have different representations and may interact differently with lexical access processes. While the
canonicality advantage literature has acknowledged and explored the point that gradient phonetic variation—such as the VOT manipulation of Andruski et al. (1994)—is probably different in some respect from variation between different phonemes—such as the variation between /d/ and /b/ in Gaskell and Marslen-Wilson (1996)—there is much more that could be said about the dimensions along which we could classify different kinds of phonological variation. In the next part of this chapter, we provide a non-exhaustive overview of some such dimensions, with an eye to how incorporating a more linguistically complex understanding of phonological variation might facilitate psycholinguistic research on the mental lexicon.
3.3 The complexity of phonological variation We have just proposed that in the study of the mental lexicon, our ability to surmount the difficulties of phonological variation may be dependent on a detailed empirical understanding of the variability itself. In this section, we therefore outline the ways phonological systems can differ between speakers, as well as how the types of variation produced by individual speakers can be classified based on their representation. We primarily draw our examples from varieties of English due to the abundance of work on variation in Englishes and the fact that the majority of readers will have some frame of reference for these varieties. However, the typology of phonological phenomena that we describe is relevant to all language varieties. Taking representational differences into account can better position us to make realistic predictions about how phonological variation impacts lexical access. We begin by focusing on the linguistic descriptions in this section, and then expand on their possible processing consequences in Section 3.4.
3.3.1 Interspeaker variation Different people speak differently, even when they are speaking what is ostensibly the same language. Many of the studies mentioned in Section 3.2 explored how different accents are processed. However, research of this type has not generally asked about the specific linguistic and (with some exceptions) social properties of accents. Here we cover four elements that characterize accents and can differ between them: representations for specific words, the phonetic realization of phonemes, the structure of the phonemic inventory, and allophonic processes. In some cases, interspeaker phonological variation is lexically specific. A person from New York City and a person from London could both say the English word tomato, but the New Yorker is likely to say something like [təˈmeɪɾoʊ], while the Londoner is likely to say something like [təˈmɑ:toʊ]. Both are speaking English, and both are using the same
42 Ruaridh Purse, Meredith Tamminga, and Yosiane White word in reference to the same (prototypically) edible red fruit, but the pronunciations are different: in most varieties of American English, the second syllable of tomato has a vowel like that in the word face,8 but in most British Englishes, the vowel in this syllable matches the one in palm. This is not a phonological difference that generalizes to other words; the British English pronunciation of mate is not [mɑ:t]. Therefore, the same word must be represented with different phonemes depending on the variety. Lexically specific phonological differences can be found at a smaller scale, too. So far, we have juxtaposed accents such as “American English” vs. “British English.” However, the Atlas of North American English (Labov, Ash, and Boberg, 2006) actually identifies seven major dialect regions across the United States and Canada (and British English may be even more internally diverse). Similar distinctions can be made between many of these accents for certain words; for example, for speakers in the northern half of New Jersey, the preposition on contains the lot vowel while in southern New Jersey it has the thought vowel (Coye, 2009). In addition to some lexically specific differences, a common difference between accents concerns the across-the-board phonetic realization of phonemes. In fact, the primary criteria used in the Atlas of North American English (Labov et al., 2006) to draw boundaries between dialect regions concern differences in vowel quality. For example, speakers from the eastern Great Lakes region in the north of the United States produce lot with a fronted vowel compared to speakers from elsewhere. A fronted lot is one that has, over time, developed into a vowel more like the one that most other American English varieties use in trap. In this lot-fronting variety, trap is also shifted from an earlier position, so that lot does not overlap with trap and the two vowels are still distinct. When two vowel shifts seem to push or pull each other along in the same direction like this, it is called a chain shift. Chain shifts can result in dramatic differences in the realization of phonemes while keeping all the categories intact. The two chain links we have described here (lot fronting and trap tensing) are part of a larger chain called the Northern Cities Shift, which characterizes the accent of the Inland North dialect region (see Figure 3.1). A number of US dialects are involved in chain shifts like this one, which Labov (2012) argues are leading to greater regional differentiation than homogenization in the United States. Beyond the phonetic properties of various phonemes,9 we can also ask about the number of phonemic categories in an accent and how they are organized. The main mechanisms by which a language variety comes to have a different number of phonemes are mergers and splits. There are many dialects in which some two phonemes have
8 In
dialectology and sociolinguistics, words in small capital letters are conventionally used to represent vowel phonemes, whatever the actual pronunciation of this vowel in a given variety. This kind of representation is called a lexical set (Wells, 1982) because it picks out the set of words that, historically, share the same phoneme. 9 It could be argued that even if all the phonemes of two systems have quite different phonetic properties, it does not constitute a phonological difference so long as every phoneme in one system finds a structural equivalent in the other.
Figure 3.1 Vowel change trajectories comprising the Northern Cities Vowel Shift.
merged to form a single category. A well-known example in English involves the vowels in lot and thought. In some dialects, the vowels in these words remain distinct. However, many dialects have undergone a merger in which these lexical sets have combined into a single large lexical set, reducing the overall number of contrastive phonemic categories in the inventory (for American English, see Labov et al., 2006, p. 58). Conversely, when one phoneme splits into two,10 the number of contrastive phonemic categories increases. While mergers and splits constitute qualitative differences in the structure of the phonemic inventory, these effects are often not salient to the speakers themselves. Two speakers of American English varieties might disagree about whether words like cot and caught are pronounced the same, but they are unlikely to comment on or even notice this disagreement (Labov, 1994, p. 344). Without reorganizing the underlying phonemic inventory, accents may exhibit different allophonic processes that affect how phonemes are realized in certain contexts (Labov, Fisher, Gylfadottir, Henderson, and Sneller, 2016; Sneller, 2018). This means rules like the variable palatalization process outlined in Section 3.1 may or may not exist in a given dialect. A useful example may be found in rhoticity: the pronunciation of /r/when there is no following vowel (Scobbie, 2006). Most varieties of English spoken in the United States, Canada, Scotland, and Ireland are rhotic, meaning /r/is always a consonant, typically an approximant like [ɹ]. However, many varieties of English spoken in England, Wales, Australia, and New Zealand are non-rhotic, meaning there is an obligatory allophonic rule turning /r/into a vowel when it is followed by a consonant 10 The mechanism for splitting is more complicated. Normally, it takes place in two stages: (1) an allophonic alternation stage, in which one phoneme is realized two different ways according to linguistic context, and (2) the loss of the triggering environment for allophony, so the different realizations are no longer in complementary distribution and must be reanalyzed as contrastive.
44 Ruaridh Purse, Meredith Tamminga, and Yosiane White or a pause.11 For other varieties still, most notably certain regional varieties of English from areas of the United Kingdom (Wells, 1970; French, 1989; Blaxter, Beeching, Coates, Murphy, and Robinson, 2019) and the United States (Labov, 1972, 2001; Feagin, 1990; Carmichael and Becker, 2019), consonantal or vocalic /r/are both possible in instances of the same context. For speakers of these dialects, it makes sense to posit a variable rule that probabilistically turns non-prevocalic /r/into a vowel. Every person who has acquired language has acquired a particular accent, even if that accent happens to be held up as a prestigious or “standard” way of speaking. Prestigious language varieties are often assigned properties of neutrality or universality, but there is no objective linguistic basis for such a designation. For example, “standard” or “mainstream” American English is a non-uniform collection of varieties typically associated with white speakers from the Midwest or non-urban Northeast of the United States. Not only is the perspective that these varieties are objectively “neutral” born from racist and classist ideology, it conceals assumptions about linguistic representation and processing that should be interrogated. In Section 3.4.1, we unpack some possible consequences of failing to account for interspeaker variation.
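To make the notion of a variable rule concrete, here is a minimal sketch of a rule that probabilistically vocalizes non-prevocalic /r/, as in the variably rhotic dialects just described. It is our own toy example rather than a formalism from the variationist literature; the segment inventory, the application rate of 0.6, and the function name vocalize_r are assumptions made purely for illustration.

```python
import random

VOWELS = set("aeiouə")  # toy vowel inventory, an assumption for this example

def vocalize_r(segments, p=0.6):
    """Toy variable rule: non-prevocalic /r/ surfaces as a vowel with
    probability p; otherwise the consonantal variant is retained. In a real
    grammar, p would be conditioned on linguistic and social factors."""
    output = []
    for i, seg in enumerate(segments):
        following = segments[i + 1] if i + 1 < len(segments) else None
        non_prevocalic = seg == "r" and (following is None or following not in VOWELS)
        if non_prevocalic and random.random() < p:
            output.append("ə")   # vocalized variant
        else:
            output.append(seg)   # rule did not apply (or /r/ was prevocalic)
    return output

# "card" /k a r d/: the /r/ is non-prevocalic, so it vocalizes on roughly 60% of runs.
print(vocalize_r(list("kard")))
```

The obligatory non-rhotic rule discussed above would be the special case in which p is fixed at 1 for the relevant context.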
3.3.2 Intraspeaker variation We have just seen that a group of people all speaking the same language cannot be treated as uniform. It would also be inaccurate to characterize each individual as an invariant member of this heterogeneous group. Indeed, it is not an exaggeration to say that every utterance involves a number of decisions to produce words in one way and not another. We have already encountered a number of examples where an individual can produce the same utterance in different ways in Sections 3.1 and 3.2, noting that the issue of how these options are represented may not be straightforward. In reality, different types of intraspeaker variation probably work differently in this respect. In a recent review of phonological variation from a cognitive perspective, Bürki (2018) lays out some dimensions for classifying variable phonological phenomena, such as distinguishing between deletions, insertions, and substitutions. Here we build on that foundation by outlining some dimensions of structural classification that could prove relevant for word recognition and other elements of processing. Phonological variation is complex and pervasive, and attending to these complexities may prove fruitful in designing experiments and interpreting seemingly-incongruous results. In at least some cases of intraspeaker variation, the most parsimonious analysis seems to be to allow for variation inside the lexicon itself, disrupting the basic notion of a form–meaning pair by ascribing multiple forms to a single meaning. For example, many speakers of American English can pronounce the word economic with 11 We could entertain the possibility that some set of vowel phonemes (near, square, start, north, force, cure, letter) just have different phonetic properties in these dialects. However, this analysis is not as successful at capturing patterns of intervocalic /r/(e.g., he is vs. here is).
Phonological variation and lexical form 45 an initial vowel like that in fleece or dress. Since those same speakers are not free to interchange those vowels otherwise, we might think that their lexicon contains two different forms for economic. This resembles the tomato example in Section 3.1 in that it is lexically idiosyncratic, but speakers actually produce both forms. On the other hand, many patterns of variation apply more generally, affecting all words with the relevant phonological properties. The same representational desiderata that motivate abstraction in the invariant phonology can be applied here. In Section 3.1, we suggested that such patterns could be accounted for through phonological rules that are stipulated to apply with some probability rather than obligatorily. Such a rule is called a variable rule and has long been a prominent way of thinking about phonological variation in quantitative sociolinguistics (Weinreich et al., 1968; Cedergren and Sankoff, 1974; but cf. Fasold, 1991). While the original variable rules were patterned on the phonological rule formalisms of Chomsky and Halle (1968), the modeling of phonological variation across a range of formal frameworks is a vibrant area of active research (for overviews, see Coetzee and Pater, 2011; Nagy, 2013). In addition to the possibility of representing variation within the lexicon, we must contend with other non-phonological levels of structure. For example, there is some debate around whether phenomena like variable palatalization really constitute the same kind of process as their invariant counterparts, or instead come about as a consequence of how a sequence of segments is sometimes executed in the phonetics. More specifically, when producing the sequence press you, speakers may not always perfectly separate the alveolar articulation of [s]and the palatal articulation for [j]. Instead, these articulatory movements are produced simultaneously and the segments are coarticulated in a way that can acoustically resemble [ʃ]. In order to consider whether a process is phonological or phonetic, we must consider the properties of these two modules (see e.g., Pierrehumbert, 1990; Cohn, 1993 for in-depth discussions). One generally held distinction between phonological and phonetic operations concerns the properties of categoricity and gradience.12 Phonological processes are typically held to take, as both input and output, some finite number of discrete categories that the language user stores in memory, e.g., /s/, [s], [ʃ]. The phonetics, on the other hand, control all the continuous and infinitely subdivisible dimensions of the physical instantiation of language, e.g., all possible configurations of an idealized target [s]. As mentioned earlier, this is one difference between Andruski et al.’s (1994) manipulation of VOT and the other, more categorical, variables laid out in Section 3.2.1. Exploring the properties of categoricity and gradience in production, Zsiga (1995) uses electropalatography to investigate how speakers articulate underlying / ʃ/ (fresher), word-internal derived [ʃ] (pressure), and word-final derived [ʃ] (press you). She concludes that variable palatization does not result in precisely the same [ʃ] as in underlying /ʃ/or obligatorily derived [ʃ]. The latter two were indistinguishable from each other, although this is not necessarily always the case for obligatory derivation
12 Not to be mistaken for invariance versus variability.
46 Ruaridh Purse, Meredith Tamminga, and Yosiane White either (e.g., Port, Mitleb, and O’Dell, 1981; Ernestus and Baayen, 2006). Similarly, Ellis and Hardcastle (2002) looked at variable place assimilation in sequences like green card, which could be produced with a velar nasal [ŋ] or an alveolar nasal [n](compare obligatory word-internal assimilation, e.g., enter, amber, prank). They find that while some speakers variably produce a fully velar [ŋ], others retain some residual alveolar articulation. Importantly, just because the phonetic realization of, for example, palatalization in press you is different from that in pressure, it does not necessarily follow that the former is not phonological. The different types of palatalization cannot be represented with a single process anyway, since one is obligatory and one is variable. As a larger point, phonological processes need not be structure preserving. That is, they do not have to result in sounds that are already part of the underlying inventory (Scobbie, 1995; Bermúdex-Otero, 2010). For instance, most speakers of American English will produce postvocalic /t/and /d/as a flap, [ɾ], before unstressed vowels (e.g., city, writer, rider), but /ɾ/is not generally considered to be a sound that is available for underlying lexical representations because it only occurs in certain predictable environments (Kiparsky, 1979; Kahn, 1980). Structure non-preserving allophony is licit whether it neutralizes the contrast between multiple underlying segments (e.g., latter and ladder become homophones), or there is only one possible underlying representation of a surface segment, like the word-final /t/manipulations overviewed in Section 3.2.1 (Deelman and Connine, 2001; Sumner and Samuel, 2005). A proper diagnosis of categoricity in this sense, then, does not require the recreation of phonological categories that occur elsewhere in the system. Rather, it is only necessary that the various instances of forms be distributed in discrete categories, whatever the precise nature of these categories turns out to be. To further exemplify this point, consider British English /t/-glottaling and TH- fronting.13 Most speakers of British Englishes can optionally realize /t/as [t]or as a glottal stop, [ʔ], (Stuart-Smith, 1999; Fabricius, 2002) word-finally or before unstressed syllables in the same word (e.g., bottle, butter, bat). Like for flapping, /ʔ/is not typically considered part of the underlying inventory, that is, the process is structure non- preserving. Plus, instances of British English /t/-glottaling are not equally distributed across phonetic continua like the gradual reduction of a coronal gesture or constriction of the glottis. Rather, there is variable, but categorical, selection between discrete coronal and glottal closure options (Heyward, Turk, and Geng, 2014). In contrast, TH- fronting is a variable feature of many varieties of British English (Kerswill, 2003) and African American Language (Green, 2002; Sneller, 2020), whereby underlying interdental fricatives /θ ð/can be realized as labiodental fricatives [f v]. Since labiodental fricatives are required for the representations of other words (e.g., free, vine), TH- fronting is a structure preserving process. As with all structure preserving processes, it is also neutralizing, since the contrast between interdentals and labiodentals is lost when it applies (e.g., three becomes homophonous with free). The dimensions of structure
13 TH here stands in for both voiced /ð/ and voiceless /θ/ interdental fricatives.
Phonological variation and lexical form 47
Table 3.1 Examples of variables classified by structure preservation and neutralization

                     Structure preserving    Structure non-preserving
Neutralizing         TH-fronting             Flapping
Non-neutralizing     —                       /t/-glottaling
preservation and neutralization are relevant for processing because they impinge on matters of the nature and number of stored segments, and potentially the readiness with which underlying forms are retrieved. Of course, the invariance problem means that it is often not a straightforward task to identify categories. Phonetic variation is a constant, even if it is not solely responsible for a pattern of variation. In practice, this means that phonological categories manifest as individual distributions and not individual points in phonetic space, since instances where some category is the intended target will inevitably be perturbed by phonetic variation. Moreover, phonological processes are often developed from the stabilization of more gradient phonetic variation (Bermúdez-Otero, 2007), and these physiologically motivated phonetic phenomena can remain and co-exist even after a phonological process is established from them (Bermúdez-Otero, 2013). This means that phonological variables are often accompanied by diachronically related phonetic variation that resembles and even conceals them. Another factor in determining the structural properties of a potential phonological variable involves the contexts in which it occurs and, specifically, how it interacts with morphology. As mentioned in Section 3.1, words ending in unstressed -ing can variably be pronounced with [ɪŋ] or [ɪn]. As it turns out, the form with a coronal nasal appears much more often in verbal forms, where -ing is a suffix (e.g., working~workin’), than in nouns (e.g., awning~awnin’). Even without detailed evidence of categories in phonetic space, many grammatical theories do not allow the phonetics to be directly affected by morphology like this (Fodor, 1983; Bermúdez-Otero, 2010). This kind of a relationship between morphology and phonetics is prevented by the concept of modularity, which relegates different kinds of operations to separate parts of the grammar that are strictly ordered: by the time a word or utterance gets to the point of phonetic implementation, its internal morphological structure is no longer relevant. If we apply modular reasoning to variable processes, we have to conclude that variation in -ing is not phonetic. However, there is a different source of ambiguity present in these kinds of variables. Specifically, we can also account for morphologically conditioned patterns in variation by saying the morphology has some capacity for variation itself. Just as [t]and [ʔ] are possible allophones of /t/, perhaps /ɪŋ/and /ɪn/are possible allomorphs of -ing. Thus, it is not clear whether an instance of the word workin’ was rendered in this form by way of phonological process or if the speaker selected an alternative form of the -ing suffix with a coronal nasal. Ultimately, these levels of structure (phonetics, phonology, morphology) are not always neatly separable, and what look like cases of a single variant may actually have a
48 Ruaridh Purse, Meredith Tamminga, and Yosiane White number of different sources. Furthermore, different phonological processes interact with the underlying inventory of sounds in different ways. Understanding these structural properties, more broadly, helps us to discern the kind of mechanics that are at play when someone produces phonological variation, and the nature of the processing tasks required to interpret the linguistic signal as it is perceived.
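The diagnostics of structure preservation and neutralization discussed above can be stated almost mechanically. The sketch below is a toy illustration under our own assumptions (a miniature inventory, a four-word lexicon, and invented helper names, with conditioning environments ignored); it simply asks whether a variant's output segments belong to the underlying inventory and whether applying the variant collapses any lexical contrast, reproducing the classification in Table 3.1.

```python
# Toy underlying inventory and mini-lexicon; both are assumptions for illustration.
INVENTORY = {"t", "d", "f", "v", "θ", "ð", "r", "l", "i", "ɪ", "æ", "ə"}
LEXICON = {"θri": "three", "fri": "free", "lætər": "latter", "lædər": "ladder"}

def classify(mapping):
    """Classify a variant (underlying -> surface segment mapping) as structure
    (non-)preserving and (non-)neutralizing with respect to the toy lexicon.
    The mapping is applied everywhere, ignoring conditioning environments."""
    preserving = all(surface in INVENTORY for surface in mapping.values())
    seen = {}
    neutralizing = False
    for form, gloss in LEXICON.items():
        surfaced = "".join(mapping.get(seg, seg) for seg in form)
        if surfaced in seen and seen[surfaced] != gloss:
            neutralizing = True          # two distinct words now sound the same
        seen[surfaced] = gloss
    return preserving, neutralizing

print(classify({"θ": "f", "ð": "v"}))   # TH-fronting:    (True, True)
print(classify({"t": "ɾ", "d": "ɾ"}))   # flapping:       (False, True)
print(classify({"t": "ʔ"}))             # /t/-glottaling: (False, False)
```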
3.3.3 Non-linguistic factors in phonological variation Our overview thus far may have given the impression that phonological variation merely injects uncertainty into multiple levels of structure. In reality, phonological variation is systematically conditioned by a number of factors. Weinreich et al. (1968) refer to this systematicity as orderly heterogeneity. There are observable linguistic differences correlated with many social dimensions such as gender identity (Eckert, 1989; Kiesling, 2004; Zimman, 2009), race/ethnicity (Fought, 1999; Hoffman and Walker, 2010; Bucholtz, 2011; King, 2016; Holliday, 2019), class or socioeconomic status (Labov, 1966; Rickford, 1986; Eckert, 1988; Labov, 1990), and sexual orientation (Gaudio, 1994; Moonwoman-Baird, 1997; Bucholtz and Hall, 2004). All of these elements of a person’s identity are relevant to the language they produce and are negotiated with regard to the particular setting and audience of an utterance. The interplay between various social dimensions is particularly well exemplified in classic observations of the social stratification of phonological variation. Variants whose rate of use is correlated with a language user’s socioeconomic status are typically also correlated with formality. An early demonstration of this effect can be found in Labov’s (1966) investigation of the social stratification of rhoticity in New York City. Speakers increasingly used consonantal realizations of non-prevocalic /r/, the variant associated with American English speakers of a higher socioeconomic status, as they performed tasks inducing more linguistic self-monitoring and a more formal style. Results like these suggest that speakers understand how phonological variation is socially stratified and can draw on this knowledge to inform their linguistic choices. As such, inter-and intraspeaker variation are not divorced from one another but closely intertwined. This basic premise is the foundation of “Third Wave Sociolinguistics” (Eckert, 2012), which focuses on individuals’ capacity to dynamically construct and perform their identities by using linguistic features that have garnered particular social meanings according to the context they appear in. The fact that individuals have detailed knowledge of how phonological variation is socially organized is further demonstrated by studies that explicitly elicit listener judgments of an interlocutor (Campbell-Kibler, 2009, 2011). These show broad agreement in terms of what social information can be inferred from certain linguistic behaviors. All of this is to say that language is inherently social, and any linguistic processing must occur in tandem with processing of social information.
3.4 Consequences of phonological variation for models of the mental lexicon Section 3.3 provided a detailed, though not comprehensive, look at the possible grammatical underpinnings and structural outcomes of different types of phonological variation. Equipped with this background, we now turn to consider some additional ways in which these structural aspects of phonological variation might be relevant to processing (that is, beyond the widely explored questions of non-canonicality that we discussed in Section 3.2). We find it useful to frame some of these questions in terms of their methodological implications, in order to highlight the inescapability of these issues even for studies that are not designed to address phonological variation. However, the issues we spell out are not merely methodological, since the potential differences between a model talker and a participant that we will outline could equally well be thought of as potential differences between two real-world interlocutors.
3.4.1 When processing meets mismatching representations A practical consideration that we have only touched on briefly so far is that experimental research in spoken word recognition must choose what form of each word to use as a stimulus. The usual practice is to create stimuli using a model talker whom the researchers judge to sound “standard,” producing word stimuli with the form that is taken to be “canonical” in some respect. As we discussed in Section 3.2, canonical forms may or may not have a special representational status, but even if they do, we must recognize that standard accents and canonical forms are inherently socially constructed. And even once we acknowledge that these are social constructs, there are cases where different options are equivalently standard or canonical in different varieties of the language. Choices that feel like neutral defaults to a researcher, then, simply have no guarantee of either matching what is in any given participant’s mental lexicon or reflecting the bulk of their real-world listening experience. And as our discussion in Section 3.3 makes clear, there are many dimensions along which the form chosen by the researcher might not align with participants’ mental representations. Many of the under-explored issues we identify have to do with mismatches between language users with or without some merger: in other words, representational differences across individuals. Consider the word competitors that are entertained and eliminated over time in cohort- based word recognition models (Marslen- Wilson, 1987; see Magnuson and Crinnion, this volume, for more detail). In these models, when a listener hears an initial sound such as [k], they generate a list of possible word candidates starting
50 Ruaridh Purse, Meredith Tamminga, and Yosiane White with /k/: capture, kick, cotton, continent, caution, cough, and so on. But if the second sound is [ɑ], a listener with a merged lot-thought class (most often realized phonetically as [ɑ]) might eliminate only capture and kick from this (partial) list, while the listener with a lot-thought distinction might additionally eliminate caution and cough (because that listener expects those words to contain /ɔ/, not /ɑ/). A related lexical property like neighborhood density,14 which has been shown to influence word identification ease and accuracy (Vitevitch, Luce, Pisoni, and Auer, 1999; Vitevitch and Luce, 2016), might exhibit similar types of differences depending on whether it is calculated over lot-thought merged or lot-thought distinct lexicons.15 Interestingly, this suggests that dialects differing along phonemic inventory lines such as the lot-thought merger might offer a useful opportunity to study homophone representation (Swinney, 1979; Caramazza, Costa, Miozzo, and Bi, 2001) because they may allow a minimal comparison between speakers for whom some word pairs are and are not homophones. Another merger- related question is whether nonwords used as experimental stimuli might sometimes have an unrecognized real word status to listeners with different lexical or phonological mental representations. A researcher might construct an intended nonword frind without taking into account that to a listener who has a pin/pen-merger (where dress and kit are merged before nasal consonants) this would be a perfectly good instance of the real word friend. The real-world flipside of this question is whether a pronunciation like [frɪnd] from a Southern-accented talker might be processed as a nonword by a non-Southern listener. Is it plausible that such representational mismatches intervene in processing in this way, given the literature we already surveyed in Section 3.2? This premise is supported by the observation that misunderstandings arising from phonemic inventory differences are not uncommon in everyday conversational interaction (Labov, 1994, p. 324–327), even in seemingly disambiguating communicative contexts. It is also supported by a number of cross- dialectal comprehension studies targeting specific sources of representational mismatch (Labov, Karen, and Miller, 1991; Flanigan and Norris, 2000; Labov, 2010). In other words, although our Section 3.2 discussion emphasized evidence that listeners are eventually able to overcome the challenges of phonological variation, we should not conclude that those challenges do not arise at all. And methodologically speaking, we cannot safely assume that these issues are minor in scope or easily avoidable. For example, the presence vs. absence of the lot-thought merger divides the geographic territory of US English approximately in half across a number of non-contiguous dialect regions (Labov et al., 2006, p. 59). Beyond these interspeaker differences, we highlighted issues of structure preservation and neutralization in Section 3.3 because intraspeaker variation can give rise to parallel issues when it is structure preserving or neutralizing. For example, the variable deletion of word-final /t/and /d/in consonant clusters, which happens in every English dialect we are aware of, can generate homophony (past/pass, mold/mole) and erase morphological
14 Neighborhood density measures capture how many other words in the lexicon have a similar phonological shape; see Magnuson and Crinnion, this volume.
15 Exactly how these differences play out will depend on the calculation method and the treatment of homophones within that method.
information (jumped/jump) but does not always do so (act, spent). These inconsistent lexical consequences pose challenges for listeners and researchers alike. Intraspeaker variation that is structure non-preserving, on the other hand, introduces variants that inherently signal their derived status by virtue of not existing in the phonemic inventory, a signal that could in principle trigger shifts in how lexical access processes proceed. Nonetheless, structure non-preserving intraspeaker variation can additionally generate homophony when it is neutralizing (latter/ladder), reintroducing some of the issues around cohort competitors and neighborhood density that we discussed for more basic merger examples. Empirical questions about how any particular variable is represented, such as whether it involves incompletely neutralized phonetic variants or an alternation between discrete phonemic units, take on new importance in light of the possibility that they may give rise to different processing consequences.
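To make the processing stakes of such representational mismatches concrete, the toy sketch below (our own illustration in Python, using invented mini-transcriptions rather than any real corpus or dictionary) shows how the same word list yields different cohort competitor sets, and different neighborhood counts, depending on whether the lexicon encodes the lot-thought distinction.

```python
# Toy illustration (not from this chapter): how a lot-thought merger changes
# cohort pruning and neighborhood density over the *same* word list.
# The "transcriptions" are invented ASCII strings ("A" = lot vowel, "O" = thought vowel).

DISTINCT = {                       # a listener who keeps lot and thought apart
    "cotton": "kAtIn", "continent": "kAntInInt",
    "caution": "kOSIn", "cough": "kOf",
    "capture": "kaptSIr", "kick": "kIk",
}
MERGED = {w: p.replace("O", "A") for w, p in DISTINCT.items()}   # collapse thought into lot

def cohort(lexicon, heard_prefix):
    """Words still compatible with the phonemes heard so far."""
    return sorted(w for w, p in lexicon.items() if p.startswith(heard_prefix))

def neighbors(lexicon, word):
    """One-substitution phonological neighbors; density is the length of this list."""
    target = lexicon[word]
    return [w for w, p in lexicon.items()
            if w != word and len(p) == len(target)
            and sum(a != b for a, b in zip(p, target)) == 1]

print(cohort(DISTINCT, "kA"))         # ['continent', 'cotton']
print(cohort(MERGED, "kA"))           # ['caution', 'continent', 'cotton', 'cough']
print(neighbors(DISTINCT, "cotton"))  # []
print(neighbors(MERGED, "cotton"))    # ['caution'] -- the merger creates a new neighbor
```

Even in this six-word toy lexicon, the merged and distinct systems disagree both about which candidates remain active after two sounds and about how dense a word's neighborhood is, which is exactly the kind of divergence a stimulus designed around one variety cannot control for.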
3.4.2 When social information comes into play Work in the emerging area of sociolinguistic cognition (Campbell- Kibler, 2010; Loudermilk, 2013; Chevrot, Drager, and Foulkes, 2018) suggests that listeners’ navigation of inter-and intraspeaker phonological variation during lexical access is guided by their experience with the social influences on variation such as we discussed in Section 3.3. There is robust evidence that listeners can use social information to make inferences about a speaker’s likely linguistic system.16 They can then take into account whatever knowledge they have of that system, instead of relying on their own system, when identifying sounds and words produced by that speaker (Niedzielski, 1999; Strand, 1999; Hay and Drager, 2010; D’Onofrio, 2015; Hay, Walker, Sanchez, and Thompson, 2019). This kind of reverse-engineering appears to be possible based on the presence of linguistic features that tend to co-occur (such as different features of a Southern accent) without extra social information (Dahan, Drucker, and Scarborough, 2008), although it remains an open question whether these effects are mediated by social inference (Wade, 2020). For example, in the frind/friend example above, even a Southern-accented participant might be able to use the model talker’s non-Southern accent to infer that the model talker did not intend the word friend. Of course, the fact that this listener might ultimately reach the correct conclusion about the intended nonword status of the stimulus does not rule out the possibility that they first retrieved friend and then backtracked, or were otherwise delayed in ways that a non-Southern listener might not have been. The time course of reasoning about an interlocutor’s differing linguistic system, and how it interacts with possibly more basic lexical access mechanisms, is certainly not well established. Furthermore, while there is evidence that listeners can also use linguistic and social information to help guess the intended word when intraspeaker variation is in play (Mitterer and McQueen, 2009; Casasanto, 2008), inherent variability means that incorporating such information can favor, but not guarantee, the correct outcome. 16 It should be noted that listeners’ ability to use such information is modulated by listeners’ social attitudes (Kang and Rubin, 2009).
The fact that phonological variation encodes social information in speech, then, makes it an exciting frontier for understanding the mental lexicon. On the comprehension side, phonological variation is not simply noise that listeners must factor out. Rather, it is a rich source of structured information about speakers themselves and their likely behavior. Recent approaches to incorporating this kind of social information into processing models include the dual-route approach to socially weighted encoding proposed by Sumner et al. (2014) and the ideal adapter model of Kleinschmidt and Jaeger (2015; see also Kleinschmidt, 2019). On the production side, producing phonological variation demands that speakers dynamically shape their own speech to be fluid, connected, and socially and contextually appropriate; Babel and Munson (2014) give a useful overview of production issues related to variation. However, this same social sensitivity is a reason that experimental work on variation needs to proceed with caution. It is easy to lose sight of the fact that a laboratory on a college campus is itself a social setting, one that for most people is far removed from everyday life. Research that attempts to investigate socially meaningful phonological variation in the lab, but does not take into account the social properties of the experimental context itself, runs the dual risk of not only drawing scientifically unwarranted conclusions but also propping up the marginalization of "nonstandard" varieties.
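The spirit of such socially informed processing models can be conveyed with a minimal Bayesian sketch loosely inspired by the ideal adapter of Kleinschmidt and Jaeger (2015): the same acoustic cue is categorized against talker-specific cue distributions rather than against a single listener-internal default. The Gaussian assumption, the VOT values, and the talker profiles below are invented for illustration and are not taken from that work.

```python
import math

def gaussian(x, mu, sd):
    """Gaussian likelihood of a single cue value."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def p_voiced(cue, talker, prior=0.5):
    """Posterior probability of /b/ given one VOT value, under talker-specific
    (mean, sd) cue distributions -- a toy, ideal-adapter-flavored computation."""
    like_b = gaussian(cue, *talker["b"])
    like_p = gaussian(cue, *talker["p"])
    return prior * like_b / (prior * like_b + (1 - prior) * like_p)

# Invented VOT (ms) distributions for two hypothetical talkers:
talker_generic = {"b": (0, 10), "p": (50, 15)}     # listener's default expectations
talker_long_vot = {"b": (20, 10), "p": (80, 15)}   # a talker who produces long VOTs overall

cue = 35.0                                          # an ambiguous token
print(round(p_voiced(cue, talker_generic), 2))      # ~0.01: heard as /p/ under default expectations
print(round(p_voiced(cue, talker_long_vot), 2))     # ~0.98: heard as /b/ once adapted to this talker
```

The same token flips category once the listener conditions on who is talking, which is the computational core of the claim that social and talker information is not mere noise but part of the inference.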
3.4.3 Toward new advances in modeling the mental lexicon The issue of phonological variation arises throughout this volume. Magnuson and Crinnion (this volume) point to the many sources of talker variation and a wide range of phonological processes producing deviations from canonical form as major challenges for current models of spoken word recognition. Creel (this volume) highlights the difficult challenge that pervasive variability poses to word learners. Kilbourn-Ceron and Goldrick (this volume) end their survey of word production by noting that we know very little about how words are produced in sentential contexts (as, indeed, words nearly always are). It appears that phonological variation poses one of the major obstacles preventing word recognition and production models from being able to cope not only with diverse talkers and social contexts but also with connected speech at all. Improving our understanding of phonological variation and its relationship to the mental lexicon thus promises to facilitate the modeling of word recognition in connected speech input and word production in context. The problems at hand are far from simple, but we believe that turning toward a view of lexical access as an inherently socially situated process offers the promise of bringing models of the mental lexicon and its use into a more detailed alignment with what human language users know and do.
Chapter 4
Neural encoding of speech and word forms
David Poeppel and Yue Sun
4.1 Introduction The encoding of speech signals and the mapping of speech to words are addressed in this chapter in the context of two parallel research agendas: the processing of speech as the initial step in spoken language comprehension, that is, the traditional domain of speech perception, and the neural implementation of the computations that underpin the encoding of speech and lexical access. The first question has been investigated extensively in cognitive science research, specifically in experimental psychology and psycholinguistics, focusing on how listeners retrieve linguistic information from the physical speech signal. The second question has been examined in cognitive and computational neuroscience, through studies of the anatomical and functional infrastructure of the human brain that supports the processing of speech. This chapter reviews the literature from both research fields and provides a perspective on how to achieve explanatory connections between the two lines of research.
4.2 Computational-representational theories of speech perception “Speech perception” refers to a set of operations that transform an auditory signal into representations (phonetic, phonological) of a form that can make contact with linguistic information stored in a listener’s mental lexicon (morphemes, roots, words). If we take the end goal (or, in David Marr’s terms, the “computational goal”) of speech perception to be the mapping of sounds to stored representations like words (whether in isolation— “virus”—or in context—“this virus stinks”), then the study of the encoding and decoding of speech needs to make reference to how stored information is represented. For now, we
do not have compelling evidence on exactly how words or morphemes (or any other types of knowledge, for that matter) are stored in memory. But while the details of how linguistic information is stored must remain speculative, one must bear in mind that the mapping of speech to words ultimately requires a commitment to the representational format of the input signals and the pieces of stored linguistic knowledge. Two major concepts form the basis for much of cognitive science: representation and computation. Embick and Poeppel (2015), referring to computational-representational (CR) theories in language, suggest that existing theories typically make claims about the following questions: first, what representational format do different types of linguistic information take in a speech signal, and how is linguistic information represented in the mental lexicon? Second, what are the computations that are needed to transform acoustic events in a speech signal into phonological representations that subsequently interact with other linguistic dimensions (morphological, syntactic, etc.)? Figure 4.1 summarizes the conceptual architecture of the problem.
Figure 4.1 Representations, transformations, computations: from the physical speech signal to phonological and lexical representations. Solid arrows: logically required feedforward steps. Dotted arrows: hypothesized feedback mappings. (a) The peripheral auditory system encodes continuously varying acoustic waveforms (left side: x-axis, time; y-axis, amplitude) decomposing the input signal into different bands (right panel: cochleagram) and conveying spectro-temporal information to auditory cortex via the afferent auditory pathway. (b) A series of intermediate representations may be necessary to map from a spectro-temporal representation of the acoustic input signal to the abstract representation of the word. Accordingly, multiple types of computations are required to assure transformations between different intermediate representations. (c) The hypothesized representation of the word ‘cat’ (/kæt/) in the mental lexicon of a speaker/listener. Each of the three segments of this consonant-vowel-consonant word is built from distinctive features that, as a bundle, are definitional of phonemes that comprise the lexical item.
Neural encoding of speech and word forms 55 Unsurprisingly, the concepts of representation and computation are inherently linked, as theoretical commitment to the format of the mental representations should align with the nature of the relevant computations. To be concrete, two opposing views exist with regard to the format of lexical representation. On the one hand, many linguistically motivated theories argue that words are represented in the mental lexicon in terms of sequences of discrete, abstract segments or phonemes (Dresher, 2011; Jones, 1967), which can be further subdivided and described as bundles of distinctive features (Jakobson, Fant, and Halle, 1951; Stevens, 2002). On this view, the mapping from speech to words involves computations that translate continuous acoustic input into discretized segments, with each assigned a specific label within the phonemic/featural space. On the other hand, psychologically inspired theories propose that words are stored in the mental lexicon as auditory “episodes” that contain fine phonetic details and non-linguistic indexical information, such as voice of the speaker (Goldinger, 1998; Johnson, 1997; Port, 2007, 2010). On that view, speech perception requires accessing acoustically detailed exemplars of words, with little or no need for mapping onto abstract phonological representations. We first provide a short perspective on the evidence from psycholinguistic research regarding this debate. We then present findings from neurobiological research and explain how these findings complement those from psycholinguistics and help unify features from both approaches to reach a more comprehensive understanding of speech perception.
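The two positions can be caricatured computationally. In the toy sketch below (our own construction, not an implementation of any cited model), an incoming token is categorized either against a single abstract prototype per category or against a store of detailed exemplars weighted by similarity; real abstractionist and episodic models are, of course, far richer, but the contrast in what must be stored is the point.

```python
import math

def dist(x, y):
    return math.dist(x, y)          # Euclidean distance in an invented 2-D cue space

def classify_abstract(token, prototypes):
    """Abstractionist caricature: one summary representation per category."""
    return min(prototypes, key=lambda cat: dist(token, prototypes[cat]))

def classify_exemplar(token, exemplars, c=1.0):
    """Episodic caricature: every stored token votes, weighted by exp(-c * distance)."""
    score = {cat: sum(math.exp(-c * dist(token, e)) for e in exs)
             for cat, exs in exemplars.items()}
    return max(score, key=score.get)

prototypes = {"i": (2.5, 2.0), "e": (5.0, 2.0)}             # invented category means
exemplars = {"i": [(2.0, 1.8), (2.6, 2.3), (3.1, 2.0)],     # invented detailed tokens
             "e": [(4.8, 1.9), (5.2, 2.1), (4.5, 2.2)]}

token = (3.6, 2.0)
print(classify_abstract(token, prototypes))   # 'i': closer to the /i/ prototype
print(classify_exemplar(token, exemplars))    # 'i' here too, but driven by the whole exemplar cloud
```

On the abstractionist caricature the lexicon stores only the prototypes; on the episodic caricature it stores every token, phonetic detail and all, and the categorization decision tracks the density of the exemplar cloud rather than a single summary.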
4.3 Behavioral foundations for competing cr theories 4.3.1 Behavioral evidence of phonological abstraction Phonological abstraction is a central research question in the perception of individual speech sounds. In such studies, listeners are typically presented with isolated speech sounds such as vowels or syllables and execute behavioral tasks (e.g., identification, discrimination). This line of research has convincingly demonstrated the existence of abstract phonological categories as part of listeners’ linguistic repertoire that drives perceptual processes. A major, well-established finding is the categorical nature of speech sound perception: listeners are perceptually more sensitive to acoustic differences between two speech sounds that belong to two distinct phoneme categories than to the same amount of acoustic difference between two speech sounds from the same phoneme category (e.g., Liberman, Harris, Hoffman, and Griffith, 1957). This phenomenon, referred to as categorical perception, reflects the tuning of listeners’ perceptual system to perceive acoustic differences that contrast different phonemes, which consequently allows the assignment of physically varying sounds to a single, functionally equivalent phonological representation. Categorical perception supports the relevance of abstraction in the mapping from a continuous spectro-temporal space to a discrete phonological space (i.e., phonemes
56 David Poeppel and Yue Sun and distinctive features) that underwrites lexical representations. Extensive research shows that listeners’ perceptual systems are tuned to the set of phonemes (and distinctive features) used in their native language, such that the perception of phonemes and phonetic features that do not exist in their native language can be compromised (Strange and Shafer, 2008). In fact, it is commonly observed that listeners have difficulty perceiving non-native phoneme contrasts (Best, 1994; Best, McRoberts, and Goodell, 2001). A well-known example consists of Japanese listeners’ persistent difficulty to distinguish between the English liquid consonants /r/and /l/(Goto, 1971; Miyawaki, Jenkins, Strange et al., 1975). Specifically, Japanese listeners map tokens from both English consonants onto the same consonant category in Japanese (/r/or / w/: Best and Strange, 1992; Takagi, 1995; Yamada and Tohkura, 1992) while becoming insensitive to the acoustic difference between them (Iverson, Kuhl, Akahane-Yamada et al., 2003). Such perceptual difficulties for non-native phonemic contrasts, widely observed for both consonant and vowel contrasts across numerous languages (Strange and Shafer, 2008), demonstrate the impact of native phonemic categories in shaping listeners’ perceptual space. The data highlight the efficiency with which listeners’ speech perception system filters out acoustic variability that is not informative for distinguishing native phonemic categories, such that it becomes problematic to perceive and learn new phonemes from non-native languages. Note that this kind of systemic optimization applies to the perception of individual phonemes and to the level of distinctive phonetic features (Brown, 2000; Hancin-Bhatt, 1994; McAllister, Flege, and Piske, 2002). For instance, McAllister and colleagues (2002) observed that learning non-native Swedish long versus short vowel distinctions is difficult for native English and Spanish speakers, whose native languages do not employ “length” as a distinctive feature for vowels, while it is easier for Estonian speakers, who do make use of the length feature to distinguish between Estonian vowels. That is, although none of the participants from the three linguistic backgrounds were previously exposed to the specific Swedish phonemic contrast, Estonian speakers generalized their perceptual sensitivity to vowel length to distinguish between non-native vowel pairs with the same featural difference. Decades of behavioral research on speech perception provide convincing evidence for the existence of abstract phonological representations in the mental lexicon (Figure 4.1c) as well as for the optimization of listeners’ perceptual systems to robustly map acoustic signals onto phonemic and featural categories used in the construction of the mental lexicon of their native language. This line of empirical evidence supports a correspondence between linguistic theories that describe the structure of language and the set of functional operations that putatively underlie the processing of speech. 
Findings from such empirical work may not be decisive in showing the necessity for any particular abstract phonological unit to be the foundational primitive representational unit to form words in the mental lexicon (see Kazanina, Bowers, and Idsardi, 2017, for a review of various phonological units that have been proposed for that role and see Samuel, 2020, for potential shortcomings of psycholinguistic work for the purpose of validating linguistic theories). However, the experimental evidence unequivocally demonstrates
that phonological abstraction at multiple levels is part of the functional architecture that maps the speech signal onto internally stored lexical representations. It is worth noting that the functional necessity of a phonological unit in encoding lexical forms can also be argued for from a theoretical perspective. For instance, using a modeling approach, one can show that the use of a certain phonological unit presents compelling computational advantages in the organization of word-forms in the mental lexicon, ensuring robust mapping between word-forms and the acoustic signal as well as with articulatory gestures (see Nozari, this volume, for a similar argument in the case of speech production).
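The categorical perception pattern reviewed in this section is often summarized as a steep identification function paired with a discrimination peak at the category boundary. The sketch below illustrates that textbook relationship with an invented logistic identification function and a crude labeling-based discrimination prediction; it is a didactic toy, not a fit to any dataset.

```python
import math

def p_identify_pa(step, boundary=4.0, slope=2.0):
    """Toy logistic identification function along a /ba/-/pa/ continuum (steps 1-7)."""
    return 1.0 / (1.0 + math.exp(-slope * (step - boundary)))

def predicted_discrimination(step_a, step_b):
    """Crude labeling-based prediction: discrimination tracks label differences only
    (0.5 = chance, 1.0 = perfect)."""
    return 0.5 + 0.5 * abs(p_identify_pa(step_a) - p_identify_pa(step_b))

for step in range(1, 7):
    ident = p_identify_pa(step)
    disc = predicted_discrimination(step, step + 1)
    print(f"step {step}: P(/pa/) = {ident:.2f}   predicted discrimination of {step} vs {step + 1} = {disc:.2f}")
# The acoustic difference between adjacent steps is constant, yet predicted discrimination
# peaks near the boundary (steps 3-5): the classic categorical signature.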
4.3.2 Behavioral evidence for episodic encoding in speech perception In contrast to phonologically-centered theories, episodic theories posit that detailed acoustic variations of a phoneme (a category that still exists) are functionally informative for the recognition of words and are therefore retained in the sub-lexical and lexical representations (Goldinger, 1998; Johnson, 1997; Port, 2007, 2010). These variations mainly include fine phonetic details, which result from specific articulatory realizations of a phoneme in various phonological environments, and indexical information, which includes acoustic cues for the identity and emotional state of a speaker, speech rate, as well as other prosodic properties of speech. Building on initial proposals of episodic theories, which assumed that the mental lexicon is solely composed of acoustically rich exemplars of different words (Goldinger, 1998; Port, 2007), modern revisions acknowledge the existence and functional relevance of abstract phonological representations in the lexicon while they maintain that acoustic details are encoded as part of the lexicon (Pierrehumbert, 2016). Here we highlight some evidence that supports the inclusion of phonetic variation and indexical information in the mental lexicon, alongside abstract phonological units. The maintenance of fine-grained phonetic details of speech sounds during speech perception and mapping to words is mainly supported by experimental findings that challenged the notion of categorical perception (e.g., Liberman et al., 1957). Specifically, it has been argued that the appearance of categorical perception is largely due to the binary nature of response alternatives in identification and discrimination tasks (Lotto and Holt, 2000; Massaro and Cohen, 1983). Indeed, Massaro and Cohen (1983) found that when participants were asked to rate the degree to which each consonant instance from a /ba/-/pa/continuum resembled one or the other category, their ratings appeared to be continuous rather than categorical. Subsequent studies confirmed listeners’ ability to judge the goodness of acoustic variants of the same phonemic category (e.g., Iverson and Kuhl, 1995; Kuhl, 1991). Moreover, recent studies using eye-tracking have revealed that the degree of perceptual goodness of a speech sound instance affects the process of word identification (McMurray, Tanenhaus, and Aslin, 2002; McMurray, Aslin, Tanenhaus, Spivey, and Subik, 2008). Such findings contradict the view that fine-grained
58 David Poeppel and Yue Sun acoustic details of speech are discarded during perception and highlight the impact of task demands on the behavioral outcome of underlying perceptual phenomena. Beyond the retention of within-category acoustic-phonetic variation, the impact of indexical information on the recognition of spoken words is more commonly taken as evidence for episodic theories. Studies in this domain often employ some form of long-term priming paradigm that includes two experimental phases. During a first “exposure” phase, participants listen to words that present some specific acoustic property, such as speaker identity (Goldinger, 1996), gender (Schacter and Church, 1992), emotional state (Church and Schacter, 1994), or speech rate (Bradlow, Nygaard, and Pisoni, 1999). During the second “test” phase, participants are presented with a mixed list of previously presented and new words and are asked to conduct either an explicit word recall task (Bradlow et al., 1999; Goldinger, 1996) or an implicit lexical task, such as word identification (Church and Schacter, 1994; Goldinger, 1996; Schacter and Church, 1992) or lexical decision (Luce and Lyons, 1998). The main finding of these studies is that participants perform better—that is to say, exhibit larger priming effects—when a word is presented with the same indexical property in the exposure and test phases. Such data provide evidence for the impact of episodic acoustic details on the perception of speech sounds and the recognition of word-forms. Fine-grained phonetic variation is shown to be retained during speech perception and impacts word recognition. Speaker-specific indexical details, which intrinsically do not contain linguistic information, have been demonstrated to facilitate lexical processing. These findings challenge a view of phonological and lexical encoding that is solely based on discrete, abstract units such as phonemes and distinctive features. In conclusion, rich findings from behavioral research support the relevance of abstract representations such as distinctive features and phonemes as well as “signal- near” episodic representations carrying speaker- specific information and other contextual cues. We now turn to a more extensive treatment of neurobiological evidence to investigate what those data suggest about the infrastructure supporting speech perception.
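The logic of these exposure-test designs reduces to a simple contrast: the indexical-specificity effect is the difference in test performance between items repeated in the same voice and items repeated in a different voice, over and above the overall repetition benefit. The sketch below computes both quantities from hypothetical trial records; the field names and data are our own and stand in for whatever a given study actually measures.

```python
# Hypothetical test-phase trial records from an exposure-test experiment.
# Field names and data are invented for illustration.
trials = [
    {"word": "ferry",  "repeated": True,  "same_voice": True,  "correct": True},
    {"word": "marble", "repeated": True,  "same_voice": True,  "correct": True},
    {"word": "candle", "repeated": True,  "same_voice": False, "correct": False},
    {"word": "tunnel", "repeated": True,  "same_voice": False, "correct": True},
    {"word": "saddle", "repeated": False, "same_voice": None,  "correct": False},
    {"word": "pigeon", "repeated": False, "same_voice": None,  "correct": True},
]

def accuracy(rows):
    return sum(r["correct"] for r in rows) / len(rows) if rows else float("nan")

old = [t for t in trials if t["repeated"]]
new = [t for t in trials if not t["repeated"]]

repetition_priming = accuracy(old) - accuracy(new)                 # overall benefit of prior exposure
voice_specificity = (accuracy([t for t in old if t["same_voice"]])
                     - accuracy([t for t in old if not t["same_voice"]]))

print(f"repetition priming effect: {repetition_priming:.2f}")
print(f"same-voice advantage:      {voice_specificity:.2f}")       # > 0 is the indexical-specificity signature
```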
4.4 Neurobiological findings and their contribution to theory development 4.4.1. Early neurophysiological studies One influential type of neural data derives from event-related potential (ERP) approaches. These studies measure either electric or magnetic neural activity using non-invasive neurophysiological recording techniques such as electroencephalography (EEG) or magnetoencephalography (MEG). ERP (or, in MEG, event-related field ERF)
Neural encoding of speech and word forms 59 approaches are ideal to study aspects of perceptual processes that either cannot be revealed by behavioral measurements or are even altered by the execution of behavioral task demands. One of the most frequently used approaches to study speech is the mismatch negativity (MMN) design. The MMN, and its magnetic counterpart (i.e., MMNm), is a component of the auditory evoked response that reflects the brain’s detection of a change. It is elicited in an oddball paradigm, wherein a repetitively presented “standard” stimulus is occasionally and unexpectedly replaced by a “deviant” stimulus, which typically differs from the standard stimulus in certain acoustic aspects, such as frequency (Näätänen, Gaillard, and Mäntysalo et al., 1978; Sams, Paavilainen, Alho, and Näätänen, 1985), intensity (Näätänen, Paavilainen, Alho, Reinikainen, and Sams, 1987), or duration (Näätänen, Paavilainen, and Reinikainen, 1989). The deviance results in a more negative-going ERP compared to the standard stimulus (Näätänen, 1992). Importantly, the elicitation of the MMN does not require any behavioral task that focuses listeners’ attention on the stimuli (e.g., Näätänen et al., 1978; Sams et al., 1985). The pre-attentive nature of the MMN allows one to examine properties of perceptual processes while minimizing the specific effect (or bias) of a behavioral task. (We mentioned above that categorical perception was criticized due to the binary nature of response alternatives that are given in identification and discrimination tasks, which could create an appearance of perceptual discretization in participants’ behavioral outcome.) The MMN has two properties that make it unique to address questions in speech. First, the magnitude of the MMN depends on listeners’ ability to perceive the difference between the standard and deviant stimuli (Lang, Nyrke, Ek et al., 1990; Näätänen, Schröger, Karakas, Tervaniemi, and Paavilainen, 1993). Second, the MMN does not merely reflect the detection of physical acoustic differences between standard and deviant stimuli but can also be elicited by the violation of any statistical regularities that can be established in the standard sequence (Paavilainen, Simola, Jaramillo, Näätänen, and Winkler, 2001; Saarinen, Paavilainen, Schöger, Tervaniemi, and Näätänen, 1992; Tervaniemi, Maury, and Näätänen, 1994). Various MMN/m data address the debate between abstractionist and episodic theories of speech perception. First, in what way are listeners’ neural responses to acoustic differences between speech sounds modulated by abstract phonological representations? Second, what are the neural responses to manipulations of indexical information associated with speech? To address the first question, many studies have adopted the stimulus design from classic behavioral investigations of categorical perception. Two stimuli from a synthesized continuum between two phonemic categories are presented in an oddball paradigm. The two stimuli can belong either to the same phonemic category (and hence elicit within-category acoustic differences) or to different categories (and elicit across-category differences). These studies have provided a range of important findings. Some studies report that only across-category differences elicit an MMN, while within-category changes resulted in no apparent MMN (Maiste, Wiens, Hunt, Scherg, and Picton, 1995; Phillips, Pellathy, Marantz et al., 2000; Sharma and Dorman, 1999).
60 David Poeppel and Yue Sun In contrast, other studies find that the MMN can also be elicited by acoustic differences within the same phoneme category and that the MMN properties reflect the magnitude of the acoustic deviance (Aaltonen, Niemi, Nyrke, and Tuhkanen, 1987; Aaltonen, Eerola, Lang, Uusipaikka, and Tuomainen, 1994; Dalebout and Stack, 1999; Lago, Scharinger, Kronrod, and Idsardi, 2015; Sams, Aulanko, Aaltonen, and Näätänen, 1990; Sharma, Kraus, Mcgee, Carrell, and Nicol, 1993). Interestingly, while neural sensitivity to within-category variability is observed for a wide range of phoneme types, that is, vowels (Aaltonen et al., 1987, 1994), plosive consonants (Dalebout and Stack, 1999; Sams et al., 1990; Sharma et al., 1993), and fricative consonants (Lago et al., 2015), strong categorical outcomes in the MMN response usually occur with plosive consonants (Maiste et al., 1995; Phillips et al., 2000; Sharma and Dorman, 1999). The most convincing evidence for the impact of phonemic categories on the MMN response comes from studies using cross-linguistic designs (Bomba, Choly, and Pang, 2011; Dehaene-Lambertz, 1997; Kazanina, Phillips, and Idsardi, 2006; Näätänen, Lehtokoski, and Lennes et al., 1997; Sharma and Dorman, 2000; Winkler, Lehtokoski, Alku et al., 1999). For instance, Winkler and colleagues (1999) conducted a study in which Hungarian and Finnish listeners were presented with oddball sequences made of two sets of standard-deviant vowel stimulus pairs. One pair presented an across- category difference in Hungarian, but a within-category difference in Finnish, while the other pair presented a within-category difference in Hungarian but an across- category difference in Finnish. The authors found that while both within-and across- category contrasts elicited an MMN, the amplitude of the MMN was larger for across-than within-category contrasts in listeners from both language groups. This demonstrates that —although the perceptual system preserves sensitivity to acoustic differences within vowel categories, consistent with previous behavioral findings showing more continuous perception of vowel continua (e.g., Aaltonen et al., 1987, 1994) —its response is intensified if acoustic differences contrast vowel categories in a listener’s native language. Moreover, the reversed neural responses to the same sets of acoustic deviance in listeners from two linguistic systems emphasize the perceptual tuning of the listener’s native phonemic system to the perceptual process. The native category effect was shown to be stronger in studies using consonant contrasts, which found little presence or total absence of the MMN for acoustic difference of non- native phonemic contrasts (Bomba et al., 2011; Dehaene-Lambertz, 1997; Kazanina et al., 2006). Regarding the neural responses to manipulations of indexical information, most studies tackled the issue by constructing oddball sequences with speech stimuli that were uttered by multiple speakers (e.g., Dehaene-Lambertz, Dupoux, and Gout, 2000; Shestakova, Brattico, Huotilainen et al., 2002; Sun, Giavazzi, Adda-Decker et al., 2015). For instance, Shestakova and colleagues (2002) provided a clever illustration of this approach: they recorded MEG while participants listened to oddball sequences made of 450 stimuli of three different vowels in Russian. Each of the 150 stimuli from each vowel category was produced by a different male speaker. The study revealed prominent MMNm responses to vowel category change despite the
Neural encoding of speech and word forms 61 presence of wide acoustic variations due to speaker change between every consecutive stimulus. In other words, in order to even generate a standard stimulus category that would license the elicitation of a deviant response, the abstraction over tokens of widely varying acoustic features is required. This finding clearly demonstrates that listeners can form auditory memory traces that represent a single phonemic category that is invariant to acoustic details associated with specific voices. Note that, unlike the arguably reduced sensitivity to within-category variations along an acoustic dimension, the human brain is believed to be highly efficient in recognizing and differentiating human voices (Belin, Fecteau, and Bédard, 2004). It is therefore expected that infrequent changes of the speaker’s voice in an oddball sequence of speech sounds can elicit an MMN, as shown in several studies (e.g., Knösche, Lattner, Maess, Schauer, and Friederici, 2002; Titova and Näätänen, 2001). Since the main behavioral findings supporting episodic theories consist of facilitation effects of specific speech-voice associations on word recognition, searching for neural correlates of such associations becomes an elementary topic for neurophysiological studies. Indeed, Beauchemin and colleagues (2006) found that deviant speech sounds uttered by a familiar voice elicited an MMN with higher amplitude compared to those by an unfamiliar voice. Their finding hints at the existence of acoustic long-term memory traces for speech-voice associations. In summary, ERP research confirms that listeners’ perceptual system is highly tuned (i) to perceive acoustic properties that contrast phonemic categories in their native languages while (ii) still preserving some sensitivity to fine-grained acoustic details of speech sounds. So, on the one hand, the majority of studies provide neurophysiological indications for an automatic perceptual mapping of speech onto abstract phonological representations, which is achieved without explicit behavioral demands and in the presence of large amount of phonetic and indexical variation. These data are consistent with the existence of abstract phonological units as an intermediate representation between a spectro-temporal representation of the acoustic signal and an abstract representation of lexical entities (Figure 4.1). On the other hand, neurophysiological responses to fine-grained acoustic details of speech are revealed as well. The magnitude of these responses tends to vary across listeners and as a function of phoneme types. These responses reflect neural processing of low-level acoustic features in the speech signal, corresponding to the transformation of the physical input acoustic waveform into a neurally encoded spectro-temporal representation (Figure 4.1, top). Such a process presumably takes place before the mapping to an abstract phonemic/featural representation (Figure 4.1, bottom). The ERP/ERF neurophysiological approaches have provided psycholinguistic and neurobiological researchers with tools to study the cortical processing of speech without the need to engage participants in behavioral tasks, which has turned out to be useful in refining aspects of cognitive models. However, due to the limited spatial resolution of EEG (and even MEG) signals, these techniques can only provide limited insights into the functional-anatomic organization of speech processing in the human brain. We turn next to these studies.
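For readers less familiar with the paradigm, the computational core of an oddball analysis is small: build a stimulus sequence in which rare deviants interrupt a repeated standard, average the responses by stimulus role, and subtract to obtain the difference wave. The sketch below does exactly that on simulated single-trial responses; the sequence constraints, time window, and noise model are invented for illustration only.

```python
import random
random.seed(0)

def oddball_sequence(n_trials=200, p_deviant=0.15):
    """Roles for each trial; deviants never occur back to back (a typical constraint)."""
    seq, prev = [], "standard"
    for _ in range(n_trials):
        role = "deviant" if (prev == "standard" and random.random() < p_deviant) else "standard"
        seq.append(role)
        prev = role
    return seq

def simulate_epoch(role, n_samples=60):
    """Toy single-trial response (1 sample = 10 ms); deviants get an extra
    negativity in samples 15-25, i.e. roughly 150-250 ms post-onset."""
    epoch = [random.gauss(0.0, 0.5) for _ in range(n_samples)]
    if role == "deviant":
        for i in range(15, 25):
            epoch[i] -= 1.0
    return epoch

def grand_average(epochs):
    return [sum(col) / len(col) for col in zip(*epochs)]

seq = oddball_sequence()
standard_avg = grand_average([simulate_epoch(r) for r in seq if r == "standard"])
deviant_avg = grand_average([simulate_epoch(r) for r in seq if r == "deviant"])
mmn = [d - s for d, s in zip(deviant_avg, standard_avg)]   # difference wave
print(mmn.index(min(mmn)))   # index of the most negative sample; falls in the 15-25 window here
```

The interesting experimental work, of course, lies in what counts as "standard" and "deviant" (acoustic tokens, phonemic categories, talker voices), not in the subtraction itself.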
4.4.2 The functional-anatomic architecture of speech processing in the human cortex Two major approaches have dominated this line of research: neuropsychological studies testing (causal) relations between brain areas and cognitive deficits; and functional neuroimaging experiments establishing (correlational) relations between the activation of brain regions and cognitive functions. Both approaches have motivated the establishment of large-scale models of the cortical basis of speech perception (Binder, Frost, Hammeke et al., 2000; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009). Despite differences in important details, these models share essential characteristics regarding the functional-anatomic foundations for speech processing (adopted and adapted from research on vision and auditory perception). There are three broad principles: localization of function, suggesting the spatial segregation of neural assemblies that perform different operations; hierarchical processing, which indicates that computations carried out by cell assemblies further along a processing stream reflect more integrated and abstract properties; and concurrent pathways, according to which analyses of different aspects of the signal occur in parallel. Figure 4.2 illustrates the dual stream model proposed by Hickok and Poeppel (2007), in which ventral stream structures underpin sound-to-meaning mapping (a)
[Figure 4.2 components: dorsal stream — spectrotemporal analysis (dorsal STG, bilateral), sensorimotor interface (parietal–temporal Spt, left dominant), articulatory network (pIFG, PM, anterior insula; left dominant), via higher-order frontal networks; ventral stream — phonological network (mid-post STS, bilateral), lexical interface (pMTG, pITS; weak left-hemisphere bias), combinatorial network (aMTG, aITS; left dominant?), conceptual network (widely distributed); input from other sensory modalities.]
Figure 4.2 The dual-stream model. (a) “Functional boxology” of the dual-stream model. From the analysis of the input, the system diverges into two streams, a dorsal pathway that maps onto articulatory representations, and a ventral pathway that maps onto lexical and/or conceptual representations. (b) Schematic of anatomical locations of the dual-stream model components, corresponding to functional characterization in (a).
and dorsal stream structures facilitate the mapping from sound input to articulatory representations (among other computations).
4.4.2.1 Hierarchical processing within auditory cortices: from acoustic to phonological representations Large-scale cortical models of speech perception posit that the earliest stage of speech processing is carried out in primary auditory cortices bilaterally (Heschl’s gyrus, or A1, sometimes also called core auditory cortex; Hackett, 2011). Neurons in Heschl’s gyrus (HG) exhibit selective tuning to both spectral and temporal modulations in the acoustic signal (e.g., Schönwiesner and Zatorre, 2009). The sensitivity to combined spectro- temporal modulations is an essential component of the analysis of complex acoustic patterns that constitute phonetic features, such as “onset asynchrony” or “fast transition between different frequency bands,” which approximately corresponds to voice- onset-time (VOT) and place-of-articulation (POA), respectively. The spectro-temporal analyses in primary auditory cortex yield representations that are inputs to the closest downstream regions—superior temporal gyrus (STG) and superior temporal sulcus (STS), which have been demonstrated to be involved in processing phonological representations (Yi, Leonard, and Chang, 2019). Many functional neuroimaging studies have reported stronger neural activation in STG/STS to speech stimuli compared to non- speech stimuli with similar spectro-temporal complexity (e.g., Davis and Johnsrude, 2003; Obleser, Eisner, and Kotz, 2008; Overath, McDermott, Zarate, and Poeppel, 2015; Scott, Blank, Rosen, and Wise, 2000). Recent electrophysiological studies show that the selective activation of STG neurons for linguistic aspects of the speech input reflects encoding of phonologically relevant features via specific spectro-temporal tuning (e.g., Chan, Dykstra, Jayaram et al., 2013; Mesgarani, Cheung, Johnson, and Chang, 2014). For instance, Mesgarani and colleagues (2014) measured ECoG signals of STG neurons while participants were presented with naturally spoken English sentences. They found selective brain responses to phonetic features of consonants (strongest for the manner of articulation) and vowels (e.g., height and backness). Likewise, using single-and multi- unit recordings, Chan and colleagues (2013) reported that many neurons in STG are tuned to specific sets of phonemes that shared particular phonetic features. Interestingly, they discovered that some neurons in this region exhibit similarly selective responses to the same phoneme sets when they were presented visually -supporting processing at the phonological rather than the acoustic level. Another approach to demonstrate the response sensitivity of STG neurons to phonological representations instead of “merely” acoustic ones consists of using categorical perception. For instance, using fMRI, Joanisse and colleagues (2007) examined neural adaptation effects for within-and between-category variation along a speech continuum between /da/and /ga/. They observed greater activation for between-category than within-category variation in left superior temporal sulcus (STS) and middle temporal gyrus (MTG) and inferior parietal cortex (IPC). Similar responses to phonological categories were also found in ECoG-recorded high gamma activity from neurons in posterior STG (Chang, Rieger, Johnson et al., 2010). In a recent study, Gwilliams,
64 David Poeppel and Yue Sun Linzen, Poeppel, and Marantz (2018) recorded MEG while participants listened to meaningless syllables as well as words of which the onset consonant presented various levels of ambiguity between two phoneme categories (e.g., barakeet —parakeet versus barracuda —parracuda). They showed that while response sensitivity to fine-grained acoustic details of speech sounds was primarily observed in HG, only STG was sensitive to discrete values of the corresponding phonetic features of the consonant (i.e., voiceless or voiced for VOT; labial, labiodental or velar for POA). Findings of the specialized encoding of acoustic and phonological representations respectively by HG and STG/ STS neurons demonstrated a hierarchical cascade in the processing hierarchy between these two regions. Figure 4.3 schematizes the model advanced by Gwilliams et al. (2018) to capture both the regional, hierarchical, and computational aspects. The establishment of a cortical region along the auditory pathway that is sensitive and specific to encoding discrete phonemic/featural representations supports the hypothesis that abstract phonological representations are necessary for linking acoustic representations in the speech signal to different types of internally stored linguistic knowledge. This hypothesis is further supported by the functional organization of the downstream connections of STG/STS regions within the cortical network for speech. The large-scale cortical models agree on the existence of two processing
Figure 4.3 Processing in hierarchical stages in the ventral pathway. Schematic of model architecture (left), broad functional characterization (middle), and anatomic description (right). Stage 1, carried out in Heschl’s gyrus (HG) bilaterally: (i) analysis of spectro-temporal information from feedforward auditory pathways; and (ii) recurrent activation of sub-phonemic information corresponding to each speech sound unit. Stage 2, in STG/STS bilaterally: mapping of sub-phonemic representations onto phonological representations, such as distinctive features and phonemes. Stage 3, involving mainly left MTG: activation of lexical candidates based on the temporal sequence of phonological units. Recurrent activation of acoustic representations in HG alongside phonological mapping and lexical retrieval, facilitating concurrent access to sub- phonemic and phonological-lexical information.
streams (Gow, 2012; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009): a ventral pathway that maps auditory/phonological representations to lexical/conceptual representations (Figure 4.2) and a dorsal pathway that transforms auditory/phonological representations into articulatory/motor representations (Figure 4.2). While broad consensus has been reached with regard to the nature of linguistic representations that are encoded in the two pathways (i.e., largely lexical-semantic-related for the ventral pathway and sensorimotor-related for the dorsal pathway), there is debate around several issues, including the involvement of the two hemispheres in processing different types of auditory and linguistic representations, the relative functions of anterior and posterior parts of the ventral stream, and the existence of a dorsal lexicon for speech production.
4.4.2.2 Parallel processing and computational asymmetry between the hemispheres 4.4.2.2.1 Involvement of the two hemispheres in speech processing An unresolved question concerns the functional involvement of the two hemispheres (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009). It is now widely assumed that the initial acoustic and phonological analyses of speech involve both hemispheres. One uncontroversial observation arising from functional neuroimaging is that listening to speech activates HG and STG/STS bilaterally (Hickok and Poeppel, 2000, 2004, 2007). Bilateral activation has been found when the neural responses to speech stimuli were contrasted with a resting baseline (e.g., Binder et al., 2000; Poeppel, Guillemin, Thompson et al., 2004) and with various non- speech control stimuli (Agnew, McGettigan, and Scott, 2011; Diehl and Kluender, 1989; Liberman and Mattingly, 1989; Overath et al., 2015; Vouloumanos, Kiehl, Werker, and Liddle, 2001). A recent study using intracranial recordings reported neural responses to perceptual invariance of vowels in parabelt auditory cortex (STG) from both hemispheres (Sjerps, Fox, Johnson, and Chang, 2019). Solely left-lateralized activation patterns were reported in some studies (Heinrich, Carlyon, Davis, and Johnsrude, 2008; Liebenthal, Binder, Spitzer, Possing, and Medler, 2005). A closer look at these studies reveals an interesting pattern: a left-lateralized activation pattern is more often observed when participants performed metalinguistic tasks involving explicit retrieval of phonological categories as well as working memory (e.g., Poeppel, Yellin, Phillips et al., 1996). The task demands, as we discuss later, involve the speech articulation network along the dorsal pathway, which is more strongly left-lateralized. We argue that these functions, although being part of the cognitive apparatus for language processing, go beyond the computations that are essential for acoustic, phonetic, and phonological analyses of speech sounds. Following the initial acoustic analyses, mapping of auditory/ phonological representations to other types of linguistic attributes shows a more diverse profile regarding the involvement of the two hemispheres. The mapping from sound to meaning, which is carried out by the ventral pathway, is bilaterally organized (leaving open the question of how bilaterally lexical information is stored). Clinical research
66 David Poeppel and Yue Sun shows that patients with unilateral damage to either hemisphere (Poeppel, 2001), split- brain patients (Zaidel, 1985), and individuals undergoing Wada procedures (selective anesthesia of one cerebral hemisphere) (Hickok, Okada, Barr et al., 2008; McGlone, 1984) are generally able to perform sufficiently well in tasks that require access to the mental lexicon. These findings are accompanied by functional neuroimaging data showing bi- hemispheric contribution in sound- to- meaning mapping when listening to isolated words (e.g., Bozic, Tyler, Ives, Randall, and Marslen-Wilson, 2010) as well as connected sentences (Friederici, Meyer, and von Cramon, 2000; Humphries, Willard, Buchsbaum, and Hickok, 2001; Humphries, Love, Swinney, and Hickok, 2005; Vandenberghe, Nobre, and Price, 2002). In contrast to the ventral stream, the sound-to-motor mapping (Figure 4.2) has been argued to be left-dominant (Hickok and Poeppel, 2007). (But see Cogan, Thesen, Carlson et al., 2014 for evidence of bilateral sensorimotor transformations in speech.) Neuropsychological and neuroimaging data show crucial involvement of left hemisphere regions, such as parietal-temporal junction (Spt) and posterior frontal regions in speech production (Buchsbaum, Hickok, and Humphries, 2001; Hickok, Buchsbaum, Humphries, and Muftuler, 2003) as well as in speech articulation related functions including verbal working memory (Buchsbaum, Olsen, Koch, and Berman, 2005; Hickok et al., 2003). Beyond lexical access and sensorimotor mapping, speech comprehension (in particular for the case of connected speech) also requires higher order morphosyntactic and compositional semantic analyses. Existing research has established that these analyses involve peri-Sylvian regions linking inferior frontal cortex with temporal and parietal cortices that typically show left-dominance.
4.4.2.2.2 Computational asymmetry of auditory cortices from the two hemispheres Although sub- lexical acoustic/ phonetic/ phonological analyses are bilaterally organized, the hemispheres present important computational asymmetries. Neuropsychological and neuroimaging studies have stimulated several theoretical frameworks that aim to describe the nature of this asymmetry (Obleser et al., 2008; Poeppel, 2003; Zatorre, Bouffard, Ahad, and Belin, 2002). One model posits that the left auditory cortex is more sensitive to temporal cues in speech whereas its right hemisphere homolog is more sensitive to spectral cues (Obleser et al., 2008; Zatorre et al., 2002). A related and complementary framework characterizes the computational hemispheric differences in terms of their preferred “sampling rate,” or the size of temporal integration windows of cell assemblies (Poeppel, 2003). According to this view, the left auditory cortex preferentially integrates signals over shorter timescales (20–80 ms)—and hence has a fast sampling rate (25–50 Hz)—while right hemisphere neural ensembles integrate over longer time windows (150–300 ms) and therefore have a slower sampling rate (4–8 Hz). These proposals are not mutually exclusive, as integration over longer temporal windows enhances the resolution of spectral analysis. Building on these models, recent work examined the computational sensitivity of the two hemispheres within spectro-temporal modulation space (Flinker, Doyle,
Neural encoding of speech and word forms 67 Mehta, Devinsky, and Poeppel, 2019). Based on MEG and ECoG recordings, and by manipulating the modulations in temporal and spectral domains, stimuli were created that resulted in various levels of intelligibility (by varying temporal modulations) and voice-pitch ambiguity (by varying spectral modulations). Crucially, results from neurophysiological measures revealed a clear leftward lateralization for the neural response to temporal modulations and a weak rightward lateralization for spectral modulations. A new study also supports this perspective. Albouy, Benjamin, Morillon, and Zatorre (2020), based on fMRI, showed that the left versus right auditory regions differ in the extent to which they encode temporal (more leftward) versus spectral (more rightward) modulation. The left hemisphere is able to integrate over a wider range of temporal modulations and can extract acoustic information from short and longer temporal windows; the right hemisphere preferentially integrates from long temporal windows due to its limited sensitivity to temporal modulation. With regard to spectral integration, the weak right lateralization indicates that the left hemisphere also presents computational sensitivity to the spectral content of the signal. This refined hypothesis accommodates previous findings that showed robust encoding of spectral-related phonological features in the left auditory cortex, such as vowel types (Formisano, De Martino, Bonte, and Goebel, 2008; Mesgarani et al., 2014) and lexical tones (Gandour, Tong, Wong et al., 2004). A study reporting intracranial neural data from a large patient cohort (Giroud, Trébuchon, Schön et al., 2020) lends further support to temporal asymmetry of the auditory cortices. The computational asymmetry of the two hemispheres can be viewed as a precursor for a functional specialization between the two hemispheres in processing phonetic and non-phonetic information. Acoustic features that underlie non-phonetic information in speech usually concern spectral properties manifesting over long timescales (e.g., intonation, voice identity, emotional prosody) and have a tendency to be preferentially processed in the right hemisphere (Zatorre et al., 2002). It is plausible to posit a functional segregation between the left and right hemispheres, building on the bilateral involvement of the auditory cortices. Functional segregation cannot be accounted for by a pure bottom- up (feedforward) model based on computational properties of the auditory cortices in the two hemispheres. A clear illustration of “non-feedforward” factors is provided by studies that manipulate the linguistic status of the same acoustic property, such as pitch contour variation: it is used in tonal languages to contrast words, while it is not used contrastively in non-tonal languages. These studies show that processing lexical tones recruits left- hemisphere structures, including left STG and IFG, in speakers of tonal languages, whereas homologous regions in the right hemisphere are more activated when the same pitch contour variation is processed by non-tonal speakers (Gandour, Wong, Hsieh et al., 1999; Hsieh, Gandour, Wong, and Hutchins, 2001; Klein, Zatorre, Milner, and Zhao, 2001; Wong, Parsons, Martinez, and Diehl, 2004). 
Moreover, after being exposed to lexical tones in Mandarin Chinese, native English listeners who are good tonal learners showed increased activation in the left pSTG, while poor learners showed larger activation in the right STG and IFG (Wong, Perrachione, and Parrish, 2007). This suggests
stronger engagement of the left hemisphere in linking spectral/pitch information to lexical representations, an engagement that involves not only feedforward connections from the auditory cortices but also feedback (top-down) connections from higher-order regions back to the auditory cortices. A top-down (feedback) account is also supported by studies showing the impact of task-specific attention to linguistic versus non-linguistic aspects of the speech signal on the direction of hemispheric lateralization in cortical activations (Von Kriegstein, Eger, Kleinschmidt, and Giraud, 2003).
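The notion of hemisphere-specific temporal integration windows can be illustrated with a toy moving-average computation: integrating the same amplitude envelope over a short (roughly 25 ms) versus a long (roughly 200 ms) window better preserves fast, segment-scale detail in the one case and mainly syllable-scale structure in the other. The sketch below is our own schematic illustration of that general idea, not an implementation of any specific model discussed above; the signal and window lengths are invented.

```python
import math

FS = 1000                          # toy sampling rate: 1 kHz, so 1 sample = 1 ms

def toy_envelope(n_ms=600):
    """Invented amplitude envelope: a slow ~4 Hz (syllable-scale) rhythm plus
    a faster ~30 Hz (segment-scale) ripple, half-wave rectified."""
    return [max(0.0, math.sin(2 * math.pi * 4 * t / FS)
                     + 0.3 * math.sin(2 * math.pi * 30 * t / FS))
            for t in range(n_ms)]

def integrate(signal, window_ms):
    """Moving-average integration over a window of the given length (in ms)."""
    half = window_ms // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

env = toy_envelope()
short_window = integrate(env, 25)    # fast, segment-scale modulation partly survives
long_window = integrate(env, 200)    # mainly the slow, syllable-scale modulation survives
print(round(max(short_window) - min(short_window), 2),
      round(max(long_window) - min(long_window), 2))   # the short-window output fluctuates more
```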
4.4.3 Neural encoding of speech in action 4.4.3.1. Temporal interactions: auditory speech processing at different hierarchical levels What is the fate of the low-level acoustic information after being mapped onto more abstract phonological representations? This question is particularly relevant for processing continuous speech, due to the transient nature of the acoustic input. The auditory system, in this context, faces two challenges: (i) the speed of incoming linguistic input, which reaches on average 10–15 phonemes (or 4–6 syllables) per second (Pellegrino, Coupé, and Marisco, 2011; Studdert- Kennedy, 1986); and (ii) limited memory span for maintaining auditory traces, argued to have a capacity of 50–100 ms (Elliott, 1962; Remez, Ferro, Dubowski et al., 2010). These challenges motivate a view according to which efficient encoding of speech requires the system to largely discard low-level representations after mapping them onto higher-level categories (Christiansen and Chater, 2016). This view is aligned with classic abstractionist theories, which generally grant little linguistic relevance to fine-grained acoustic details of speech sounds for encoding words in the lexicon (Dresher, 2011; McQueen, Cutler, and Norris, 2006). However, as mentioned, psycholinguistic studies demonstrate listeners’ perceptual sensitivity to sub-phonemic acoustic variation (Kuhl, 1991; Massaro and Cohen, 1983). Moreover, such sensitivity offers some functional benefit in speech perception, given that the maintained acoustic details from earlier sounds can interact with linguistic information that arrives later in time (Dahan, 2010; McMurray, Tanenhaus, and Aslin, 2009). The study schematized in Figure 4.3 addressed this debate by examining the spatial organization of acoustic and phonological processing in human auditory cortex and the functional interaction between levels of processing in time (Gwilliams et al., 2018). The authors recorded MEG data while participants listened to spoken words with onset consonants that presented varying levels of ambiguity between phoneme categories. The eventual perceptual ambiguity of the onset consonant would be resolved by phonemes before the end of the word. The results confirmed that HG is sensitive to the detailed acoustic properties of the onset consonant, both in terms of its spectro-temporal characteristics and the resulting levels of ambiguity with respect to the phonemic categories. Crucially, they found that these acoustic details, for both ambiguous and unambiguous sounds, are preserved in HG over rather long timescales and re-evoked at
subsequent phoneme positions. Moreover, the recurrent emergence of the maintained acoustic representations in HG was not dependent on the perceptual commitment of the ambiguous phoneme to a given category, which involves STG. This study is the first to provide neurobiological evidence for temporal maintenance of sub-category acoustic information in primary auditory cortex, which occurs in parallel to higher-level phonological processing in STG. The authors introduced a processing model (Figure 4.3) that allows for dynamic interactions among sub-phonetic, phonetic, phonological, and lexical levels of analysis.
4.4.3.2 Segregation and integration of phonological and indexical information Extensive research confirms the existence of parallel processing streams. Parallel processing has been demonstrated along two anatomical dimensions: a dorsal-ventral dual stream pathway in each hemisphere (Figure 4.2) and a computational asymmetry between the hemispheres for the processing different acoustic attributes in speech. First attempts to extend this computational asymmetry to a hemispheric specialization in processing phonetic and non-phonetic properties was suggested to be less straightforward due to the evidence showing bilateral processing of phonological representations (Hickok and Poeppel, 2007) and hemisphere-specific involvement of non-sensory regions in linking acoustic properties to higher order linguistic and non-linguistic representations (e.g., Gandour et al., 2000; Wong et al., 2004). Although a functional dissociation between processing linguistic and non-linguistic information cannot be characterized as a straightforward hemispheric split, the existence of such a dissociation for processing phonological and indexical information in speech has received support from neuropsychology and neuroimaging. It has been shown that, similar to the process of speech sound and spoken word recognition, the identification of speaker’s voice also follows a stream of hierarchical processing (Belin et al., 2004) which distinguishes regions involved in analyzing voice-related acoustic properties, such as bilateral HG and STG/STS (Andics, McQueen, and Petersson, 2013; Bonte, Hausfeld, Scharke, Valente, and Formisano, 2014; Von Kriegstein et al., 2003) from those that are sensitive to specific, arguably ‘abstract’ speaker identity, such as the right anterior temporal lobe (Andics, McQueen, Petersson et al., 2010; Belin and Zatorre, 2003; Campanella and Belin, 2007) and right inferior parietal regions (Andics et al., 2013; Van Lancker, Cummings, Kreiman, and Dobkin, 1989). The existence of hierarchical organization for both phonological and indexical processing in the human auditory cortex raises an interesting question. At what level does the processing of the two types of information become dissociated? While functional dissociation is expected for the higher-order processing modules of phonological categories and speaker identity, it is less obvious to posit dissociations for lower-level acoustic analyses of phonological and indexical related properties; phonetic and indexical information may be carried by acoustic representations with similar spectro-temporal properties. For instance, the fundamental frequency F0, which is viewed as an acoustic cue for the recognition of a speaker’s voice, is intrinsically linked to the formant structure that
[Figure 4.4 near here. The schematic depicts low-level acoustic analyses (e.g., multi-time resolution sampling, STRF; bilateral HG, STG) feeding phonetic-specific analyses (bilateral STG/STS, left hemisphere bias) and voice-specific analyses (bilateral STG/STS, right hemisphere bias), which lead respectively to retrieval of word meaning (left MTG, left anterior temporal lobe, other regions) and retrieval of speaker identity (right anterior temporal lobe), with phonetic-voice interaction at the sensorimotor interface (right Spt) and phonetic-voice interaction in memory (widely distributed?).]
Figure 4.4 Schematic model of interaction across levels and functional symmetry/asymmetry in processing speech. Separate processing of lexical and voice information in the human cortex is preferentially segregated in left and right pathways, allowing concurrent analysis of abstract representations and episodic information.
determines vowel identity. Therefore, the neural populations that are sensitive to either of these properties are activated during the processing of both phonetic and indexical information, resulting in an overlapping auditory network for the two functions. In line with this viewpoint, early neuroimaging studies showed that largely overlapping areas in auditory cortex are activated during the processing of phonetic and speaker information (e.g., von Kriegstein et al., 2003). However, the account suggesting a shared neural substrate for auditory processing of phonological and indexical information was challenged by a series of recent studies that made more fine-grained assessments of the activation patterns related to the encoding of specific auditory representations of these two types of information. For instance, Bonte and colleagues (2014) applied a multivariate classification approach to reveal activation patterns within the auditory cortex that not only correlate with the type of representations being processed by the listeners (i.e., speaker vs. vowel) but also predict the specific identity of a perceived representation. The advantage of the multivariate analyses is that they can highlight brain regions whose activation patterns are functionally informative for the subsequent encoding of a specific representation while removing regions that are activated by the stimulus but provide little contribution to the encoding of the identity of the representation. The results showed that the neural encoding of vowel and speaker identity relied on partially anatomically overlapping yet largely locally distinct regions within HG and STG/STS. This demonstrates the existence of sub-regions within the human auditory cortex that are specialized in domain- specific analyses. These sub-regions may be involved in computationally dissociable operations on phoneme-versus speaker-invariant features from the low-level acoustic
representations of the speech signal. In fact, dissociable computations for phonological and speaker invariance are in line with the contrast enhancement mechanism proposed as one possible account to solve the issue of speaker normalization in speech perception, which has been demonstrated both behaviorally (Holt, 2006; Watkins, 1991) and neurobiologically (Sjerps et al., 2019). In summary, the neural data convincingly demonstrate that cortical processing of phonological and indexical information is implemented in partially dissociated pathways and based on separate computations. This observation is not compatible with the narrowly episodic view of speech perception according to which speaker-specific information is directly encoded along with the phonological form of lexical items (Goldinger, 1998; Port, 2007). However, the perspective can be reconciled with a "hybrid" view of spoken word recognition, which involves, principally, the mapping of the speech signal onto abstract phonological word forms but also incorporates the concurrent processing of speaker-specific information, stored in a separate memory structure, in the process. Such a hybrid model implies functional integration between abstract lexical representations and (presumably also abstract) representations of speaker identity. In fact, such integration is possible through associations between representations stored in different memory repositories. Facilitation in lexical retrieval has been demonstrated upon successful association between auditory word forms and various environmental sounds, which, given their non-speech nature, cannot be considered as part of the lexicon (Cooper and Bradlow, 2017; Pufahl and Samuel, 2014). The essential challenge to a hybrid model (such as schematized in Figure 4.4) derives from the fact that any facilitation effect of speaker identity on lexical processing can only be observed via specific word-voice pairings. Put differently, given that, on a hybrid model, the objective of spoken word recognition is to retrieve the abstract phonological code of the wordform, the facilitation of this process by speaker voice recognition implies a quicker detection of phonetic details specific to the speaker, which leads to more efficient extraction of phonologically invariant representations from the raw speech input (see e.g., Zarate et al., 2015). Note that, as we mentioned above, speaker normalization can be achieved by low-level auditory contrastive mechanisms without the need to recognize speaker identity. However, in some cases, specific phonetic-voice associations can occur for acoustic attributes that cannot be easily normalized by contrast enhancement mechanisms. For instance, in a recent study (Myers and Theodore, 2017), the authors exposed participants to words produced by two speakers with characteristically different VOT values and then engaged them in a phoneme categorization task on the stimuli from the two speakers. They found neural sensitivity to the specific pairing between VOT variant and speaker identity in the right parietal-temporal junction (Spt), while phonetic judgment of VOT variants activated bilateral pSTS.
The involvement of right Spt in processing phonetic-voice association was in line with previous neuropsychological findings (Van Lancker et al., 1988; Van Lancker, Kreiman, and Cummings, 1989) and neuroimaging studies (Andics et al., 2010, 2013) that linked right inferior parietal regions to the ability to identify familiar voices from speech.
Based on the crucial involvement of the left Spt in sensorimotor interaction of phonological representations, we speculate that the right Spt plays a similar role in the "dorsal pathway" of voice processing (see also Sammler, Grosbras, Anwander, Bestelmeyer, and Belin, 2015). In particular, it is involved in linking auditory/voice representations with associated phonetic/motor features (VOT in the current case). The dorsal pathway contrasts with the ventral pathway, which presumably includes the bilateral anterior temporal lobe, where the semantic aspects of a speaker's voice are encoded. Mirroring the dual pathway of speech processing, it seems that the ventral pathway for voice processing is also bilaterally organized but with a preference for the right hemisphere, while the dorsal pathway for voice processing is expected to be rightward lateralized. In addition to its role in sensorimotor integration during speech/voice perception, the right Spt could also be involved in the simulation/estimation stream for sensory prediction (Tian, Zarate, and Poeppel, 2016), which generates the sensory consequences of a specific phonetic/voice pairing and sends them to STG/STS to be compared with the auditory input. This predictive mechanism could be useful during speech perception to create the sensory targets of speech sounds with specific phonetic details from a particular speaker, which, in turn, facilitates phonological mapping of the speech sounds.
4.4.3.3 Functional interaction between feedforward and feedback processes We emphasized in this chapter a bottom-up, feedforward perspective. That this view is limited is obvious: a number of top-down and feedback processes influence speech recognition and lexical access. (Magnuson and Crinnion, this volume, covers spoken word recognition. Nozari, this volume, discusses word production and its neural basis.) A growing body of research, for example, emphasizes the value of predictive coding in perceptual recognition (e.g., Gagnepain, Henson, and Davis, 2012). Indeed, there is a long history of interactionist accounts of speech processing that allow for the feedforward and feedback nature of the underlying processes (e.g., see McClelland and Elman, 1986 for comprehension; Dell, 1986 for production). We decided, strategically, to limit the discussion here to the central issue of abstractionist versus episodicist accounts, and we decided that the additional factors of feedback/top-down effects as well as predictive coding would go too far and detract from the specific focus. Not to abandon the topic entirely, however, we point briefly to two promising concepts in ongoing research. An idea originally developed in the 1950s that has gained considerable traction in recent work on machine learning and artificial intelligence is the concept of analysis- by-synthesis (AxS; Halle and Stevens, 1959; Poeppel and Monahan, 2011). AxS posits an active, generative, top-down mechanism that provides the perceptual system with candidate representations (synthesis) to facilitate analysis. AxS starts with (i) the extraction of coarse cues from the input signal to generate hypotheses about the input. The cues are sufficient to generate plausible guesses about classes of speech sounds (for example, plosives, fricatives, nasals, and approximants) permitting subsequent narrowing. (ii) This is followed by the synthesis of potential sequences consistent with the input. (iii) A comparison between synthesized phonemic targets and the input from the auditory
analysis of speech is made. These steps generate an output sequence of speech that can be mapped to a lexical item. This prescient theory from the 1950s capitalizes on feedforward and feedback mechanisms, access to abstract linguistic knowledge, and links to the complex relation between perceptual and motor systems. From a conceptual perspective, AxS bridges key components of both auditory theories and motor theories of perception. Given substantial feedback, prediction, and top-down processing, how are these sets of operations related to the architecture discussed above? Figure 4.2 depicts the dual stream model, which focuses relatively strongly on feedforward processing. Recent experiments investigating speech production have yielded a model that is, in a schematic sense, the inverse of the dual stream model, namely the "dual stream prediction model" (Tian and Poeppel, 2013). Based on experiments investigating the internal forward models underlying speech production, it was discovered that the processing stages typically observed in the dual stream forward model can be readily "inverted" for local prediction in speech. That suggests that the dual stream architecture is not just effective for feedforward recognition but can also subserve predictive, top-down processing, lending broader support for this important notion about the cortical architecture of speech processing and word recognition.
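To give the analysis-by-synthesis idea a concrete, if toy, form, the sketch below walks through the three steps named above: coarse cues narrow the candidate set, a forward model synthesizes an expected auditory pattern for each remaining candidate, and the candidate with the smallest mismatch to the input is selected. This is not an implementation of Halle and Stevens' proposal; the lexicon, class templates, and "signal" are invented stand-ins.

# Toy sketch of an analysis-by-synthesis (AxS) loop in the spirit of Halle and
# Stevens (1959): guess coarse sound classes, synthesize candidate forms, and
# keep the candidate whose predicted auditory pattern best matches the input.
# Everything here (lexicon, feature templates, "signal") is a made-up stand-in.
import numpy as np

LEXICON = {"pad": "plosive-vowel-plosive",
           "fad": "fricative-vowel-plosive",
           "pan": "plosive-vowel-nasal"}

def coarse_analysis(signal):
    """Step (i): extract coarse cues; here, just a crude class guess for the first segment."""
    return ["plosive" if s > 0.5 else "fricative" for s in signal[:1]] + ["vowel", "?"]

def synthesize(word):
    """Step (ii): predict an auditory pattern for a candidate word (toy forward model)."""
    template = {"plosive": 0.9, "fricative": 0.2, "vowel": 0.5, "nasal": 0.4}
    return np.array([template[c] for c in LEXICON[word].split("-")])

def analysis_by_synthesis(signal):
    """Step (iii): compare synthesized candidates with the input; return the best match."""
    classes = coarse_analysis(signal)                       # narrows the candidate set
    candidates = [w for w in LEXICON if LEXICON[w].startswith(classes[0])]
    errors = {w: np.sum((synthesize(w) - signal) ** 2) for w in candidates}
    return min(errors, key=errors.get)

noisy_pad = np.array([0.85, 0.55, 0.8])   # acoustic-ish input resembling "pad"
print(analysis_by_synthesis(noisy_pad))   # -> 'pad' (best-matching synthesized candidate)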
4.5 Conclusion Speech recognition involves transformations from acoustic features onto phonetic representations, phonetic representations onto phonological representations, and ultimately access of the lexical items based on their phonological structure. The process is influenced by indexical information such as speaker identity. There are also important top-down components in speech processing. To date, the combined data arising from behavioral research and neurobiological approaches show that the process can be decomposed into a number of subroutines that are differentially implemented in the brain. Some of the progress that has been made reflects that speech processing and lexical access are the kinds of problems that yield to a combination of approaches from linguistics, experimental psychology, and neuroscience. Not all questions are amenable to interdisciplinary inquiry, but this domain has been successfully elucidated by incorporating critical insights across approaches.
Part IB
MEANING
Chapter 5
Morphology and The Mental Lexicon
Three questions about decomposition
David Embick, Ava Creemers, and Amy J. Goodwin Davies
5.1 Introduction The most basic question for the study of morphology and the mental lexicon is whether words are decomposed: informally, this is the question of whether words are represented (and processed) in terms of some kind of smaller units, that is, broken down into constituent parts. Formally, what it means to represent or process a word as decomposed or not turns out to be quite complex. One of the basic lines of division in the field classifies approaches according to whether they decompose all “complex” words (“Full Decomposition”), or none (“Full Listing”), or some but not all, according to some criterion (typical of “Dual-Route” models). However, if we are correct, there are at least three senses in which an approach might be said to be decompositional or not, with the result that ongoing discussions of what appears to be a single large issue might not always be addressing the same distinction. Put slightly differently, there is no single question of decomposition. Instead, there are independent but related questions that define current research. Our goal here is to identify this finer-grained set of questions, as they are the ones that should assume a central place in the study of morphological and lexical representation.
5.1.1 Morphology A great deal of research on words and morphology in theoretical linguistics takes an architectural focus, asking questions like Is morphology part of the syntax? Is there
78 David Embick, Ava Creemers, and Amy J. Goodwin Davies a “lexicon” in which words are derived and stored?1 Very different answers to these questions have been developed and continue to be debated in an active literature. For the purposes of this chapter, it is important to highlight two points of consensus that have emerged from what are often very distinct lines of investigation. The first is a matter of representation; the second concerns the direction that further research should take. The representational point centers on how morphological relatedness is encoded. The consensus that has arisen in morphological theory is that morphological features serve this purpose. While theories might say different things about how these features are represented, there is agreement that, for example, past tense verbs like played are associated with a feature like [+past] that is in turn related to both phonology and semantics. The nature of these features is examined in detail in Section 5.3. For now, we note merely that the very fact that there is a consensus of this type is striking, given that it arises in approaches that otherwise differ markedly in terms of what they say about a number of other issues. Much of our discussion will thus concentrate on the role(s) features of this type play in explaining the relationships between words—a particularly important topic in the light of recent approaches that deny the existence of specifically morphological representations (see below). Concerning the second point, directions for further research, there has been a move in recent years toward increasing integration of paradigms from theoretical, experimental, quantitative, and computational paradigms. In the past, it has been possible to find analogs across these domains. For example, the idea from early work in theoretical linguistics that words that are irregular are stored in the lexicon (cf. Aronoff, 1976; cp. Bloomfield, 1933) resonates with the position taken in Dual-Route (or Dual Mechanism) models of the mental lexicon (e.g., Baayen, Dijkstra, and Schreuder, 1997; Bertram, Laine, Baayen, Schreuder, and Hyönä, 2000; Burani and Laudanna, 1992; Caramazza, Laudanna, and Romani, 1988; Frauenfelder and Schreuder, 1992; Giraudo and Grainger, 2001, 2003; Schreuder and Baayen, 1995). More recent developments indicate a move past analogies and toward the idea that theoretical and experimental paradigms, rather than being connected indirectly or correlating with each other in certain ways, are actually different ways of looking at the same questions. According to one way of thinking about this, different options that have been identified in more “theoretical” lines of investigation are used as hypotheses to make qualitative and quantitative predictions about behavioral (and neural) responses within a single integrated research program (for one set of views on this see Poeppel and Embick, 2005; Embick and Poeppel, 2015; Marantz, 2005, 2013; and Goodwin Davies, 2018). These two points of consensus fit together in the following way. A number of research programs with very different starting assumptions and goals have adopted the idea from morphological theory that morphological features are necessary in the representation of words, in a way that, moreover, involves abandoning what will be called the “textbook”
1 For an overview of some foundational questions, and a look at morphological theories until the late 1980s, Carstairs-McCarthy (1992) is a useful source. See also Alexiadou (this volume) for a perspective.
view of the morpheme in the discussion to come. However, the consequences of this move have not been articulated in detail. A primary goal of this chapter will thus be to show how it leads to a set of research questions that are more refined than a simple "decomposition-or-not" dichotomy.
5.1.2 Three questions The bulk of our discussion concentrates on the following three questions: (Q1) Is there independent morphological representation? The first (and most fundamental) question is whether morphological features—that is, independent morphological representations—exist in the first place. As noted above, while linguistic theories have reached a sort of consensus on this question, this agreement does not extend across the broader language sciences where the mental lexicon is studied. In ways that are made precise below, this question asks, in essence, if the putative effects of morphology could instead be reduced to other types of representations, viz. semantic and phonological ones, or whether there is indeed a kind of unit that is connected to form and meaning, but represented distinctly. (Q2) What does it mean to store (“morphologically complex”) words in memory? A second question concerns what it could mean to store a word in memory. Some theories say, for example, that a past tense form like played is derived by rule and morphologically complex, consisting of the stem play and the past tense morpheme (orthographic) -ed, but that an irregular form like sang is “simplex,” that is, it exists as an unanalyzable object in memory. As we will show, this kind of dichotomy masks several important questions that are implicated in much recent work, but which are not often posed clearly. (Q3) What does it mean to be morphologically complex, or, are morphemes pieces? The third question concerns exactly how morphology is represented. For theories that posit independent morphological representations, an important question to ask is what properties these representations have. One possibility is that morphological features like [+past] are represented independently in memory, in the same way that stems like play and sing are; on this view, played has an internal structure consisting of the two pieces [play]-[+past]. Another possibility is that morphology is not represented in this way but instead appears as part of a representation with a stem, for example, [play +past]; this representation has no internal hierarchical structure. These possible analyses are the subject of a long-standing debate in linguistic theory and have surfaced in recent experimental and computational discussions as well. Our choice to concentrate on these questions is designed to (re)direct inquiry in a way that moves past some familiar dichotomies; the most basic is that between what are
80 David Embick, Ava Creemers, and Amy J. Goodwin Davies classified as Full Listing (Butterworth, 1983; Bybee, 1995; Manelis and Tharp, 1977; Norris and McQueen, 2008) versus Decompositional approaches. The latter are of different types, including those with Full or Obligatory Decomposition (Fruchter and Marantz, 2015; Taft and Forster, 1975; Taft, 1979, 2004) of all complex words versus Dual-Route approaches (Baayen et al., 1997; Bertram et al., 2000; Frauenfelder and Schreuder, 1992; Marslen-Wilson and Tyler, 1998; Pinker and Ullman, 2002) that posit decomposition for some words but not others, according to different criteria. Zwitserlood (2018) provides a succinct overview of these and other approaches. While all of these ways of classifying and opposing approaches provide convenient reference points, we will be at pains to show that the refined set of questions (Q1–3) does a better job of posing the questions that are (or should be) in focus in investigating morphological and lexical representation.
5.2 (Independent) morphological representation What does it mean (Q1) to have specifically morphological representations? Perhaps the most straightforward way of approaching this question is by thinking about what it means for words to be related to one another; for illustration, we will employ the words play and played. In one way of talking about how they are related, these two words share the “stem” (or “root”) play; the past tense word has a feature [+past] in addition to the stem. In ways that different theories specify, this feature in turn relates form (the phonology /d/) and meaning (the semantics of “past tense”). The (informal) principle assumed in this type of work is that generalizations about form and meaning can be maximized by minimizing the amount of information that has to be memorized. Based on the existence of other present/past pairs in the language like amaze/amazed, blend/blended, rig/rigged, and so forth, it can be observed that there are a number of past tense forms that look like the present form with a /d/added; allowing for phonological devoicing, many more such pairs can be adduced (pass/passed, buff/buffed, etc.). Storing all of these words in memory would, according to this way of thinking, fail to account for the observation that the past tense is signaled by the addition of /d/. Seen in this way, the [+past] feature anchors or provides a locus for encoding the generalization that past tense has this particular form. The way of talking about morphological relatedness outlined above is an accurate though partial representation of an important line of reasoning, one that we will expand on below. It is not the only way of approaching morphological relatedness, though. It could be denied, for instance, that there is anything to be gained by minimizing the number of morphemes in memory. More relevant for our immediate purposes is the fact that given what we have considered to this point, it looks as though semantic and phonological relatedness might exhaust what there is to be said about the relatedness of
words. In particular, given that play and played overlap considerably in meaning (they both relate to "playing"), and given, moreover, that they are quite similar in form (they share most of their phonological content), why would we need an "independent morphological representation"—that is, a feature like [+past]? This is the question at the heart of this section: given that phonological and semantic relatedness are independently motivated, why would specifically morphological representations be required as well?
5.2.1 Abstraction from form and meaning The answer to this question is that morphological features are required because words appear to be related in ways that go beyond what can be stated in terms of simple overlap in form and meaning. To demonstrate this point, we will review arguments showing that, although appealingly simple, the idea that “past tense means [[]] and is pronounced /d/” is inadequate for explaining the connections between words, even for this kind of apparently straightforward example. First, some further context is required. The simple view just outlined is based on the idea that morphemes connect a single form to a single meaning. It is a version of a “textbook” or classic approach to the morpheme, one that is associated with works like Hockett (1958)—see Aronoff (1976). As Aronoff discusses, the textbook morpheme requires three things: a constant form, a constant meaning, and an arbitrary link between the two. However, the reality that emerges from morphological theory shows conclusively that form/meaning connections are much more complex than this. To illustrate this point, it is useful to think about what this kind of morpheme would look like. Simplifying considerably, the approach to morphology presented in Lieber (1980, 1992) involves morphemes (“lexical items”) with the relevant properties; a past tense morpheme, for example, could be represented as in (1): (1) [ [v] past’, /d/] This type of morpheme is a single object whose underlying content includes semantics (here past’), a phonological underlying representation /d/, and an indication of the morpheme’s combinatory properties (in this case, that this affix attaches to verbs). While theories involving morphemes like (1) continue to be explored, the prevailing view in morphological theory, with roots in approaches from the 1960s and 1970s in work by Beard (1966), Matthews (1972), Aronoff (1976), and others, is that morphemes that combine syntax/semantics/form in a single item in memory as in (1) cannot account for the ways in which natural languages relate form to meaning. This idea is an important one. It continues to grow in significance as its implications have begun to have an impact on experimental and quantitative research, in approaches with very different starting points. For example, it appears both in experimental work connected to linguistic theories like Distributed Morphology (Halle and Marantz, 1993) and also
82 David Embick, Ava Creemers, and Amy J. Goodwin Davies in work exploring learning models like Baayen, Milin, Đurđević, Hendrix, and Marelli (2011) (see also Section 5.5). It is thus of central importance in understanding current discussions of morphology and the mental lexicon. The conclusion drawn in theoretical works like those cited above is that the semantic and phonological facets of the morpheme must not be represented in the same primitive object. This position has come to be known as the Separation Hypothesis. The key role that Separation plays for present purposes is as follows: when the sound and meaning parts of the morpheme are separated from one another, it is no longer necessary that there be “one form/one meaning” as there is with the classic morpheme. In terms of the example above, it says that notions like “past tense form of the verb” need not have (i) a single sound form associated with them, nor (ii) a single semantic interpretation. Second, it provides a way of encoding the idea that there nevertheless is a single coherent notion of past tense verb forms. Whether regular or irregular, all past tense verb forms (played, gave, bent, went, etc.) have the same distribution in the syntax of the language. It is this identity, one that is abstracted from a single phonological realization or semantic interpretation, that is accounted for with the feature [+past]. We will unpack the phonological and semantic components of the argument for Separation in turn, beginning with the former. In a typical way of implementing Separation, phonology-free morphological features like [+past] are provided with a form through a process of phonological realization. The details of phonological realization can be implemented in different ways that depend to some extent on the “morphemes as pieces (or not)” question (Q3). At a very general level, the idea is that for a feature like [+past], there must be a rule or object that associates it with a phonological form. The object in (2), a Vocabulary Item in the terminology of Distributed Morphology, is one way of doing this (see Embick, 2015 for an overview): (2)
[+past] ↔ /d/
Informally, (2) says that the feature [+past] is provided with the phonological form /d/. Relating form to features as in (2) allows a morphological feature to be realized in a way that is influenced by its context. So, for example, the past tense [+past] is realized as /t/ in the context of bend, keep, and several other verbs. This can be accounted for by making the realization sensitive to the verb to which the [+past] feature is attached; we show this in (3), to be interpreted as "[+past] is realized as /t/ in the context of LIST," where LIST = {bend, keep, . . .}:2
(3) [+past] ↔ /t/ / LIST __
2 A comprehensive account would also address the changes to the stem vowel that is found in forms like bent, kept, and others.
Morphology and The Mental Lexicon 83 In a standard way of thinking about (2) and (3), there are principles of competition that ensure that the more specific (3) applies when its conditions on application are met, such that bent is derived (and *bended is not); see Embick (2015) for a general discussion. Terminologically, a feature like [+past] that shows different phonological realizations in this way shows (contextual) allomorphy. The discussion above shows that a morpheme’s feature content is sometimes abstracted away from a particular phonological realization, such that the context in which a morphological feature appears plays a crucial role in determining its phonological form. The same types of considerations apply to semantic interpretation as well. Aronoff ’s (1976) arguments against the classic morpheme provide a way of thinking about this. Based on the behavior of “bound roots” in the Latinate vocabulary of English, such as MIT in e-mit, o-mit, com-mit, and per-mit, or CEIVE in con-ceive, de-ceive, and per-ceive, Aronoff concludes that the roots in question are morphemes that have no inherent/constant meaning, and that the traditional definition of the morpheme must therefore be abandoned or modified. In one way of looking at this, the argument is that the meanings of certain morphemes may be abstract in the same way that their phonological form is, and require crucial reference to their context. For discussion of this point, see Marantz (2013), Marantz and Myler (forthcoming), and references cited there, where (on analogy with the phonological side) the contextual interpretation of morphemes is called allosemy.3 Contextual effects can also be identified with inflectional morphology. While past tense forms are used in simple past tense contexts (Mary laughed), past tense forms are also used to express non-past meanings; for example, the kind of irrealis interpretation associated with certain conditional antecedents (If Mary laughed at the meeting tomorrow, we would be surprised). Past tense is also used in polite contexts (Did you want to have the cake as well?). Even if there is a way of connecting these different meanings, by, for example, abstracting a notion of “remoteness” that is common to all, the point is the same: there is not a unique association between [+past] and a single meaning like “past tense.” Taken together, these observations point to the sense in which morphological representations like “past tense” are abstractions: one and the same feature like [+past] need not have a single phonological representation, nor does it need to correlate with a single meaning. On this point, it is important to note that identical forms are realized for a given verb in the different meanings that have been identified: (4) Mary talked/left/spoke/sang/went to the beach yesterday. (5) If Mary talked/left/spoke/sang/went to the beach tomorrow . . . 3 Aronoff, and others following him in the tradition of lexeme-based morphology, speak of the situation as one in which individual lexemes like emit have meanings, even though their constituent parts do not. This approach differs in interesting ways from one in which morphemes are assigned interpretations by virtue of the morphemes that appear in their context, as is suggested later in the main text. The differences between these approaches do not detract from the point on which they agree, which is that morphemes have an identity that is abstracted from particular/consistent meanings.
84 David Embick, Ava Creemers, and Amy J. Goodwin Davies In other words, the fact that the same form of a verb is found with past tense and (certain) irrealis meanings is not an accident. Rather, it calls for an abstract identity that underlies a set of disparate contextually determined phonological and semantic realizations; and this is what [+past] provides. For reasons that have been introduced above, the feature [+past], though abstract from the point of view of form and meaning, plays a definite role in the grammar: it determines the distribution of the verb forms that we call “past tense.” In this way, it is directly connected to the syntax of the language, since this is the part of the grammar that governs where past tense forms do and do not surface. Crucially, though, the syntactic distribution of past tense forms is independent of how those forms are realized phonologically. The abstraction inherent in [+past] thus provides a way of connecting regular and irregular verbs, which are often treated in very different ways. The important point is that (abstracting away from irrelevant effects of lexical meaning) regular and irregular verb forms are identical in terms of their clausal syntactic distribution. This is a clear reason for treating both of them as involving a [+past] tense feature; although how they relate to this feature is contentious, as we will see in Section 5.3.4
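Before turning to processing evidence, the division of labor just described can be made concrete with a minimal sketch. The code below is illustrative only (the item list and its ordering are invented for the example): it simulates Separation-style realization along the lines of (2)-(3), in which the feature [+past] itself carries no phonology, and ordered Vocabulary Items compete to realize it, with a context-restricted item winning over the default.

# Minimal sketch of Separation-style realization with contextual allomorphy.
# Illustrative only: the Vocabulary list and its ordering are invented for the example.
VOCABULARY = [
    # (feature, exponent, context): more specific items list the stems they require
    ("+past", "t", {"bend", "keep", "send"}),   # contextual allomorph, cf. (3)
    ("+past", "d", None),                        # default realization, cf. (2)
]

def realize(stem, feature):
    """Return the phonological exponent chosen for `feature` in the context of `stem`."""
    for feat, exponent, context in VOCABULARY:   # ordered: more specific item first
        if feat == feature and (context is None or stem in context):
            return exponent
    raise ValueError("no Vocabulary Item matches")

print("play +past ->", realize("play", "+past"))   # default /d/ (played)
print("bend +past ->", realize("bend", "+past"))   # listed context /t/ (bent)

The ordering of the list stands in for the competition principles mentioned above; stem-vowel changes (bent, kept) are ignored, as in note 2.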
5.2.2 Independent morphology in lexical processing/ representation The primary challenge for investigating the role of morphological features in lexical processing and representation is disentangling different types of relatedness between words. In particular, any given indication of relatedness that is detected in an experiment could be phonological or semantic in nature, not specifically morphological.5 The basic question to be addressed thus concerns how can it be determined what type(s) of representations are driving observed effects. The question is particularly poignant given different movements that seek to eliminate morphological representations (see Baayen et al., 2011; Gonnerman, Seidenberg, and Andersen, 2007; Milin, Feldman, Ramscar, Hendrix, and Baayen, 2017; Plaut and Gonnerman, 2000; Raveh, 2002; Seidenberg and Gonnerman, 2000, and related work). Despite differences in starting assumptions and implementations, these approaches share the idea that putatively morphological effects can be reduced to relatedness in form and meaning, and do not require morphological representations beyond these.
4 On this theme, see Pinker and Ullman (2002, p. 457), who state that "classical theories of generative phonology and their descendants . . . generate irregular forms by affixing an abstract morpheme to the stem . . . to account for the fact that most irregular forms are not completely arbitrary but fall into families displaying patterns. . . ." This is not an accurate way of describing the motivation for features like [+past], as we have shown here. For the question of how irregular forms might be realized (in terms of storage or not), see Section 5.4.
5 For issues involved in defining phonological and semantic relatedness, see Purse, Tamminga, and White (this volume) and Rodd (this volume).
Morphology and The Mental Lexicon 85 The question of how to distinguish phonological, semantic, and (hypothesized) morphological relatedness manifests itself clearly in morphological priming paradigms, which are in principle designed to probe morphological processing and representation. It is well established that priming paradigms are sensitive to semantic and phonological relatedness between words. For instance, words that are related semantically, such as cat and dog, or carrot and vegetable, typically facilitate each other in associative priming paradigms (see Neely, 2012 for an overview). Similarly, words like fog and dog are related phonologically by virtue of rhyming, a shared phonological property that produces significant facilitation in priming paradigms. Other phonological relationships have detectable effects as well; for example, gray and grape are related phonologically in that the former is a substring of the latter (e.g., Wilder, Goodwin Davies, and Embick, 2019). These and other relations may produce facilitation or inhibition in priming paradigms in ways that relate to the predictions of models of lexical activation and access. For immediate purposes, what is important is the fact that many morphologically related words (e.g., frogs and frog, or teacher and teach) are also phonologically and semantically related. This raises the question of when facilitation and other findings can be interpreted as purely morphological effects, since the effects could also be due to phonological or semantic priming (or an interaction between the two); see, for example, Feldman (2000) and Gonnerman et al. (2007). With these kinds of concerns in mind, various means have been proposed to attempt to rule out semantic or phonological factors in attempts to detect morphological relatedness. Primed lexical decision experiments typically include control conditions containing pairs of items that are similar semantically (e.g., hit → kick) or phonologically (orthographically, in visual presentation), but which are not morphologically related (e.g., bail → boil). Such conditions are then used to compare effect sizes for the different types of relatedness; in this case, to examine whether morphological relatedness in pairs like came → come can be shown to be distinct from semantic and phonological relatedness. A different way in which this issue has been addressed is by examining morphological priming effects for complex words that lack either a close phonological relation (as in English irregular past tense words like taught with respect to teach), or by employing rhyme prime to probe whether monomorphemic targets behave differently from targets with affixes (e.g., dough → code versus dough → snowed; Bacovcin, Goodwin Davies, Wilder, and Embick, 2017). On the semantic side, a growing literature asks whether morphological relatedness can be detected in words that do not share a semantic relation to their stem (e.g., Marslen- Wilson, Tyler, Waksler, and Older, 1994; Smolka, Preller, and Eulitz, 2014; Stockall and Marantz, 2006; Creemers, Goodwin Davies, Wilder, Tamminga, and Embick, 2020).
5.3 Morphological complexity/ storage of words What does it mean (Q2) to store a (morphologically complex) word in memory?
86 David Embick, Ava Creemers, and Amy J. Goodwin Davies Questions about storage have a complex history in linguistic theories of morphology. In a tradition that is often associated with Chomsky (1970), many such proposals arise in theories of the lexicon, a component of grammar whose contents, organization, and very existence are investigated and debated in a complex literature.6 Our focus in this section is on what it would mean for morphologically complex words to be stored. This is an important topic in morphological theory. Early proposals by Aronoff (1976: 43) rely crucially on the idea that at least some morphologically complex words (namely, those that are irregular in any way) are listed in the lexicon. While Aronoff (1976) is interested in derivational morphology (e.g., -ity derivatives like curiosity versus -ness forms like curiousness), the same kind of reasoning about irregularity has been extended to inflectional morphology. For example, familiar versions of the Dual-Route approach to modeling the English past tense hold that irregulars are “stored”, unlike regulars, which are “computed by rule”. What precisely this means, though, is not at all obvious; what might be stored (and how storage relates to composition of complex objects) is a complex matter, one that takes on special importance in the light of the Separation Hypothesis.
5.3.1 Storage and Separation As a starting point, note that storage in memory is typically posited as an alternative to something that is referred to as computation by rule. To see what it might mean to store a complex word, then, it is useful to begin with the question of which aspects of a word’s representation might be rule-governed in a relevant way. Illustrating with the English past tense again, the previous section motivates features like [+past], so that past tense verbs consist of the verb (here capitalized to indicate a degree of abstraction, for reasons that will become clear below) and a past tense feature: for example, [PLAY +past]. As described in Section 5.2, [+past] is realized as the “regular” past tense exponent /d/to produce played. There are two things happening in this example that could be understood as being done “by rule.” The first (R(ule)1) is the process that combines PLAY and [+past]. In standard ways of talking about this relation, the rule is combinatoric in the sense that it combines two independent objects into a complex representation. The second thing (R(ule)2) that is involved here is the process that realizes the [+past] feature, in this example as /d/. This could be effected by a morphological rule, or something similar like the Vocabulary Item [+past] ↔ /d/employed in the previous section. The form /d/is the regular realization of [+past], so this behavior is also rule-governed in the relevant sense.
6 Some additional difficulties arise due to the many different ways in which the terms lexicon, lexical, etc. are used; see Aronoff (1994) and Embick (2015) for discussion.
With these distinctions at hand, it is clear that treating some words as stored and not derived by rule could amount to denying for those words the application of either one of (R1) and (R2), or both, as we will now show.
5.3.1.1 Storing complex objects and their forms For the purposes of this section, we will adopt a sort of operational definition of morphological complexity that says that a word is morphologically complex when it consists of more than one grammatical element; for example, two “stems,” like in compounds (e.g., blackberry), or a stem and a morphological feature like [+past], and so on. It is then possible to consider what it means to store a morphologically complex word in memory in a way that denies both (R1) and (R2). Pinker and Ullman’s (2002) “words and rules” approach takes this stance with respect to (R1)/(R2). Their view is that with irregulars, speakers “memorize a complex word outright rather than parsing it into a stem and an affix” (Pinker and Ullman, 2002: 456). For regular past tense verbs, a stem like play is associated with a feature [+past] that is combined with it by an affixation rule (as in (R1)) that produces play-ed. In notation like that used above, the output of this rule is a word [V play –ed, +past], a verb that appears in past tense syntactic contexts. On the other hand, irregular past tense verbs like gave are stored in the lexicon “as a whole,” that is, as [V gave +past]. This approach denies (R1) for gave, since this word is stored in memory with the [+past] feature, not combined with it. With respect to (R2), there is also nothing rule-like that produces the irregular past tense from the stem; rather, gave is a suppletive stem, along the lines of go/went.
5.3.1.2 Storing stem forms (R1) and (R2) are independent and thus need not be denied or accepted together. One approach that appears in the literature rejects (R2) for certain words but retains (R1). In this kind of approach, complex objects like [GIVE +past] are created by combining two separate elements, GIVE and [+past]. In the realization of this representation, though, there is a difference from what happens with regulars. In particular, gave is a special “stem” in memory that realizes both [GIVE] and [+past]; in the notation used above, this would involve an item [GIVE +past] ↔ gave. This is essentially the approach of Anderson (1992); but it could, in principle, be implemented in any number of realizational approaches with differences in implementation of stem suppletion that could potentially be investigated further.7 In this type of analysis, irregular forms like gave are stored in memory—they are thus not related (morpho)phonologically to the form realized in give. However, what is stored in memory is not a complex object; the form gave is simplex, and realized in the complex object [GIVE +past].
7 For example, rather than treating sang as realizing both [SING] and [+past], it could be treated as a contextual allomorph for SING that is realized in the context of [+past]; see Embick (2010b, 2015) for discussion.
5.3.1.3 Accepting (R1), maximizing (R2) The approaches considered immediately above all posit that “storage as a whole” is a property of at least some morphologically complex words. They thus stand in contrast to theories that employ (R1) across the board and maintain (R2) to the fullest extent possible, which often bear the label Full Decomposition (e.g., Stockall and Marantz, 2006). For this type of approach, the (R1) properties are those outlined in the first part of Section 5.3.1.2: all past tense verbs, whether regular or irregular in form, involve the composition of the verb with the feature [+past]: [PLAY +past], [GIVE +past], and so on. It is with respect to (R2) that more needs to be said. If (R2) is accepted as much as possible (see below), it is not the case that, for example, gave is a suppletive allomorph of give. Instead, there are (morpho) phonological rules that produce the stem change seen in gave, so that give and gave share a phonological underlying representation. An approach of this type must nevertheless put some limits on (R2): there are some cases—the extremes of form/meaning unpredictability—where storage is required. These involve the phenomenon of (stem) suppletion, which cannot be treated morphophonologically. So, for example, the verb go and its past tense went are suppletive alternants, such that went is inserted when GO combines with [+past]. What about alternations that share some segmental material but less than give/gave—bring/brought, for example, or think/ thought; are these related morphophonologically like give/gave, or suppletively like go/ went? This question is a complex one with a long history. It is, importantly, one where it appears that convergence between different lines of research is required; that is, one where experimental investigation grounded in distinctions made in the theoretical domain are required; see Embick (2010b) and Embick and Poeppel (2015) for some discussion.
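The three options reviewed in this section can be summarized schematically. The sketch below is illustrative only (the labels and helper functions are invented); it shows, for give/gave, where each option locates the listed information: the whole inflected word (Section 5.3.1.1), a suppletive stem realizing a composed object (Section 5.3.1.2), or a shared underlying form plus a morphophonological rule (Section 5.3.1.3).

# Schematic sketch of the three storage options for give/gave. Illustrative only:
# labels, helper functions, and the underlying form are invented for the example.

def compose(stem, feature):
    """(R1): combinatoric rule building a complex object, e.g. [GIVE +past]."""
    return (stem, feature)

# (a) Whole-form storage: (R1) and (R2) are both denied for gave; the inflected
#     word is simply listed together with its feature content.
LISTED_WORDS = {("GIVE", "+past"): "gave"}

# (b) Stem storage: (R1) applies (the object [GIVE +past] is composed), but the
#     composed object is realized by a listed, suppletive stem; no (R2) rule relates it to give.
LISTED_STEMS = {("GIVE", "+past"): "gave", ("GO", "+past"): "went"}

# (c) Full decomposition: (R1) applies and (R2) is maximized; give/gave share an
#     underlying form and a morphophonological rule derives the vowel change.
def realize_by_rule(underlying, feature):
    return underlying.replace("I", "a") if feature == "+past" else underlying.replace("I", "i")

word = compose("GIVE", "+past")
print("(a)", LISTED_WORDS[("GIVE", "+past")])   # retrieve the listed word outright
print("(b)", LISTED_STEMS[word])                # listed stem realizes the composed object
print("(c)", realize_by_rule("gIv", "+past"))   # "gav": rule applies to a shared underlying form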
5.3.2 Detecting storage in morphological processing/ representation Theories that apply (R1) across the board and maximize (R2) are at one limit of the decomposition versus storage divide; the other limit is theories that posit that all words are memorized (i.e., Full Listing). Many (psycho)linguistic approaches adopt an intermediate position, according to which particular criteria determine whether a word is stored or derived. Perhaps the most prevalent type of approach is one in which words that are regular are derived by rule, whereas words that are irregular are stored (see Zwitserlood, 2018 for a recent overview).8 The notions of regular and irregular that 8
Additional criteria that have been suggested for the “storage-or-not” dichotomy include frequency and productivity. It has, for instance, been proposed that high-frequency regular forms are stored, on a par with irregular forms (Stemberger and MacWhinney, 1988), or that words with a low surface frequency and high-frequency constituents may be processed through a parsing route (Burani and Laudanna, 1992; Chialant and Caramazza, 1995; Laudanna and Burani, 1995). In a similar vein, the Morphological Race Model posits that besides semantic and phonological regularity, morphological productivity plays a role in deciding which words are stored (the “direct route”) and which are
Morphology and The Mental Lexicon 89 are required here have manifestations in both form and meaning and can be defined in different ways. For the form side, regulars are typically identified as having default or productive morphology, while irregulars show unpredictable allomorphy, or unproductive forms. On the meaning side, regulars have meanings that are semantically transparent, or predictable from their parts, or compositionally derived, while irregulars have unpredictable or special meanings. For reasons of space, we will concentrate here on the form side, since this is where most of the work related to (R1)/(R2) has been done.9 The interpretation of experimental evidence related to the “storage-or-not” question often proceeds along lines that must be reconsidered if the (R1)/(R2) distinction is taken into account. A typical kind of reasoning is based on experiments that examine relatedness between regulars and irregulars and their stems, where prior findings have identified some ways in which regulars and irregulars behave differently, and some in which they behave the same. Many experimental studies, both behavioral and neurolinguistic, have reported that regulars and irregulars differ in at least some ways (see, e.g., Stanners, Neiser, Hernon, and Hall, 1979; Kempley and Morton, 1982; Napps, 1989; Marslen-Wilson, 1999; Allen and Badecker, 2002; Pastizzo and Feldman, 2002). In contrast, some later work that combines priming with magnetoencephalography (MEG) or electroencephalography (EEG) shows that both regular (dated) and irregular (gave) allomorphs of a verb (date, give) prime their stem (Stockall and Marantz, 2006; Morris and Stockall, 2012). Relatedly, but focusing only on irregular forms, Crepaldi, Rastle, Coltheart, and Nickels (2010) also find facilitation for irregular verb pairs (fell → fall) compared to orthographically matched (full → fall) and unrelated controls (hope → fall) in a series of masked priming experiments. When differences between regulars and irregulars are found, this is often taken to be evidence for Dual-Route accounts for inflection. However, the idea that there are differences between regulars and irregulars does not require that particular implementation of the difference. Even Full Decompositional models of the mental lexicon have to posit some difference between formations that have irregularities associated with them and those that do not, since irregulars require a type of listed information (Embick and Marantz, 2005). Moving past the idea that any regular/irregular distinction requires a Dual-Route model, we will conclude this section by looking at the question of how differences/similarities in the representation and processing of regulars and irregulars derived (the “parsing route”). In this model, productive forms (to a first approximation, those that are generalized to new forms, consider, e.g., the verbs google/googled) are parsed while unproductive forms are stored (Baayen, 1992; Frauenfelder and Schreuder, 1992). 9
On the meaning side, a number of studies look for differences between words that are semantically unpredictable, which are called opaque, and words whose meanings appear to be transparent: that is, derived predictably from the meaning of their parts. A number of studies report stem-priming from both transparent and opaque derivatives; see Smolka et al. (2009, 2014, 2019) and Creemers et al. (2020) for recent examples. One line of the literature hypothesizes that there are cross-linguistic differences in such effects, that is, that opaque forms do not prime in some languages (see Günther, Smolka, and Marelli, 2019 and references cited there). However, it is possible that differences are due to how opaque is defined across studies (see Creemers et al., 2020).
90 David Embick, Ava Creemers, and Amy J. Goodwin Davies can be approached in terms of the (R1)/(R2) distinction. The gist of this part of the discussion is that similarity or difference between these two types of words depends on what types of representations and relatedness a particular experiment is probing. The idea we put forward is that in light of the (R1)/(R2) distinction, arguments in either direction (different or similar behavior) must be examined at a finer grain. Suppose that regulars and irregulars are found to differ in some way. Is that evidence relating to (R1) for them being associated with the feature [+past] in different ways (by rule with dated, but stored with gave)? Or is it evidence related to (R2) that the ways in which their phonological forms are related differs? In the other direction, where it looks like regulars and irregulars are both related to their stems, the same kinds of questions can be asked. If regulars and irregulars both prime their roots, is this evidence (R1) that both dated and gave are related to a structure [Root +past], such that the present and past tense forms share a morphological representation, that is, a shared stem GIVE or PLAY? Or is it evidence (R2) that gave is derived (morpho)phonologically from give, that is, like played is from play? To make matters more complex, the effects of both (R1) and (R2) types of relatedness might be found in experimental paradigms that are standardly employed. The key question for further work in this area is thus how to distinguish the relative contributions of the (R1) and (R2) types of relatedness experimentally.
5.4 Representing morphology What does it mean (Q3) to be morphologically complex, or to say that morphemes are pieces? In Section 5.2 we looked at the question of morphological features. While many approaches agree that such features are required, there continues to be substantial disagreement as to how they are represented. The question at issue is whether features like [+past] are represented as pieces—so that they are similar to stems like PLAY, SING, etc. (at least to this extent). Terminologically, theories that represent morphological features as pieces are called piece-based (or morpheme-based, since the pieces are often called morphemes). The opposing type of approach is referred to as pieceless or amorphous. In the latter type of approach, there are features like [+past], but they are not represented like stems are. The primary point of contrast is that in a piece-based theory, words have internal hierarchical structure, as they are built out of a number of pieces; in a pieceless theory, on the other hand, words relate multiple features, but these are not arranged in a hierarchical structure.
5.4.1 Morphology with/without pieces The debate between these two positions has a long and complex history. We will look at recent representatives of these views here. Beginning with the pieceless point of view, perhaps the most well-developed example is Anderson’s (1992) “Amorphous
morphology" (cp. Matthews, 1972; for a look at related approaches in this tradition see Blevins, 2016). For purposes of illustration, let us take the Latin verb lauda:vera:mus 'we had praised.' In terms of morphological features, this verb has an aspect feature [+perf] (= 'perfective'), a tense feature [+past], and agreement features [+1,-2] (= 'first person'), [+pl] (= 'plural').10 In Anderson's approach, the rules that realize morphological features operate on representations without internal structure, like the one shown in (6): (6)
Amorphous representation: [LAUD, +1, -2, +pl, +past, +perf]
Word formation rules apply to this matrix to rewrite the stem lauda: phonologically, to produce the end result lauda:vera:mus. These rules are string rewriting rules, so that, for example, the one realizing the [+perf] feature would be stated as something like / X/[+perf] → /Xve/. Further assumptions are needed in order to ensure that these rules apply in the correct order. For our purposes, the important thing to note is that this approach does have independent morphological features, but these are not packaged as pieces. Rather, they appear in matrices like (6) and are referred to by rules that effect phonological changes to the stem. Importantly, neither features like [+perf] or its exponent /ve/are pieces on a par with the stem in this approach. In contrast to the amorphous view are piece-based theories accepting Separation; Distributed Morphology (Halle and Marantz, 1993 and subsequent work) is of this type. In this approach, lauda:vera:mus consists of a number of distinct morphemes that are arranged in a hierarchical structure of the type typically depicted with tree diagrams like (7): (7) Morphemes as pieces in a hierarchical structure
LAUDA:  [+perf]  [+past]  [+1,+pl]   (terminal morphemes of the hierarchical structure in (7))
10 This verb also belongs to Conjugation class I (with "theme-vowel" a:) which we put to the side here.
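To make the mechanics of the amorphous view concrete before turning to its piece-based alternative, the rule-based realization of lauda:vera:mus can be sketched as follows. This is our own simplified illustration, not Anderson's formalism; the rule ordering and the exponents /ve/, /ra:/, and /mus/ are deliberately simplified for exposition.

```python
# A minimal sketch of amorphous (rule-based) realization; our illustration only.
# Features sit in a flat matrix, as in (6); word formation rules rewrite the
# phonological string of the stem, so the "affixes" exist only as rule outputs,
# not as stored pieces. Exponents and ordering are simplified for exposition.

features = {"+1", "-2", "+pl", "+past", "+perf"}   # the matrix in (6)
stem = "lauda:"                                     # phonology of the stem LAUD

# Ordered rules of the form /X/[feature] -> /X + exponent/ (cf. /X/[+perf] -> /Xve/).
rules = [
    ("+perf", "ve"),
    ("+past", "ra:"),   # simplified: stands in for the tense marking
    ("+pl",   "mus"),   # simplified: really conditioned on [+1, -2, +pl] jointly
]

form = stem
for feature, exponent in rules:
    if feature in features:
        form += exponent

print(form)   # lauda:vera:mus
```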
92 David Embick, Ava Creemers, and Amy J. Goodwin Davies Morphological features are (parts of) objects in memory (called functional morphemes to distinguish them from the lexical vocabulary of stems or roots) that are combined into larger structures like the one seen in (7). More precisely, these morphemes are the terminals of syntactic trees, trees in which affixation operations produce internally- structured “words” like the one diagrammed in (7). Morphologically complex words thus contain multiple discrete morphemes in a hierarchical structure. This approach shares with the amorphous one the idea that morphological features are crucial. It also shares the property of adopting Separation: the morphemes in (7) are abstract in the sense that they do not possess phonological representations—these are added to them with Vocabulary Items like (8) (recall the discussion in Section 5.2). (8) [+1,-2,+pl] ↔ -mus [+perf] ↔ -ve . . . However, this approach differs from an amorphous one in two related ways. The first is that it treats functional morphemes, which are composed of morphological features, as pieces in the sense described above. A second difference concerns the realization of features. In the amorphous approach, the material introduced by word formation rules has no independent representation; apparent pieces like -ve and -mus are simply the phonological byproducts of the word formation rules. On the other hand, Vocabulary Items like those in (8) are represented as objects in memory. Both of these differences are relevant to interpreting experimental results, as we will see below. The debate between piece-based and pieceless views in theoretical morphology continues to this day. With respect to the approaches cited above, Halle and Marantz (1993) devote considerable attention to comparing the predictions of amorphous and piece-based views in the domain of blocking effects (cf. Aronoff, 1976 and Embick and Marantz, 2008; also Embick, 2015 for discussion). Another related line of work argues that pieceless theories of morphology are not able to account for different types of locality conditions that appear to restrict possible interactions among features in morphological realization (see Embick, 2010a, 2016, and references cited there). Both amorphous and piece-based theories continue to be explored. One issue that arises in making direct comparisons is a difference in research focus. By and large, research programs concentrating on connections with syntax have tended toward the piece-based view, while research concentrating on more specifically morphological issues (cf. Aronoff ’s (1994) “morphology by itself,” and related work) have tended to adopt amorphous representations.
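For comparison, the piece-based alternative can be sketched in the same style; again, this is a deliberately simplified illustration of our own, not the actual machinery of Distributed Morphology. Here the word is a structured arrangement of abstract morphemes, and each functional morpheme is spelled out by a stored Vocabulary Item, as in (8).

```python
# A minimal sketch of piece-based realization; our illustration, not DM itself.
# The word is a (here flattened) hierarchical arrangement of morphemes, and each
# functional morpheme is spelled out by a Vocabulary Item stored in memory,
# pairing a feature bundle with an exponent (cf. (8)). Exponents are simplified.

word_structure = ["LAUDA:", {"+perf"}, {"+past"}, {"+1", "-2", "+pl"}]  # terminals of (7)

vocabulary_items = [                    # stored objects in memory
    ({"+1", "-2", "+pl"}, "mus"),
    ({"+perf"}, "ve"),
    ({"+past"}, "ra:"),                 # simplified tense exponent
]

def spell_out(morpheme):
    """Vocabulary Insertion: return the exponent of one morpheme."""
    if isinstance(morpheme, str):       # the stem supplies its own phonology
        return "lauda:"
    for item_features, exponent in vocabulary_items:
        if item_features <= morpheme:   # the item's features match the morpheme
            return exponent
    return ""                           # no matching item: zero exponent

print("".join(spell_out(m) for m in word_structure))   # lauda:vera:mus
```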
5.4.2 Experimental directions A small but focused literature has taken the pieces-or-not debate into the experimental domain. Here it is important to distinguish between derivational morphology
Morphology and The Mental Lexicon 93 and inflectional morphology, since it is with the latter that most current discussion is associated. In Anderson’s (1992) approach, for example, the word formation rules for derivation have different properties from those that create inflected words, such that the former are arguably more piece-like than the latter.11 A number of studies address the question of whether derivational affixes are represented in a similar way to stems through priming experiments. For example, VanWagenen (2005) finds facilitation for words that share a derivational affix (e.g., darkness → happiness) relative to phonological (harness → happiness) and semantic controls (joy → happiness) in a visual primed lexical decision study. In a cross-modal primed lexical decision task, Marslen-Wilson, Ford, Older, and Zhou (1996) report facilitation for derivational prefixes (e.g., rearrange → rethink) and suffixes (e.g., toughness → darkness), and Duñabeitia, Perea, and Carreiras (2008) show derivational affix priming in Spanish (e.g., brevedad ‘brevity’ → igualdad ‘equality’) in a masked visual priming paradigm. As will be discussed in more detail for inflectional morphology below, it is important in interpreting such findings to be clear about what exactly is being primed. The facilitation that is found between two words that share a derivational affix (such as -ness in darkness → happiness) could, for instance, be driven by a feature [+N], by the phonological form of the affix /nəs/, by the semantics of “abstract noun,” or by a combination of these factors. Since this issue does not seem to have attracted much attention in the study of derivational morphology, we focus on this in more detail in our discussion of inflectional affix priming. For inflectional morphology, the most controversial topic, some initial forays into this part of the morpheme debate have emerged recently, although results are mixed.12 For instance, VanWagenen and Pertsova (2014) find priming effects for a range of verbal inflectional affixes, but no significant effects for several nominal inflectional affixes in Russian. For a group of four inflectional and derivational affixes in Polish (perfective prefixes: s-, a diminutive suffix -ek and an agentive suffix -arz), Reid and Marslen-Wilson (2000) find significant effects when all four suffixes were considered together as a group. For Czech, Smolík (2010) investigates inflectional affix priming for the nominal suffix -a (nominal) and the verbal suffix -ete (second person plural) at two different inter- stimulus intervals. Comparisons between the inflectional affix prime and phonological controls do not reach significance, whereas marginal effects are reported for verbal affix priming at a shorter ISI. Most recently, Goodwin Davies and Embick (2019) examined regular English plural suffixes, and report significant facilitation for inflectional affix priming (crimes → trees) relative to phonological (cleanse → trees) and singular (crime → trees) controls.
11 On this theme, see Borer (2013) for another perspective on how the derivation/ inflection split relates to the pieces versus processes issue. 12 There are several reasons why we would expect facilitation for inflectional affixes to be small relative to stems and/or derivational affixes (such as high frequency, prosodic weakness, and homophony, see discussion in Goodwin Davies and Embick, 2019), and as a result, studies investigating inflectional affixes may be more susceptible to being underpowered.
94 David Embick, Ava Creemers, and Amy J. Goodwin Davies There are two important questions that should be asked regarding inflectional affix priming: (i) what exactly is being primed? and (ii) what does primability imply representationally? The first question is a standard one. Taking into account the distinctions that we have focused on in Section 5.2, it is too coarse to say that crimes primes trees because both of these “involve a plural morpheme.” Instead, it must be asked if the priming is driven by the morphological feature [+pl], the phonological exponent /z/, the semantic interpretation of plurality, or some combination of these; see Goodwin Davies and Embick (2019) for discussion and for suggestions about how these factors might be disentangled. The second question concerns what can be inferred from the existence of a priming effect. Both pieceless and piece-based approaches employ features like [+past]. If features can be primed irrespective of whether they are represented “in their own morphemes” or not, then both approaches would predict facilitation like that discussed above. On the other hand, if priming were expected only of independently stored representations in memory, it is not clear that they would predict the same thing. A similar point can be raised with respect to how features are realized phonologically. In an amorphous approach, apparent affixes like plural /z/are simply the byproduct of a word formation rule that rewrites phonological strings. In a piece-based theory, a Vocabulary Item like [+pl] ↔ /z/is stored as an object in memory. To the extent that rules (and their byproducts) and pieces might differ with respect to facilitation, there is a further dimension along which the approaches might be compared. The overall point that emerges is that when priming is employed to probe the representation and processing of morphology, the specific questions that arise connect with much more general ones about the nature of priming—questions like Are features that are subparts of representations expected to show priming effect? Or Are rules capable of being primed like pieces in memory are? As should be clear, these questions apply in other areas of language (phonology, syntax), suggesting a number of points of possible connection. Another question to ask is what techniques other than priming might shed light on the pieces-or-not discussion, whether behavioral or neurological. In summary, the questions here, though fine-grained, are central to morphological theory, since they implicate the basic representational status of morphemes and hence of words. The challenge is to develop experimental probes that are capable of testing the predictions derived from the pieceless/piece-based representational distinction. While some initial steps have been made in this direction, much work remains to be done in refining and testing the predictions of the views that differ with respect to (Q3).
5.5 Discussion Our focus in this chapter is on the questions (Q1–3), which we believe to be central to the study of morphology and the mental lexicon. These are repeated here in short form to facilitate our concluding remarks:
Morphology and The Mental Lexicon 95 (Q1) Is there independent morphological representation? (Q2) What does it mean to store (“morphologically complex”) words in memory? (Q3) What does it mean to be morphologically complex, or, are morphemes pieces? At the beginning of the chapter, it was suggested that one of the main dividing lines between approaches, whether words are decomposed or not, is too coarse-grained to guide the development of competing theories. Having reviewed (Q1–3) in preceding sections, we now note that there are several distinct and logically independent senses in which the term decomposition might be applied. First, from (Q1), an approach positing specifically morphological features like [+past] could be said to have morphological decomposition, in the sense that (at least some) words consist of more than one independently existing morphological representation. Two additional senses connect with (Q2). Recall that we identified two kinds of effects: those relating to the composition of complex objects like past tense verbs (called (R1) above) and those relating to the phonological realization of the composed objects (R2). For (R1), an approach that composes [SING] and [+past] to produce the past tense of this verb could be called decompositional in comparison to one that says that [sang +past] is stored in memory “as a single object,” that is, with no operation responsible for the composition of its component elements. For (R2), an approach that says that sang is derived from a representation that is shared with sing could be called decompositional in comparison to an approach in which sang is stored as an allomorph of sing. In the former, sang is composed (morpho)phonologically, while in the latter it is not. Finally, and connecting with (Q3), an approach that says that morphological features are represented in independent pieces (i.e. morphemes), similar to what is done with stems, could be called decompositional with respect to “amorphous” alternatives, since words are composed out of independently existing pieces. In summary form, our primary argument is that work on morphological decomposition must take these distinct senses into account in framing opposing theoretical positions, and identify which of (Q1–3) is being examined. An example from the literature helps to illustrate why it is important to be clear about this level of detail. Baayen et al. (2011) investigate the properties of a Naïve Discriminative Learning model of morphology, with an emphasis on reading. They frame their investigation with reference to linguistic theories that “ . . . take the word as the basic unit of morphological analysis,” pitting this view against one that takes morphology to be “a formal calculus with morphemes as basic symbols.” These comments look very much like a stance on (Q3), as does their general claim that “the questions of whether and how a complex word is decomposed during reading into its constituent morphemes are not the optimal ones to pursue.” But the sentence immediately following this one goes in another direction, and asks “how a complex word activates the proper meanings, without necessarily assuming intermediate representations negotiating between the orthographic input and semantics.” The intermediate representations that they are talking about eliminating are independent morphological
96 David Embick, Ava Creemers, and Amy J. Goodwin Davies representations, whose existence or not (in our (Q1)) is logically independent of the question in (Q3) of how such features are represented. Baayen et al. describe their model as one that forms associations between forms and meanings. The former are represented as letter strings. The latter are represented in a “semantics” that allows certain orthographic sequences to be associated with meanings. The idea is that these form/meaning connections will be learned directly, that is, without intermediate (morphological) representations. A closer look at the details of the representations used in the learning model highlight the tension between (Q1) and (Q3) identified above. For example, hand is represented as lexical meaning = {HAND}, number = {}, while hands is stored in the lexicon as lexical meaning = {HAND}, number = {PLURAL}; the model would then succeed if it learned the connection between orthographic s and PLURAL. But what kind of representation is {PLURAL}? This question is at the center of Marantz (2013), which argues that these features are crucially morphological in nature, not semantic. This is because they are given to the learner as discrete and abstract features, independent of the semantic representation of plurality. Put simply, the system is not given the form hands and a situation in which there is more than one hand; it is given that form and an association with {PLURAL} directly. Seen in this way, it is not the case that the Baayen et al. model does away with morphological features by directly learning form/meaning correspondences. Rather, it is given the morphological features like {PAST}, and it learns an association between these features and their phonological realization(s). Far from dispensing with independent morphological representations, the model crucially assumes them. So, while the model might be amorphous in the way that is implicated by (Q3), it posits morphological features along the lines of (Q1). This point seems to us to be very much on target. At the same time, Marantz describes the Baayen et al. approach as requiring morphemes, a way of talking about things that connects closely with (Q3) in a great deal of work, not (Q1). For our purposes, whether one’s sympathies lie with the Baayen et al. (2011) type of project or with Marantz’s take on what it assumes is not at issue; what matters is that the question at issue must be properly identified. This exchange is clearly about decomposition in a general sense, but in order to determine what precisely is at issue, a framework for discussion at the grain of (Q1–3) is required.
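As a concrete, highly schematic rendering of the kind of form/meaning learning just discussed, the following sketch performs Rescorla-Wagner-style updates that associate letter cues with outcome labels such as HAND and PLURAL. It is our illustration, not Baayen et al.'s implementation; the point to note is that PLURAL is supplied to the learner as a discrete label, exactly as in the discussion above.

```python
# A schematic discriminative-learning sketch (ours, not Baayen et al.'s model):
# letter bigram cues are associated with outcome labels; crucially, the outcome
# PLURAL is given to the learner directly as a discrete, abstract label.

from collections import defaultdict

weights = defaultdict(float)            # (cue, outcome) -> association strength
RATE, MAX_ACTIVATION = 0.1, 1.0

def cues(word):
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]   # letter bigrams

def learn(word, outcomes, all_outcomes):
    for outcome in all_outcomes:
        target = MAX_ACTIVATION if outcome in outcomes else 0.0
        predicted = sum(weights[(c, outcome)] for c in cues(word))
        for c in cues(word):            # spread the prediction error over the cues
            weights[(c, outcome)] += RATE * (target - predicted)

all_outcomes = {"HAND", "PLURAL"}
for _ in range(200):
    learn("hand", {"HAND"}, all_outcomes)
    learn("hands", {"HAND", "PLURAL"}, all_outcomes)

# The word-final cue "s#" ends up far more strongly associated with PLURAL
# than with HAND, i.e. the model has learned the s ~ PLURAL correspondence.
print(round(weights[("s#", "PLURAL")], 2), round(weights[("s#", "HAND")], 2))
```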
5.6 Conclusions Questions about decomposition will continue to dominate the investigation of morphology and the mental lexicon. An important trend has brought a range of research programs in experimental and quantitative/computational areas into contact with very specific claims in morphological theory. The particular one that occupies much of the discussion above is the idea that the “classic” morpheme must be abandoned, in a way that connects with the Separation Hypothesis. If the main lines developed above are on the right track, incorporating Separation into debates about decomposition calls for
Morphology and The Mental Lexicon 97 a reassessment; the specific one that we have argued for here centers on the idea that investigation of topics implicating decomposition has proceeded to the point that it is necessary to move beyond a simple “decomposition-or-not” dichotomy. We hope to have taken a step toward doing this, by developing a framework for discussion based on questions (Q1–3).
Acknowledgments We are indebted to Hezekiah Akiva Bacovcin, Johanna Benz, Nattanun Chanchaochai, Lefteris Paparounas, Linnaea Stockall, Meredith Tamminga, Yosiane White, Robert J. Wilder, and the class of LING 575 (Spring 2020) at the University of Pennsylvania for helpful comments on this chapter and the work surrounding it.
Chapter 6
Syntax and The Lexicon
Artemis Alexiadou
6.1 Introduction The relationship between syntax and the lexicon has been at the center of important discussions and debates since the early days of Generative Grammar. In this chapter, I will take as my point of departure Chomsky’s Remarks on Nominalization, which clearly delineates the division of labor between the syntax and the lexicon and offers a modular distinction between the two components, (see also Alexiadou and Borer, 2020; Chomsky, 2020). This model was later adopted in subsequent stages of the theory and is largely assumed within non-Chomskian generative models. Chomsky (1970, p. 184) put forth a grammar model, according to which the grammar contains a base consisting of a categorial component that generates phrase markers and a lexicon. The lexicon for Chomsky contained lexical entries that are built up of specified features. In the model of grammar he envisaged at the time, there is a process of lexical insertion that allows lexical entries to replace the terminal elements in phrase markers, creating a first level of representation, deep structure. Various transformations relate the first level of representation, deep structure, to another level of representation, namely surface structure. The base and the transformations form the syntax. Chomsky’s focus in Remarks was the relationship between syntax and the lexicon and the empirical domains that can be used to provide evidence for the complexity of one or the other component. In this chapter, I will go over Chomsky’s argumentation and discuss more recent approaches to the relationship between syntax and the lexicon that put the burden on the syntactic component. As Chomsky (1970, p. 185) emphasized in Remarks, “the grammar is a tightly organized system; a modification of one part generally involves widespread modifications of other facets.” My focus of investigation in this chapter is rather modest: I will be concerned with argument structure and, in particular, I will take transitivity alternations, nominalization, and adjectival passives to form my empirical basis, see Ramchand (2013) for an overview of other alternations. Because of this, I will have nothing to say about inflectional
Syntax and The Lexicon 99 morphology, partly also because in the main model I will be discussing here, namely Distributed Morphology (DM), a distinction between inflectional and derivational morphology is not being made (see Embick, Creemers, and Goodwin Davies, this volume, but cf. Borer, 2005a).1 The chapter is structured as follows. In Section 6.2, I offer a brief summary of Remarks. In Section 6.3, I turn to the way the syntax-lexicon relationship was dealt with in the model of Government and Binding (GB). In Section 6.4, I offer a brief introduction to the framework of DM, and in Section 6.5, I discuss argument structure within Distributed Morphology. In Section 6.6, I discuss Ramchand’s (2008, 2018) First Phase Syntax framework, and in Section 6.7 Borer’s (2005a, b, 2013) Exoskeletal model. In Section 6.8, I turn to a brief discussion of nominalization and adjectival passives from a DM perspective. In Section 6.9, I conclude.
6.2 Remarks on nominalization The empirical focus in Remarks is the relationship of the two nominal structures in (1b–c) to their verbal counterpart in (1a): (1)
a. John criticized the book.
b. John's criticizing the book
c. John's criticism of the book
There are important differences between gerundive nominalizations (1b) and derived nominals (1c), as Chomsky points out. These include the productivity of the nominalization process, the generality of the relation between the nominal and the verbal form, and the internal structure of the nominalization. To begin with, gerundive nominalization is very productive, and the output of this process gives nominals that are transparently related to their source verb. By contrast, the process that derives nominals of the type in (1c) is not as productive in the sense that it cannot apply to all verbs. Naturally, it could be argued that since the affixes that are involved in derived nominals as in (1c) are Latinate, they do not combine with Germanic roots. -ing, however, combines with both Germanic and Latinate verbs to yield gerunds.
1 A disclaimer is in order: acknowledging that the linguistic schools that have studied argument structure are quite diverse, this chapter does not aim to offer a comprehensive overview of all possible models and different types of predicate decomposition that have been put forth, but see Jackendoff, this volume. Alexiadou (2015) offers such an overview, and see also Pustejovsky (1991), Levin and Rappaport Hovav (2005), Ramchand (2008), Reinhart and Siloni (2004), Borer (2017), Wood and Myler (2019), and Wechsler (2020) for further discussion and references. Moreover, the chapter does not aim to discuss all available syntactic and constructionist views on the lexicon; again, Alexiadou (2015) offers such a comprehensive review.
(2) a. John is easy to please.
    b. John's being easy to please
    c. *John's easiness to please.
(3) a. killing
    b. *killation
Moreover, the semantic relationship between the nominal and the verb is not transparent. For instance, laughter and qualifications have a range of meanings that seem unrelated to the base verb. Importantly, along these differences, the internal structure of the two nominals also differs: the gerund has the internal structure of a verb, while the derived nominal has the internal structure of a noun. This is evidenced by the fact that the gerund can be modified by adverbs, while the derived nominals are modified by adjectives (4a–b).2 The gerund does not allow any overt determiner apart from the possessor, while the derived nominals can be introduced by a variety of determiners (4c–d). (4)
a. John's criticizing the manuscript carefully
b. *John's criticism of the manuscript carefully
c. *the/*that criticizing the manuscript
d. a/the/that criticism of the manuscript
Furthermore, Chomsky argues that derived nominals do not include raising to subject and raising to object.
(5) a. *John's likelihood to win the prize
    b. *our belief of God to be omnipotent
As these are rules of the syntactic component, then derived nominals are not derived in the syntax (see also Newmeyer, 2009 for a more recent discussion of this view on the organization of grammar).3 This split behavior follows, Chomsky argues, if the gerundive nominal involves a transformation that applies to an underlying verbal structure. Two positions are identified and briefly discussed as alternatives: the lexicalist position and the transformationalist position. The latter assumes that all nominal forms are derived by applying transformations (see e.g., Lees, 1960). The former entails that the base rules are enriched to accommodate derived nominals without appealing to transformations. 2 Actually,
this particular claim has been refuted in some later literature. See Alexiadou (2001) for Greek, Borer (2013) for Hebrew, and Fu, Roeper, and Borer (2001), and Bruening (2018) for English, where it is shown that derived nominals allow adverbs. 3 Bruening (2018) provides a series of empirical arguments that both these syntactic rules are allowed in derived nominals. I will come to that in Section 6.8.
Syntax and The Lexicon 101 Chomsky then proposes that in order to capture the selectional similarities between verbs and derived nominals, for example, refuse can be entered in the lexicon with strict selectional and subcategorization features but free from categorial features such as noun and verb. Thus, derived nominals but not gerundive nominals correspond to base structures and not transformed structures. In addition, ing of nominals or nominal gerunds are closer to derived nominals, Chomsky argues. Derived nominals, as non-derived verbs or nouns for that matter, can then be subjected to complex transformations. For instance, they can undergo agent post- posing, yielding, for example, the destruction of the city by the enemy and Noun Phrase (NP) preposing, yielding the city’s destruction. The nominal the city’s destruction by the enemy is, according to Chomsky, only apparently the nominalization of a passive, as the input is not a verbal structure. Agent-postposing is unspecified for gerundive nominals. If both derived nominals and gerundive nominalizations were derived by transformations, there should be no reason to expect a difference between the two. Remarks thus offered a model of the relationship between the lexicon and syntax that distinguished between nominal forms that are derived via transformations of a verbal structure and forms that are not the result of transformations. Building on this, Wasow (1977) proposes the following characterization of syntactic as opposed to lexical rules: (6)
syntactic rules: do not change category labels; may not be local; not subject to exceptions
lexical rules: may change category labels; are local; have idiosyncrasies
Wasow’s empirical domain was that of adjectival as opposed to verbal participles. He pointed out that adjectival participle formation seems to be a lexical rule as it shows several idiosyncrasies, morphological as well as interpretational. For instance, adjectival participles may show special morphology, while verbal passive participles always show regular morphology, (7a–b). Moreover, adjectival participles show idiosyncrasy in meaning, unlike verbal participles (7c–d): (7)
a. The shaven man
b. The man was shaved by John.
c. The hung jury (#Someone hung the jury)
d. *The jury was being hung.
If both types of participles were derived the same way, these differences cannot be explained. To account for this, Wasow proposed that adjectival participle formation is lexical, while verbal passives are syntactic, meaning involving transformations applying to a verbal source.
6.3 Syntax and the lexicon in gb As stated in Borer (2017), most generative approaches have adopted this division of labor and assumed that there is a powerful lexicon that contains a list of lexical entries, each containing a set of instructions to syntax, phonology, and semantics. For Di Sciullo and Williams (1987) syntactic rules do not operate in the lexicon, that is, words are atomic (8): (8)
“Words are atomic at the level of phrasal syntax and phrasal semantics. The words have ‘features’ or properties but these features have no structure and the relations of these features to the internal composition of the word cannot be relevant in syntax” (Di Sciullo and Williams, 1987, p. 49).
Specifically, within the framework of GB, the Projection Principle was responsible for preserving lexical information in syntactic derivations. (9)
Lexical information is syntactically represented.
The behavior of verbs in argument alternations is perhaps the clearest illustration of the very complex interaction between syntax and the lexicon. However, how we view this relationship changed through the years, echoing Chomsky’s observation about the consequences that the enrichment of one module has for the organization of grammar. In GB, for example, each verbal entry carried information about the number and type of arguments that each verb can combine with as well as the thematic roles these are associated with: (10)
kill:
1 NP Agent
2 NP Patient
In other words, every verb is associated with a predicate-argument structure (AS), which not only gives information about the number of arguments this verb has but also information about how the arguments are projected onto syntax, for example, as internal or external arguments (Marantz, 1984; Williams, 1981). AS is not a syntactic layer; rather, it is part of the lexical information of any given verb. The syntactic encoding of arguments is regulated by principles such as the Universal Alignment Hypothesis of Perlmutter and Postal (1984) in (11), the Uniformity of Theta Assignment Hypothesis (UTAH) in (12) (Baker, 1988; Pesetsky, 1995), and the Linking Rules in Levin and Rappaport Hovav (1995): (11)
Universal Alignment Hypothesis There exist principles of universal grammar which predict the initial relation [= syntactic encoding] borne by each nominal in a given clause from the meaning of the clause.
(12)
UTAH Identical theta relationships between items are represented by identical structural relationships between those items at the level of D-structure.
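As a rough schematic of this division of labor (our own illustration, not a formalism from this literature), the GB lexicon stores a theta-grid per verb, and a uniform linking procedure in the spirit of (11) and (12) maps roles onto syntactic positions:

```python
# A rough schematic of the GB picture (our illustration only): the lexicon stores
# a theta-grid per verb, and one uniform linking principle (in the spirit of the
# Universal Alignment Hypothesis and UTAH) maps thematic roles onto positions.

LEXICON = {
    "kill":  [("Agent", "NP"), ("Patient", "NP")],   # the entry in (10)
    "break": [("Agent", "NP"), ("Theme", "NP")],     # illustrative entry
}

LINKING = {                      # one fixed role-to-position mapping for all verbs
    "Agent": "external argument",
    "Patient": "internal argument",
    "Theme": "internal argument",
}

def project(verb):
    """Project a verb's stored argument structure onto syntactic positions."""
    return [(role, cat, LINKING[role]) for role, cat in LEXICON[verb]]

print(project("kill"))
# [('Agent', 'NP', 'external argument'), ('Patient', 'NP', 'internal argument')]
```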
While in GB, verbs are considered to project verb phrases that contain internal arguments, that is, they are not syntactically complex; it is possible to conceive of verbs as being complex in terms of event structure (see for example, Grimshaw, 1990). An independent question that arises is whether this event complexity is also structurally represented or simply part of the lexical information associated with particular verbs. Typically, in work carried out within the GB model, no such decomposition is assumed in the syntax. This is very different from what was assumed earlier by Generative Semanticists who provided very rich syntactic representations including information on event sub-components, for example, Lakoff (1971). Later work on lexical semantics has shown that we can distinguish between verbs such as kill, which have a complex event structure and sweep, which have a simple event structure. This means that different event structures must be assumed for the different aspectual classes of verbs, building on Vendler (1967). A predicate that is associated with a complex event structure, that is, an accomplishment, will always be transitive, while simple event structure verbs, that is, activities, may be intransitive. The corresponding event structures are given in (13c–d) respectively, following Rappaport Hovav and Levin (2001), where we see that verbs split into two main types, manner vs. result: (13)
a. John swept.
b. John broke the vase.
c. [x ACT ]
d. [[x ACT] CAUSE [BECOME [y ]]]
The model illustrated here is a primitive-based model of verb meaning (see Jackendoff, this volume). Such models assume that the meaning of a given verb can be defined through the primitives into which it is decomposed (cf. Dowty, 1979). By contrast, relation-based theories (e.g., Fodor, 1975) argue that there is no need for such decomposition; see Pustejovsky (1991) for detailed discussion of the difference between primitive-based and relation-based theories of word meaning. In Rappaport Hovav and Levin, principles such as the Argument realization principle (ARP) in (14) regulate the projection of arguments: (14)
There must be one argument XP in the syntax to identify each sub-event in the event structure template. (Rappaport Hovav and Levin, 2001, p. 779)
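A toy rendering of (14) (ours, purely for illustration) makes the consequence explicit: a template with two sub-events requires at least two argument XPs, and hence a transitive construal, while a single-sub-event template does not. The template strings below are simplified stand-ins for (13c-d).

```python
# A toy rendering of the Argument Realization Principle in (14); our illustration.
# One argument XP is required per sub-event, so bi-eventive templates like (13d)
# force a transitive construal, while mono-eventive ones like (13c) need not.

EVENT_TEMPLATES = {                       # simplified stand-ins for (13c-d)
    "sweep": ["x ACT"],
    "break": ["x ACT", "BECOME y <STATE>"],
}

def minimum_argument_xps(verb):
    return len(EVENT_TEMPLATES[verb])     # one XP identifies each sub-event

for verb, template in EVENT_TEMPLATES.items():
    n = minimum_argument_xps(verb)
    print(f"{verb}: {n} sub-event(s) -> {'transitive' if n >= 2 else 'may be intransitive'}")
```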
Basically, this means that complex events translate as transitive predicates while simple events are intransitive predicates, that is, activities. Importantly, only result verbs
104 Artemis Alexiadou undergo transitivity alternations, while manner ones do not. It is crucial to point out that Rappaport Hovav and Levin (2001) did not claim that this event complexity is syntactically real; it simply enforces a transitive construal for verbs associated with complex event templates. This point was made earlier in Grimshaw’s work, which takes the aspectual properties of predicates to be one factor in determining AS realization, the other one being the thematic hierarchy. However, once VP-shells were introduced (Larson, 1988), theta-roles were no longer forced to occur in unique positions. In addition, the introduction of elaborated functional structures above the lexical layer (Pollock, 1989; Ouhalla, 1988) led researchers to rethink ways verbal meaning is put together in the syntax. The pioneering work of Hale and the Keyser (1993a, b) proposes several structural types that build the area of lexical syntax, that is, the part of the syntactic structure that excludes the functional layers. This work influenced many linguists who then proposed syntactic models of event structure where the event sub-parts in (13c–d) are associated with syntactic verbal projections. Such models will be discussed in Sections 6.5–6.7. These models echo work within Generative Semantics on the syntactic reality of primitives of word meaning. Unlike Generative Semantics, however, contemporary syntactic models of event decomposition do not rely on complex transformations to yield surface structure. Moreover, there is some debate concerning the number of primitive units that are syntactically represented, that is, CAUSE, BECOME, DO (see Dowty, 1979 and Jackendoff, this volume). The issue to which extent lexical information about verbs determines their syntactic behavior has been controversially discussed. For example, Borer (2005b) argued that the lexical semantics of the verb has limited influence on the syntactic behavior of the arguments or the semantic interpretation of the clause. If verbs have variable argument realizations, then it is unlikely that this can be captured by multiple lexical entries. In fact, the more variability a verb allows, the more likely it is that the syntactic context contributes to its interpretation (see also Borer, 2017 for a recent discussion). To this end, Borer proposed that if verbs can be used in ways that are not predicted by their lexical structure, then the lexicon must be minimized. A much-cited example to illustrate this point is the following one, from Clark and Clark (1979): (15)
a. The fire stations sirened throughout the raid.
b. The factory sirened midday and everyone stopped for lunch.
c. The police sirened the Porsche to a stop.
d. The police car sirened up to the accident.
e. The police car sirened the daylights out of me.
As Borer (2017, p. 127) emphasizes “if the syntax was determined by listed insertion frames, we would need five different insertion frames for siren, of which at least four would convey interpretational information that cannot be deduced from sounding sirens alone.” By contrast, authors such as Levin and Rappaport Hovav argue that flexibility of verbal meaning is the result of association with different and more than one event frame.
Syntax and The Lexicon 105 In other words, a particular verb can be associated with either a mono-eventive or a bi-eventive structure, (13c–d), yielding distinct interpretations. This is also, to some extent, a feature of Rachmand’s (2008) system that allows verbs to carry specification for lexicalization for more than one event (sub-)structure, and compare with the notion of constructions in, for example, Goldberg (1995). Dowty (1979) offers a more compositionally rigorous version of types of event structures, how they are related, and how verbs with one can be derived from another (see also Pollard and Sag (1994) and Wunderlich (1997) for other approaches from a lexicalist perspective within Head- driven Phrase Structure Grammar (HPSG) and Lexical Decomposition Grammar, respectively; see also Jackendoff, this volume). As summarized in Alexiadou (2015), several authors have pointed out that semantically similar verbs may behave differently across languages (see also Doron, 2003). A case in point here is the verb kill; while in English it cannot enter the causative alternation, it does in Hebrew (Reinhart, 2002) and in Greek (Alexiadou, Anagnostopoulou, and Schäfer, 2006, 2015). Second, a given verb within a language may have multiple syntactic realizations, as is the case with the verb open (Rosen, 1996), which can have both transitive and intransitive variants. Finally, semantically similar verbs may allow different mappings, for example, kill and break both contain a complex event structure, but only the latter is licit in the causative alternation in English (see also Rosen, 1996). Moreover, already in the 1980s it was suggested that somehow external arguments are special, see for example, Marantz (1984) and Williams (1981). For instance, as Marantz (1984) points out, external arguments are excluded from idiomatic interpretations and the thematic role of such arguments is computed on the basis of the VP containing the verb and its internal argument (e.g., kill an audience, kill a bottle etc.). Building on this, Kratzer (1996) provided a series of arguments as to why external arguments should be severed from lexical meaning. Kratzer used adjectival participle formation and nominalization as an illustration that while external arguments are not part of the lexical meaning, internal ones are. Her arguments are as follows: in (17) the deverbal noun appears with its internal argument and in this particular context, it must appear with this argument. This has been extensively discussed in the work by Grimshaw (1990), who argued that deverbal nouns are ambiguous between readings that do not support AS and readings in which internal arguments are obligatory. I will come back to that in Section 6.6. The important observation here is that the external argument is not obligatory. This is so as, according to Kratzer, the external argument is not part of the verbal structure that is nominalized. (17)
The frequent examination of students is boring.
This in turn means that the nominalization can apply to a structure that lacks an external argument (Alexiadou, 2001). A similar argument was put forth for adjectival participles: in (18a) a reflexive reading of the participle is possible, that is, it can be that the children secured themselves. However, this self-action reading is out in (18b), the reason
106 Artemis Alexiadou being that (18b) is a verbal passive containing an implicit external argument. This test, discussed in Baker, Johnson, and Roberts (1989), diagnoses the presence of an implicit external argument. Since this is only possible in the verbal passive, the conclusion is that adjectival passives lack such an argument. I will come back to this in Section 6.8. For Kratzer, the availability of the self-action reading in (18a) suggests that the structure has not yet combined with an external argument. Thus, external arguments must be severed from the verbal meaning, while internal ones are part of the lexical meaning. (18)
a. The children are secured with a rope.
b. The children are being secured with a rope.
This now imposes non-uniformity on argument realization. But why should external arguments be privileged? This is a point that has been criticized in the literature (see e.g., Wechsler (2005, 2020) and Müller and Wechsler (2014)). Two points emerged from this work. First, it is argued that Kratzer’s claim that severed arguments preclude verb specific thematic roles is technically problematic since event variables and verb specific meaning postulates can deal with this issue. Second, the richness of individual thematic roles put forth in Dowty (1991) would require either a proliferation of flavors of argument taking and/or categorizing heads and meaning postulates, complicating the truth conditional semantics significantly. Indeed, the syntactic decomposition of event structure essentially forces a compositional semantic decomposition that makes it harder rather than easier to deal with fine-grained, word specific truth conditions with extra machinery. Furthermore, the specific idea that subjects (qua external arguments) are excluded from idiom formation has been challenged as far back as Nunberg, Sag, and Wasow (1994). Finally, some further work has even argued against event structural approaches at large (lexicalist or syntactic) on broader grounds (see discussion in Dowty (1979: Chapter 2), against certain aspects of Generative Semantics, and Koenig and Davis (2006), and Beavers (2010) on the broader notion).4 In what follows, I will offer an illustration of how the relationship between syntax and the lexicon is viewed from the perspective of the framework of DM. I will first introduce the basic properties of this model; see also Embick et al., this volume, and then turn to a discussion of verbal AS and alternations, nominalization, and adjectival participles. In the area of structural event decomposition, the view of DM will be contrasted to two other models, with which it shares some, but not all, assumptions about decomposition of verbal meaning. This choice of models is based on two criteria: first, work within DM shares with Borer (2005a, b, 2013) the idea that roots are the core units of word formation and combine with functional structure. Second, some work within DM, crucially the work that does not adopt the flavors of v perspective (Folli and Harley, 2005), shares Ramchand’s view of the decomposition of
4 I am indebted to an anonymous reviewer for very specific comments on this part.
Syntax and The Lexicon 107 causation, although Ramchand herself does not commit to acategorial roots being the basic units of word formation. The reason I chose to focus on DM is that it revives the basic proposal in Remarks, namely the absence of categorial features on nouns and verbs and their contextual determination.
6.4 Distributed morphology Marantz (1997) took issue with the division of labor common to GB and other theories of grammar. The basic idea is that one does not need to assume that there is a special component of the grammar such as the lexicon, where words are stored. Rather, words are created in the syntax out of the most basic units—roots—in combination with functional layers. According to Marantz, “what’s idiosyncratic is the relationship between the nominalizations and any ‘sentence’ they might be derived from” (Marantz, 1997, p. 215). Marantz reaffirms the argument in Remarks that elements could be acategorial, and the context in which they appear determines category as well as morphophonological shape, for example, destroy vs. destruction. Although that was not the claim in Marantz (1997), in later work it was proposed that roots become categorized by light heads, that is, v and n in (19), as will be detailed below. (19)
This shifts the perspective on the lexicon and processes of word formation in general in that the lexicon is distributed across three lists: list 1, which contains roots and bundles of grammatical features, list 2, which provides the phonological form for roots and other morphemes, and list 3, the Encyclopedia, which contains a list of special meanings. Within this framework, there are, in fact, two places for word formation: words can be formed out of roots, that is, involving the combination of a categorizer, or derived from other words (Arad, 2003; Embick, 2010a, b). The first level of word formation— the root cycle—is the level where morphological and semantic idiosyncrasies may be observed. The second level of word formation—the outer cycle—is the place where all
108 Artemis Alexiadou word formation is morphologically and semantically transparent. Roots obligatorily combine with categorizers at the first step of derivation: (20)
Roots are categorized by combining with category defining functional heads (Embick, 2010, p. 21).
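Schematically, the architecture just described can be rendered as follows; this is our own illustration, and the individual entries, including the forms given for destroy and marriage, are invented and simplified purely for exposition.

```python
# A purely schematic rendering (ours) of the three lists over which Distributed
# Morphology distributes the work of the traditional lexicon. Entries are
# invented for exposition and simplified (e.g. the forms for destroy/destruction).

LIST_1 = {                          # syntactic atoms: roots and feature bundles
    "roots": ["√DESTROY", "√MARRY"],
    "functional": ["v", "n", "[+past]"],
}

LIST_2 = {                          # Vocabulary: phonological forms for List 1 items
    ("√DESTROY", "verbal context"): "destroy",
    ("√DESTROY", "nominal context"): "destruct-",    # as in destruct-ion
    ("√MARRY", "nominal context"): "marri-",         # as in marri-age
}

LIST_3 = {                          # Encyclopedia: special, non-compositional meanings
    ("√MARRY", "n"): "marriage can name the institution, not just a marrying event",
}
```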
According to Embick, gerundive nominals are a clear example of an outer-cycle derivation, while derived nominals of a root cycle derivation. As shown in (21), the gerundive nominal marrying contains a verb and thus refers to an event of marrying, while the nominal marriage is root derived: (21)
(22)
Importantly, however, if roots are involved in word formation as the basic units that combine with functional layers, categorizers first, and then other layers of extended projection, a question emerges regarding AS. As we saw in GB, a verbal lexical entry contained information about the number of arguments a verb has and how these are projected in the syntax as external or internal arguments. Within DM, however, verbs are no longer primitives; they are derived from a combination of roots and categorizers. While in English, such categorizers may be null (but cf. Borer, 2013), they often receive lexicalization by distinct pieces; for example, -en in (23) realizes a v head: (23)
In early work within DM, it was assumed, as in Remarks, that acategorial roots carry selectional restrictions and seem to introduce the internal argument of a verb/noun. In other words, internal arguments are root arguments (Harley, 2014). Ideally, however, if verbs composed out of more primitive units, it cannot be the case that roots introduce arguments. Models of syntactic event decomposition are thus necessarily led to propose that arguments are introduced by some functional layers and are not part of a lexical entry, that is, the root. To this end, Borer (2005b) and Lohndal (2014) have argued that all arguments must be severed from the lexical core. One argument
Syntax and The Lexicon 109 provided by Borer (2005b) relates to the behavior of reflexive datives as opposed to possessor datives in Hebrew adjectival and verbal passives. Specifically, as discussed by Borer (2005b), in Hebrew, the two datives behave very differently: as shown in (24– 27), the reflexive dative, lo, is licensed in adjectival passives, but not in verbal passives, while the possessor dative, li, shows the opposite behavior. In Hebrew, a dative can be interpreted as a possessor only in the case of unaccusatives, as argued for in Borer and Grodzisky (1986), who concluded that possessor datives are associated with internal arguments only. By contrast, Borer and Grdozisky (1986) argued that, as reflexive datives are associated with external arguments only, they are accepted with unergative predicates. What (24–27) show is, as Borer points out, that the licensing of a possessor/ reflexive dative is sensitive to external vs. internal nature of the argument. If adjectival passives only have internal argument as in Kratzer’s system, and internal arguments are projected within the VP both in active and passive sentences, the observed asymmetry cannot be explained: (24)
Adjectival passive, possessor dative
     *ha.xeder haya mequšat li be-praxim
     the.room was decorated.a.pass to.me with-flowers
(25) Adjectival passive, reflexive dative
     ha.xeder_i haya mequšat lo_i be-praxim
     the.room was decorated.a.pass to.it in-flowers
     'The room was decorated with flowers.'
(26) Verbal passive, possessor dative
     ha.xeder haya qušat li be-praxim
     the.room was decorated.v.pass to.me with-flowers
     'My room was decorated with flowers.'
(27) Verbal passive, reflexive dative
     *ha.xeder haya qušat lo be-praxim
     the.room was decorated.v.pass to.it with-flowers
Borer then concludes that internal arguments must also be severed from the verbal predicate; see also Alexiadou (2014a) for a summary of this discussion. If now all arguments are severed from verbal meaning, and actually, if verbal meaning is determined on the basis of syntactic configurations, arguments must be introduced by a particular type of structure that is included within these configurations. There are several such proposals (see Alexiadou, 2015 for an overview). In what follows, I will discuss two proposals within DM, which I will then contrast to Ramchand’s and Borer’s models.
6.5 Argument introducing heads in dm 6.5.1 External arguments Let us now take a closer look at argument introduction within DM. Research within this framework assumes that the root does not introduce the external argument and adopts the Voice hypothesis, introduced in Kratzer (1996), who, as discussed above, severed the external argument from the VP it combines with: (28)
The Voice hypothesis: Voice is responsible for the introduction of external arguments. The same head introduces a Determiner Phrase (DP) in the active and licenses a Prepositional Phrase (PP) in the passive.
Naturally the idea that the external argument is introduced in a layer different from that introducing the internal argument has been put forth in earlier work by, for example, Larson (1988) and Hale and Keyser (1993). VoiceP combines with vP, which following Pylkkänen (2002, 2008) introduces causative semantics. As we will see, Borer (2005b), Ramchand (2008), and Alexiadou et al. (2015), among others, further subdivided the complement of the external argument introduction projection (see also Travis, 2010). Let us take the causative alternation to illustrate how this model works. The examples in (29) show that a verb like break can appear in two structural configurations: (29a) is a transitive causative variant, while (29b) is an intransitive, anticausative variant. The two variants are related, as (29a) roughly means that John caused the window to break, see Levin and Rappaport Hovav (1995). (29)
a. John broke the window.
b. The window broke.
What are arguments in favor of the view that event decomposition takes place in the syntax? Alexiadou et al. (2015) follow von Stechow (1996), who made this point on the basis of the ambiguity, with adverbs like again. As von Stechow showed, the interpretation of again is influenced by word order. Let us discuss this argument in some detail: again shows an ambiguity in causatives and anticausatives alike. In both (30a–b), we have both a so-called restitutive and repetitive reading, while in (30c), we only have a repetitive reading. On the restitutive reading, (30a) means that John causes the door to return to its previous state of being open, and in (30b) again scopes over the result state. On the repetitive reading, (30a) means that the subject opens the door and it is presupposed that John had done this before and (30b)
Syntax and The Lexicon 111 again scopes over the change of state event. (30c) has only a repetitive reading. von Stechow then argued that since the placement of again influences the interpretation of the sentence, event decomposition takes place in the syntax. This in turn means that (8) cannot be maintained as is, as the internal composition of the verb is relevant to the syntax. (30)
a. John opened the door again.
b. The door opened again.
c. Again, John opened the door. / Again, the door opened.
On the basis of this argument then, (29a–b/30a-b) are similar in that they contain the result of an event as well as the causative event. Building on this, Alexiadou et al. (2015) argued that the two examples are similar in that they also both contain a causative component; the difference between them is that in (29a/30a) the agent of the event is also included. To this end, they proposed that causative transitive verbs, for example, break and open next to a ResultP contain a VoiceP and a vP. Transitivity alternations of the type mentioned in Section 6.2 are actually Voice alternations in the sense that English intransitive variants of causative verbs lack Voice; see (31b), while transitive variants include Voice (31a) (see Alexiadou et al. (2006, 2015) for details). In the structures in (31), the lowest sub-component is illustrated as root/ ResultP to indicate the point that, for example, John opened the door/The door opened is built on the basis of a result root in the sense of Rappaport Hovav and Levin (2012). A ResultP is involved in the cases of secondary resultative predication as in John hammered the metal flat, where the metal flat constitutes a ResultP (Embick, 2004; Ramchand, 2008). (31)
The argument that the intransitive variant of a causative verb in (29b) contains an event sub-component as well was made on the basis of modification via prepositions that
112 Artemis Alexiadou modify the causing event. In English, the preposition from can introduce causers in the context of anticausatives, suggesting that intransitive variants of predicates entering the causative alternation are also semantically causative. (32)
The window broke from the wind.
Alexiadou et al. (2015) noted that this type of modifier is also licit with verbs that do not enter transitivity alternations, such as wilt. While such verbs do not appear in transitive agentive construals (33a), they can be modified by modifiers of the causing event, (33b):
(33) a. *The gardener wilted the plants.
     b. The plants wilted from the sun.
This suggests that causatives, anticausatives, as well as wilt type verbs, labeled internally caused change of state verbs by Levin and Rappaport Hovav (1995), all contain the layers in (31b). Only transitive causative verbs are built on the basis of (31a). Alexiadou et al. (2015, p. 50), building on Ramchand (2008) and Embick (2004), describe the structure in (31b) as follows: v can express an unspecified and unbounded event (a Process in Ramchand’s, 2008 terms). Adjectives and prepositions also introduce states as do stative roots like √open and √cool. Syntax can build complex event structures out of these atomic parts.
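Since the trees in (31) are not reproduced here, the layering they encode can be sketched schematically as follows; this is our own rendering of the description just quoted, with labels and fillers chosen for illustration only.

```python
# A schematic sketch (ours) of the decomposition assumed in (31a-b), with each
# projection recorded as a (label, specifier, complement) triple. The transitive
# causative adds a Voice layer with the external argument; the anticausative lacks it.

result_p      = ("ResultP", "the door (theme)", "ROOT_OPEN (result state)")
v_p           = ("vP", None, result_p)                        # unspecified causing event
anticausative = v_p                                           # (31b): no Voice layer
causative     = ("VoiceP", "John (external argument)", v_p)   # (31a)

def spine(projection):
    """Collect the projection labels from the top of the structure down."""
    labels = []
    while isinstance(projection, tuple):
        label, _specifier, complement = projection
        labels.append(label)
        projection = complement
    return labels

print(spine(causative))       # ['VoiceP', 'vP', 'ResultP']
print(spine(anticausative))   # ['vP', 'ResultP']
```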
While the focus of Alexiadou et al. (2015) was the causative alternation, the implicit assumption is that verbs that are not causative, for example, unergative verbs such as run, only contain the Voice and v layer and lack a ResultP, and thus, a causative structure. As mentioned above, they may combine, however, with secondary resultative structures as detailed in Embick (2004). For example, John hammered the metal flat, which in turn would yield a complex event structure of the type seen in the work of Rappaport Hovav and Levin (2011). In that case, the ResultP is actually an Adjective Phrase (AP), introducing the internal argument. In sum, in the structures in (31a–b), as we will see below in Ramchand’s (2008) work, the combination of v and a result phrase yields a causative resultative structure (see also Embick, 2004). In the absence of the result phrase, the structure is interpreted as an activity. Folli and Harley (2005) put forth a rather different proposal, according to which, v is the locus of introduction of the external argument. Importantly, v can be of different types, vDO and vCAUSE. The former combines with an agentive subject and can take an incremental theme as a complement, while the latter requires a causer subject and a state as its complement. Folli and Harley argue that this explains certain restrictions in the behavior of consumption verbs in Italian and English. Folli and Harley were concerned with the contrast between *The sea ate the beach and The sea ate away the beach. The latter examples contain the particle away, which introduces a
Syntax and The Lexicon 113 result state in the structure, which in turn makes the presence of the inanimate subject acceptable. For Folli and Harley, the different types of v impose different restrictions on their subjects and their complements, and can account for this type of variability (cf. Schäfer, 2012).
6.5.2 Internal arguments In the previous section, I introduced the view that external arguments are introduced in VoiceP; the decomposition offered in terms of v combining with a ResultP suggests that internal arguments may be introduced via PPs or APs, which are alternative realizations of ResultP. However, Harley (2014) argues that internal arguments are root arguments, as does earlier work in DM. Alexiadou and Schäfer (2011) argued that the locus of internal arguments is actually variable and depends on the complex internal structure the verbal predicate may have. Specifically, the theme argument of anticausatives (and change of state verbs more generally) is introduced in Spec,vP, unlike the theme argument of pure unaccusatives (i.e., change-of-location verbs), which occupies Spec,ResultP. As is well known, there are systematic differences between the two classes of verbs when it comes to there inversion, which can be accounted for on the basis of distinct structures for the two classes (Levin, 1993). (34)
a. There arrived a man.
b. *There broke a vase.
Alexiadou and Schäfer assumed that there is merged at Spec,VP, following Richards (2006), and argued that insertion of there is blocked in (34b). Thus, in (34b) the two elements, that is, there and the DP argument, compete for the same position. This is the case for break-type anticausatives, but not for arrive-type unaccusatives, as shown in (35). The two structures differ concerning the position where the theme argument is merged, but they are both bi-eventive; either this is merged as the argument of the lower-event small clause or as the argument of the higher-event verb. (35)
a. b.
[vP there [ResultP theme]] [vP *there/theme [ResultP]]
arrive/change of location break/change of state
Their argument was based on scope data involving again and indefinite arguments, as in (36), building on Dobler (2008):
(36) a. A bear appeared in Bavaria again.
     b. #A popsicle melted again.
While in (36a) the adverb takes scope over the indefinite, this is not possible in (36b). This can be explained by assuming that in (36b) the DP argument is above the ResultP, as in (35b), while in (36a) the DP is within the ResultP, as in (35a). We now have the following argument-introducing heads, which are responsible for the two core argument roles: Voice introduces external arguments (prototypically agents), and ResultP (AP/PP) introduces “internal arguments” (themes/patients). When it comes to goal arguments, it is assumed that these are introduced by high vs. low Applicative heads; see Pylkkänen (2002, 2008), but compare with Jerro (2016) for arguments against this distinction. Figure arguments in path/location structures are introduced in pP (Svenonius, 2007 and subsequent work). More recently, Wood and Marantz (2017) suggest that there are striking similarities among distinct types of argument-introducing heads, summarized in (37):
(37) a. voice introduces the external arguments of vPs (often agents)
     b. low appl introduces an argument related to a DP
     c. little p introduces the external arguments of PPs (figures)
     d. prepositions introduce non-core arguments in a manner syntactically distinct from high appl
     e. high appl introduces a non-core argument
They propose that these five heads can actually be reduced to one single argument introducer, namely i*, occurring in distinct syntactic contexts. In their system, i* receives a particular semantic interpretation depending on the context it is merged in; in particular, it assigns to the second constituent it is merged with the thematic role implied by the first constituent it is merged with.
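The following is a minimal, purely illustrative sketch (in Python) of the context-dependence just described; it is not part of Wood and Marantz’s (2017) formalism, and the category labels and role names in the mapping are placeholders chosen for exposition only.

```python
# Toy illustration: one argument introducer i* whose interpretation depends on
# the category of the constituent it merges with first (its complement).
ROLE_BY_COMPLEMENT = {
    "vP": "agent",             # i* above a vP behaves like Voice
    "DP": "applied argument",  # i* above a DP behaves like low appl
    "PP": "figure",            # i* above a PP behaves like little p
}

def interpret_i_star(complement_category: str, specifier: str) -> str:
    """Assign to the specifier (second-merged constituent) the role implied
    by the complement (first-merged constituent) of i*."""
    role = ROLE_BY_COMPLEMENT.get(complement_category, "non-core participant")
    return f"{specifier} is interpreted as the {role} of its {complement_category} complement"

print(interpret_i_star("vP", "Mary"))      # Mary as agent
print(interpret_i_star("PP", "the book"))  # the book as figure
```

The point of the sketch is simply that a single element can yield different “thematic” effects once its interpretation is made relative to its syntactic context.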
6.5.3 Voice bundling

On the view that Voice is the locus of the external argument, Voice and v are two distinct projections. However, this split is a matter of controversy, and it has been proposed in the literature that it is subject to parametric variation (Pylkkänen, 2008): in some languages, Voice and v are bundled. A recent discussion of the bundling parameter is offered in Harley (2017). Specifically, Harley (2017, p. 16) compared Hiaki to Persian. In Hiaki (38), verbalization and causative or inchoative semantics are simultaneously encoded by a single morpheme, which is analyzed as a clear realization of v:
(38) a. Maria vaso-ta ham-ta-k
        Maria glass-acc break-tr-prf
        ‘Maria broke the glass.’
     b. Uu vaaso ham-te-k
        the-nom glass break-intr-prf
        ‘The glass broke.’
In order to form a passive in Hiaki, the passive suffix -wa is stacked outside -ta:
(39) Uu vaaso ham-ta-wa-k.
     the-nom glass break-tr-pass-prf
     ‘The glass was broken/Someone broke the glass.’
This is in line with the structure in (31a); the morphological decomposition of Hiaki verbs suggests that Voice and v are realized by distinct elements. By contrast, in Persian, passive-like structures are not built on top of agentive structures. Basically, the passive and agentive structures are in an “equipollent relation” (Harley, 2017, p. 8): instead of passivizing the light verb in (40a), a different light verb is used (40b). The causative-anticausative alternation patterns in the same way (see (41)):
(40) a. Minu bachcha-ro kotak zad
        Minu child-râ beating hit
        ‘Minu hit the child.’
     b. Bachche kotak xord
        child beating collided
        ‘The child got hit.’
(41) a. âb be jush âmad
        water to boil came
        ‘The water boiled.’
     b. Nimâ âb-ro be jush âvard
        Nima water-râ to boil brought
        ‘Nima boiled the water.’
Based on the above contrasts, Harley argues that there are two types of languages, which come with distinct properties, illustrated in (42):
(42) a. Voice-bundling language:
        (i) has relationship between verbalizing morphology and Agent introduction
        (ii) can have relationship between internal case checking and Agent introduction
        (iii) has a single position of exponence for verbalizing, causativizing, inchoative, and “passivizing” morphology.
     b. Voice-splitting language:
        (i) has agglutinating (“stacking”) passive morphology
        (ii) can have high applicatives
        (iii) can show causative morphology in the absence of a syntactic causer argument.
The fact that Voice and v are realized by distinct elements suggests, according to Harley (2017), that Hiaki is a Voice-splitting language. However, since in Persian Voice and v are realized via a single exponent, Persian is a Voice-bundling language.
6.5.4 Two types of non-active Voice

As we have just seen, an argument to diagnose a Voice-bundling language is the observation that in, for example, Persian, the same head is responsible for passive, anticausative, and agentive structures. Such a state of affairs has been described for Greek (see Alexiadou et al., 2015). In Greek, Voice surfaces with active and non-active (NAct) morphology. Active morphology appears in transitive verbs and a sub-class of anticausatives, while NAct morphology appears on so-called marked anticausatives (43), passives, reflexives, and dispositional middles (44) (Tsimpli, 1989, 2006; Embick, 1998; Alexiadou and Anagnostopoulou, 2004; Zombolou, 2004; Lekakou, 2005; Alexiadou et al., 2006, 2015; Alexiadou and Doron, 2012; Zombolou and Alexiadou, 2014, and others):
(43) a. o Janis ekapse ti supa                      causative
        the John-nom burned-act the soup.acc
        ‘John burnt the soup.’
     b. i supa kegete                               marked anticausative
        the soup.nom burns-nact
        ‘The soup is burning.’
(44) a. To vivlio diavastike ktes                   passive
        the book-nom read-nact yesterday
        ‘The book was read yesterday.’
     b. I Maria htenizete                           reflexive
        the Mary-nom combs-nact
        ‘Mary combs herself.’
     c. Afto to vivlio diavazete efkola             dispositional middle
        this the book-nom reads-nact easily
        ‘This book reads easily.’
Applying Harley’s diagnostics to Greek, we see that Voice and v can be realized by distinct exponents: specifically, Greek has many verbalizers, such as -iz-; see Alexiadou (2009), Anagnostopoulou and Samioti (2014), and Panagiotidis et al. (2017). Importantly, non-active Voice morphology is never in complementary distribution with these verbalizers; rather, Voice morphology attaches above these verbalizers. This is shown in (45), where the Greek counterpart of the verb affirm contains an overt verbalizer, and the NAct affix attaches outside this verbalizer (45b):
(45) a. veve-on-o
        certain-v-1SG
     b. veve-o-thik-a
        certain-v-NACT-1SG
In view of the fact that Greek has a distinct position for the exponence of non-active but behaves like a splitting language, we need an alternative explanation that captures both the single exponence and the Voice-v splitting properties of the language. Doron (2003) introduced a somewhat different typology that was adopted in Alexiadou et al. (2015) to account for these facts. Specifically, Doron (2003) and Alexiadou and Doron (2012) argued that there are two types of non-active Voice heads across languages: Passive and Middle. Building on this, Alexiadou et al. (2015) argued that Passive attaches above VoiceP in languages such as English and German. The result is that in these languages any transitive verb can passivize; Middle, by contrast, is the non-active counterpart of Kratzer’s active Voice. A way to account for the presence of NAct morphology in Greek is put forth in Embick (1998), who proposes that in case an external-argument-introducing head lacks a specifier, this is realized as NAct in the morphology. The lack of a specifier is represented as [-D] in (46d):
(46) a. [vP [ResultP]]                                      active
     b. [VoiceP DP [vP [ResultP]]]                          active
     c. [PassiveP [VoiceP DP [vP [ResultP]]]]               English/German passive
     d. [MiddleVoiceP [-D] Nact [vP [ResultP]]]             Greek NAct
From this perspective, Greek NAct Voice is actually Middle Voice (see also Spathas, Alexiadou, and Schäfer, 2015), and this type of morphology appears in a variety of environments, as seen in (44).
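Purely as an illustration of Embick’s realization idea (and not as a claim about the actual implementation of his proposal), the spell-out condition in (46d) can be sketched as a simple rule in Python; the data representation below is invented for exposition.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceHead:
    specifier: Optional[str]  # a DP such as "o Janis", or None to model [-D]

def spell_out(voice: VoiceHead) -> str:
    """A Voice head lacking a specifier ([-D]) is realized as non-active (NAct);
    a Voice head with a DP specifier is realized as active."""
    return "active" if voice.specifier is not None else "non-active (NAct)"

print(spell_out(VoiceHead(specifier="o Janis")))  # active, cf. (43a)
print(spell_out(VoiceHead(specifier=None)))       # NAct, cf. (43b) and (44)
```

The single rule covers the whole range of NAct environments in (44), since what they share is simply the absence of a specifier of Voice, not a common meaning.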
6.5.5 Agents vs. causers

I have mentioned above that external arguments are introduced in Voice. As is well known, external arguments can bear a variety of thematic roles, that is, agents, causers, or experiencers. The question is whether all types of external arguments are introduced in Voice or not. Typically, agents differ from causers in terms of intentionality: for example, while John may have the intention to break the window, the wind does not. Ramchand (2008) subsumes both roles under the general term initiator (see Section 6.6), but see Ramchand (2018). Schäfer (2012) argued that the two have
distinct thematic licensers: causers are licensed in v, while agents are licensed in Voice, but they have an identical syntax, that is, they are both introduced in Voice. This idea that v is involved in thematic licensing of causers builds on Solstad (2009), who proposed that causers are not event participants but modifiers of the causing events. Alexiadou (2014b, 2018) and Alexiadou and Anagnostopoulou (2019) adopt this and argue that causers and agents have a distinct syntax, at least in Greek. Specifically, causer subjects are introduced in vP, while Voice introduces agents. This conclusion was reached by looking at the properties of predicates such as wilt as well as object experiencer verbs, which allow transitive construals but only with causer subjects and resist passivization. Let me briefly illustrate the first point. I mentioned above while discussing (33) that internally caused change of state predicates such as wilt do not appear in transitive construals. However, McKoon and Macfarland (2000), as well as Wright (2001, 2002), noted that these verbs can have transitive counterparts, as shown in (47), taken from Wright (2002).
(47) a. Salt air rusted the metal pipes.
     b. Early summer heat wilted the petunias.
As Wright (2001, 2002) pointed out, in English these transitive construals differ from verbs such as break, which enter the causative alternation, in that they only allow causer subjects and disallow agents. A similar state of affairs has been described by Alexiadou (2014b) for Greek. In both languages, while these verbs appear in transitive sentences, they do not allow passivization. To explain this, assuming that passivization is contingent on the presence of VoiceP in the structure, Alexiadou (2014b) argued that transitive internally caused change of state verbs that do not allow passivization involve causer subjects that are introduced at the layer of vP and not in VoiceP, as in (48). By contrast, agents are introduced in Voice, as in (49):
(48) [tree structure not reproduced here: causer subject introduced in Spec,vP]
(49) [tree structure not reproduced here: agent introduced in Spec,VoiceP]
This analysis was extended to object experiencer verbs in Greek, which allow causer subjects but do not allow passivization (see Alexiadou (2018) and Alexiadou and Anagnostopoulou (2019)).
6.6 First phase syntax

There are other approaches that decompose verbal meaning and assume that verbal meaning is composed from the syntactic configuration, without appealing to Voice for the introduction of external arguments. Ramchand’s (2008) First Phase Syntax is a prominent such framework. Ramchand proposes that verbs actually contain three sub-components: a causing sub-event, a process-denoting sub-event, and a sub-event corresponding to a result state. These are ordered in a hierarchical relation, shown in (50):
(50) [tree structure not reproduced here: InitiatorP > ProcessP > ResultP]
In this system, ProcessP (proc) is present in every dynamic verb. In other words, such a phrase is present regardless of whether we are dealing with an activity, an
accomplishment, or an achievement. The InitiatorP (init) layer is present whenever there is a causational component that leads to the process. Finally, the ResultP (res) layer is present if the lexical predicate explicitly realizes a result state. The structures in (31) were inspired to a certain extent by (50), the crucial difference being the layer introducing the external argument. Noun phrases introduced in the specifier positions of the phrases in (50) are interpreted as initiators, undergoers, and resultees. In Ramchand’s system, noun phrases may bear more than one role. In other words, composite roles such as undergoer-initiator and resultee-undergoer are possible. In the first case, the same noun phrase is the initiator of the event and the undergoer of a process, as John in (51a). In the latter case, a noun phrase is both the holder of the result state and the undergoer of a process, as the table in (51b):
(51) a. John ran to the store.
     b. John broke the table.
Ramchand’s system also allows for verbs to under-associate. This means that, for example, break can lexicalize all three layers in (50), which then corresponds to the syntax of a transitive verb. It can also lexicalize only proc and res, yielding an intransitive variant thereof. This specification is part of the lexical entry of the verb break. While First Phase Syntax shares with DM the view that arguments are interpreted due to the syntactic configuration in which they are inserted (following Hale and Keyser (1993)), it does not consider roots the core unit of word formation. Since in this system lexical entries specify which heads a particular verb can lexicalize, a lot of what determines the syntactic behavior of the verb is part of its lexical information. Ramchand (2018, p. 104) further elaborates on this system and proposes that next to these three dynamic heads, the higher argument is merged in an event head, Evt. Evt could be the locus of a derived external argument in the case of passives. However, when it is base-generated there, the encyclopedic semantics of the root within InitP will lead to an interpretation of this argument as agent or causer.
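As a rough, hedged sketch of what under-association amounts to (not Ramchand’s own formalism), one can think of a lexical entry as listing the event heads a verb may lexicalize, with different subsets yielding different frames; the labels and example sentences below are illustrative only.

```python
# Toy illustration of under-association: 'break' can lexicalize init, proc,
# and res, or only a subset of them, with different syntactic frames resulting.
BREAK_SPECIFICATION = {"init", "proc", "res"}  # heads listed in the lexical entry

def frame(lexicalized: set) -> str:
    """Classify the frame that results from the heads actually lexicalized."""
    if not lexicalized.issubset(BREAK_SPECIFICATION):
        return "not licensed by the lexical entry"
    if "init" in lexicalized:
        return "transitive causative, e.g. 'John broke the table'"
    if lexicalized == {"proc", "res"}:
        return "intransitive (anticausative), e.g. 'The table broke'"
    return "incomplete event structure"

print(frame({"init", "proc", "res"}))  # transitive use
print(frame({"proc", "res"}))          # intransitive use
```

The sketch makes concrete the point in the text: because the head specification sits in the lexical entry, much of the verb’s syntactic behavior is determined by lexical information rather than by the syntax alone.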
6.7 Exoskeletal model

Borer’s (2005a, b, 2013) view on AS is based on aspectual structure. As in DM, Borer assumes that roots are the core units of word formation and lack categorial features. Unlike DM, Borer argues that categorizers are not part of the syntactic structure; roots are categorized by functional structure. With respect to argument structure, Borer’s system proposes that there are two aspectual layers present: AspQP (quantity aspect) and EP (event phrase), which assigns the role of an originator (of a non-stative event). As in other syntactic work, the
arguments are introduced in the specifiers of these projections. For instance, the internal argument of a causative verb is introduced in Spec,AspQ. As in Ramchand’s system, it is further allowed that arguments can have more than one theta-role. In Borer’s model, unaccusatives have a structure similar to transitives, and the argument moves from AspQ to EP; unergatives/activities, by contrast, lack AspQ, the layer responsible for telic interpretations. The flexibility in verbal interpretation seen above is the result of the introduction of roots in distinct syntactic configurations:
(52) [tree structures not reproduced here]
Borer (2013) assumes that there are basic, underived units—roots—which, importantly, are not words. Roots are devoid of syntactic category, which becomes available through syntactic structure. Alongside roots, and distinct from them, there is a finite, UG-defined list of functors, where a functor defines a rigidly designating function. The set of rigidly designating functions includes the set of functional elements, for example, determiners, tense markers, plural marking and classifiers, auxiliaries, quantifiers, cardinals, aspectual markers, modals, complementizers, negation, and so forth. Categorial derivational affixes eventually to be realized as -ion, -able, en-, or -ship are also functors. Borer argues that functors fall into two very distinct types along at least syntactic and semantic lines. Elements such as determiners or past tense, which are typically assumed to be linked with the functional architecture, correspond to some
semantic formulas and are called S-functors. By contrast, functors that ultimately spell out as -ation or -able are categorial functors, or C-functors. When roots, for example √COAST and √FACE, merge with functors that are instantiated as -al or the, they are N-equivalent, but when they merge with functors that are instantiated as -able or will, they are V-equivalent. Importantly, roots are not assigned a category, nor does a categorial conversion operation of any sort take place. Rather, roots are N-equivalent or V-equivalent because the categorial space has been divided by a functor, and the “space” into which these roots have been “poured”, so to speak, defines an N- or a V-space, respectively.
6.8 Back to nominalization and adjectival passives

Recall that in Chomsky’s Remarks, the main issue was that derived nominals are not derived via a series of transformations from a verbal structure (see also Chomsky, 2020). Nevertheless, we have seen in Section 6.3 that derived nominals may contain the internal argument of the verbal predicate they are derived from, a property that led Kratzer to assume that internal arguments are part of the lexical meaning of the verb. Assuming, however, that arguments are introduced by other heads, as we saw in Section 6.5, if nominalizations contain internal arguments, these must be introduced by the same type of heads that introduce arguments in the verbal domain (Alexiadou and Grimshaw, 2008; Borer, 2013). Alexiadou and Grimshaw (2008), following Grimshaw (1990), explicitly stated that only nouns related to verbs have AS. From the point of view of DM, the question is whether the differences discussed in Chomsky’s Remarks between gerunds and derived nominals clearly suggest that the former are verb derived, while the latter are root derived, as suggested in Embick (2010) and discussed in Section 6.4. There are, however, reasons to consider that derived nominals are also verb derived. Evidence for this comes from the observation in Harley (2009), Alexiadou (2009), and Alexiadou, Iordachioaia, Cano, Martin, and Schäfer (2013) that derived nominals contain verbalizers:
(53) regular-iz-ation
If that is the case, then v must be present within nominals. If v introduces arguments or if v combines with a ResultP introducing arguments, then these arguments may be inherited in the nominalization. The question that arises is whether other layers such as Voice may also be present. As we saw in Section 6.3, Kratzer employed this test to sever external arguments in the context of adjectival participles. Applying this to derived nominals, she shows that we obtain similar results, suggesting that Voice may be absent. By contrast, ing-of nominals are not compatible with reflexive action, suggesting that Voice is present.
(54) a. the enrolling of the students
     b. the enrollment of the students
Recall that in Remarks, derived nominals were not syntactically derived because they do not feed transformations. However, Bruening (2018) recently showed that this claim is not empirically sound. In particular, both raising to subject (55a, his 3a) and raising to object (55b, his 4a) are found in derived nominals, which, according to Bruening (2018), is an argument for the syntactic treatment of derived nominals:
(55) a. That is an accepted premise, the same concept should apply to the net neutrality debate and its certainty to increase consumer bills.
     b. Sadly a species’ name affects its likelihood to survive.
Turning to adjectival passives, the picture is slightly more complex than initially presented in Kratzer. Anagnostopoulou (2003) argued that languages differ as to whether or not adjectival passives contain Voice. She examined two types of adjectival participles in Greek, -menos and -tos participles, which differ in systematic ways, pointing to the presence of Voice within the former. For instance, agentive PPs, as in (56), and instrumental PPs, as in (57), are licit with menos-participles:
(56) a. Ta keftedakia ine tiganis-mena apo tin Maria.
        the meatballs are fried by the Mary
        ‘The meatballs are fried by Mary.’
     b. *Ta keftedakia ine tigan-ita apo tin Maria.
        the meatballs are fried by the Mary
        ‘*The meatballs are in a fried form by Mary.’
(57) a. To kimeno ine gram-meno me stilo.
        the text is written with a.pen
        ‘The text is written with a pen.’
     b. *To kimeno ine grap-to me stilo.
        the text is written with a.pen
        ‘*The text is in a written form with a pen.’
Anagnostopoulou took this behavior as supporting the view that Greek -menos participles contain Voice, unlike their German counterparts. Adjectival participles in German cannot employ agentive or instrumental PPs (58a–b):
(58) a. *Der Fisch war von Maria gebraten.
        the fish was by Mary fried
        ‘The fish was fried by Mary.’
     b. *Ihre Haare sind mit einem goldenen Kamm gekämmt.
        her hair are with a golden comb combed
        ‘Her hair is combed with a golden comb.’
But importantly, internal arguments are not obligatory in nominalizations, unlike in adjectival participles. In Grimshaw (1990) and work inspired by Grimshaw (1990), this was explained as follows: derived nominals are ambiguous between a reading that supports AS and a reading that disallows AS (see also Alexiadou and Borer, 2020). The former are labeled Argument Supporting Nominals (ASNs) here, following Borer (2013), the latter Referential Nominals (RNs). In Grimshaw’s work, the nominals bearing the relevant argument-supporting reading were labeled complex event nominals. Grimshaw showed that the two are distinguished by a number of diagnostics, illustrated in (59) and exemplified in (60), but see Grimm and McNally (2013), Lieber (2016), and Alexiadou (2019) for some problems with this classification.
(59)  RNs                          ASNs
      no event reading             event reading
      no internal argument         internal argument
      no agent modifiers           agent modifiers
      no by phrases                by phrases
      no aspectual modifiers       aspectual modifiers
      frequent + plural N          frequent + singular N
      no article restrictions      only definite articles
(60) a. the frequent examination of the students by the teacher          ASN
     b. the examination was on the table                                 RN
If this is the case, derived nominals can be nominalizations both of layers including Voice, where all the AS of the corresponding verb is present, and of simply the root layer, excluding both external and internal arguments. However, in view of the fact that verbal morphology is present even within RNs, it must be the case that these may combine with v but exclude AS. Note here that if Lieber (2016) is right that even RNs allow arguments (see (61), from Lieber, 2016, p. 43), then basically, nominals derived from verbs can always have arguments, and we need to understand why the presence of AS is not obligatory with nouns (see Alexiadou (2011b) and Bruening (2018) for some discussion of this):
(61) It was gigantic, nearly as big as my grandfather’s carving of Gog Magog.
On the other hand, several researchers have argued that nominals lack arguments. A recent such approach is put forth in Grimm and McNally (2013), who argue that nominalizations are unlike verbs (see also Dowty, 1989). In their analysis, nominalizations contain participant variables, which are free, and can be contextually valued and identified. This in turn accounts for the optionality of AS with nouns, while no such optionality exists with verbs.
6.9 Conclusions

As has been emphasized several times in this contribution, Chomsky (1970, p. 185) pointed out in Remarks that “the grammar is a tightly organized system; a modification of one part generally involves widespread modifications of other facets.” We have seen some recent work that advocates a modification of this relationship, according to which the lexicon is minimized or eliminated. Naturally, there are many researchers that view the lexicon as a separate sub-component of the grammar (see, for instance, Levin and Rappaport Hovav, 2005; Müller, 2018; or Ackema and Neeleman, 2004), and point to shortcomings of the syntactic models. My aim here was not to argue that a rich syntax is better than a rich lexicon. My main concern was to discuss the consequences that follow from a positive answer to the question of whether the generative component that produces words is identical to the one that produces phrases. Such a view of grammar dispenses with the lexicon and defines specific local domains that explain idiosyncrasy (see, e.g., Embick (2010) for a detailed such proposal). If the lexicon is dispensed with, then the rules of syntax should apply to processes of word formation as well.
Acknowledgments

I am indebted to two anonymous reviewers and the editors of this handbook for their comments and input. AL554/8-1 is hereby acknowledged.
Chapter 7
Lexical Semantics
Ray Jackendoff
7.1 Introduction: the problem of lexical semantics

7.1.1 What’s the lexicon? and What’s semantics?

As the topic of lexical semantics is vast and has been approached from many different points of view, this chapter necessarily presents a rather personal view of the field, addressing issues with which I have been engaged and which I find telling. Others would no doubt write quite a different chapter. This section discusses general issues: what a theory of lexical semantics has to account for; Section 7.2 sketches several prominent approaches in the literature. Sections 7.3–7.6 then present some of the basic elements involved in the content of lexical meanings. Section 7.7 raises a forbidding difficulty for all theories of lexical semantics, and Section 7.8 wraps things up. In order to develop a theory of lexical semantics, we must first ask what counts as a lexical item—what sort of things are in the lexicon—and what is intended by semantics. The lexicon is typically regarded as the “place” where words are stored; and the term lexical item is typically understood as coextensive with word. However, as stressed by DiSciullo and Williams (1987), it is important to distinguish the notion of grammatical word from the notion of lexical item—a piece of language stored in memory. For instance, purpleness can be recognized as a grammatical word, but it is probably not stored in most speakers’ memories. Rather, when being heard or uttered, its morphological structure and meaning are constructed on the fly from smaller stored parts. Hence, it is not a lexical item in DiSciullo and Williams’s sense. On the other hand, the idiom chew the fat is a VP, not a grammatical word. Because its meaning cannot be built from the meanings of its parts, it has to be learned and stored. Thus, despite not being a word, it is to be regarded as a lexical item.
For the purpose of studying the mental lexicon, the notion of a lexical item as a stored piece of language is the most pertinent. Under this construal, a lexical entry consists of an association between a piece of semantics, a piece of phonology, and a piece of syntactic structure, stored in long-term memory. According to this extended conception of the mental lexicon, it contains not only words but thousands of phrasal idioms like chew the fat.1 It also contains clichés such as light as a feather and red as a beet, and collocations that have literal interpretations but are known to be the “right” way to say things, for example phrases like black and white rather than #white and black. In addition to such items and a vast number of other stored multi-word expressions (Christiansen and Arnon, 2017; Culicover, Jackendoff, and Audring, 2017), the lexicon must also contain morphological affixes, some of which are meaningful (e.g., -ful and un-), and some of which are not (e.g., accusative case).2 Perhaps more surprisingly, the lexicon also includes syntactic constructions that are linked to idiosyncratic meanings. An example is the N of an N pattern illustrated in (1).
(1) a. a travesty of an experiment (≈ ‘an experiment that was a travesty’)
    b. that gem of a result (≈ ‘that result, which was a gem’)
The syntactic heads in (1) (travesty, gem) are understood as modifiers that offer a negative or positive evaluation of the syntactic dependent (experiment, result).3 This contrasts with the canonical interpretation of this syntactic configuration, in which the syntactic head is also the semantic head—a picture of a cat denotes a picture, not a cat. Idiomatic patterns like (1) must be learned and listed in the lexicon; they are the stock in trade of Construction Grammar (Fillmore et al., 1988; Jackendoff, 1990; Goldberg, 1995; Croft, 2001; Hoffmann and Trousdale, 2013).4 Turning now to the term semantics: Many researchers have sought to limit semantics to those aspects of meaning that are specific to grammar or specific to language. For example, Bierwisch and Lang (1989), Levinson (1997), and Lang and Maienborn (2019) make a distinction between “Semantic Form” and “Conceptual Structure”. Semantic Form includes only the aspects of meaning that are contributed directly by a sentence’s words and syntactic structure. Conceptual Structure specifies the full intended message, including implicatures, coercions, fixing of deictic reference, nonlinguistic context, and so on; these extra factors are taken to be the responsibility of pragmatics. 1 Alternatively,
idioms are sometimes thought to be stored in a different “place” from words, such as an “idiom list.” This does not change the issue of their semantics, which resembles that of words in (nearly) every respect. 2 For a lexical treatment of non-affixal morphology such as ablaut and reduplication, see Jackendoff and Audring (2020). 3 That the first noun has to be evaluative explains, for instance, why *that sailor of a violinist is no good but that butcher of a violinist is all right: butcher can be understood as an evaluation but sailor cannot. 4 If it should turn out that there is something special about the meanings of words in particular, so be it. But such a distinction can only be discovered by addressing the entire menagerie of phenomena. Given limitations of space, I will nevertheless concentrate here on the meanings of words.
Pietroski (2018) proposes an even narrower use of the term: word meanings do not themselves have semantic content but are rather “instructions” for “fetching” and “assembling” concepts. Here, however, we will understand the term “semantics” in a broader sense, roughly equivalent to Conceptual Structure in the sense of Bierwisch et al.: the meaning of a lexical item is the concept that it expresses, and the study of word meanings overlaps substantially with that of concepts. Learning the meaning of a lexical item involves acquiring a concept and associating it with phonological and syntactic representations. “Pragmatics” will be regarded as the collection of processes that supply the parts of utterance meaning that do not come directly from the words, the syntactic structure, the idioms, and the meaningful syntactic constructions (see Schwarz and Zehr, this volume). For present purposes, I don’t wish to make a fuss about the terminology; I just want to be clear what I intend here. Section 7.7 returns to the issue of limiting the scope of semantics.
7.1.2 What does a theory of semantics in the mental lexicon have to account for?

From the perspective of the mental lexicon, the central question of lexical semantics is: What sorts of mental representations serve as the meanings of lexical items in the extended lexicon? A theory of semantics in the mental lexicon should satisfy at least eight partially overlapping desiderata. The first is that the theory must be explicitly mental: it must relate to the way that the mind conceives of the world. In fact, if approached with this focus in mind, lexical semantics can provide one of the best sources of evidence about how humans understand the world and act in it. A second desideratum is descriptive adequacy or expressiveness: the theory must supply each lexical item5 with a meaning, such that nonsynonymous items have distinct semantic structures, and such that items that intuitively are related have related semantic structures. The account must for instance include the meaning relations involved in polysemy, such as those illustrated by the underlined words in (2).
(2) a. The mirror broke. ~ Bill broke the mirror.
    b. the end of the rope ~ the end of the speech
    c. the butter on the bread ~ Bill buttered the bread
A third criterion is compositionality: it must be possible for the meanings of lexical items to be combined into phrase, sentence, and discourse meanings, with the help
Caveat: each meaningful lexical item. A few words, such as the do of English do-support, have no meaning and are used only as “grammatical glue” to fulfill syntactic requirements.
Lexical semantics 129 of pragmatics when appropriate (see Stojnic and Lepore, this volume; Piñango, this volume). A fourth is translatability: to the extent that accurate translation is possible, translation equivalents—whether words or phrases—should have the same or at least very similar semantic structures. A fifth desideratum is that lexical semantics must provide a formal account of inference, such as the inferences in (3). (3)
a. b. c. d.
Beth owns a dog → Beth owns an animal Bill entered the room → Bill ended up in the room Jill despises chess → Jill doesn’t like chess Pam sold a book for $5 → Pam received $5
All five criteria so far concern the internal workings of the semantic system. A sixth desideratum is an account of reference—how language expresses thoughts about the world. In a theory of the mental lexicon, this issue has to be treated with special care. Meanings are encoded deep in the brain, and their access to the real physical world out there is mediated by complex perceptual computations that construct the world of our experience. Moreover, many words refer to nonperceivable entities such as gods, mortgages, and theorems, which exist only by virtue of human cognitive capacities. Therefore, in a properly mentalistic theory of reference, linguistic expressions refer to the world as construed by the language user, rather than to “the world” simpliciter.6 Similarly, a mentalistic theory of truth has to concern not absolute objective truth, but a language user’s judgments or convictions of truth. A seventh desideratum for lexical semantics is learnability: a language learner must be able to construct and store the meanings of tens of thousands of lexical items, on the basis of linguistic and nonlinguistic input. The consequences of this condition require a little more exegesis and lead to an eighth desideratum.
7.1.3 The open-endedness of lexical meanings A founding tenet of modern linguistic theory is the productivity of language—its ability to combine words into an unlimited number of sentences (Chomsky, 1957, 1965, citing Humboldt, 1836). Given the finiteness of the brain, one can neither learn nor store this unlimited repertoire, so language users must possess a productive combinatorial system for creating linguistic expressions—and their meanings—from smaller parts (Stojnic and Lepore, this volume). Moreover, a language learner has to come equipped with the ability to induce this system on the basis of linguistic and nonlinguistic input. 6 An important component of the world “as construed” is a conviction of its reality. But this is a property of the cognitive system in general, not just semantics. Presumably, dogs and monkeys likewise take the world to be real. They likely differ from us, though, in not being able to question the reality of what they experience.
130 Ray Jackendoff Less emphasized has been a parallel argument for the meanings of individual lexical items. The languages of the world express an apparently unlimited number of lexical meanings, enough to encompass all the things we and other cultures can name: kinds of objects, kinds of actions, kinds of properties, kinds of relationships, and so on, with all their intricate shadings and undertones. There is never a sense of running out of new concepts. Of course, any single person stores only some finite number of lexical items. How are they acquired? Fodor (1975) advocates that language learners come equipped with the full unlimited range of potential lexical meanings, and that learners simply activate whichever of the innate concepts correspond to words of their local language.7 But this cannot be. It is hardly plausible that the concepts expressed by telephone and carburetor are innate—and were already innate in the ancient Greeks and even in prehistoric hunter-gatherers. Rather, the space of possible lexical meanings available to the learner has to be characterized in terms of a productive combinatorial system, with a finite set of primitives and principles of combination. This finite base equips the language learner to construct lexical meanings on the basis of linguistic and nonlinguistic input.8 A final desideratum for a theory of lexical semantics, then, is to discover this system of primitives and principles of combination—what might be called the “grammar of meaning.” This does not imply that every language lexicalizes the same concepts (Landau, this volume) or even that two speakers of the same language lexicalize the very same concepts in the very same way. That is clearly not the case. Rather, as with syntax acquisition, a child comes to lexical acquisition with the tools to construct any of the humanly possible lexical meanings, given appropriate input.
7.2 Some semantic theories and their bearing on mentalist lexical semantics Needless to say, there is no theory at present that satisfies all these desiderata. Approaches differ as to which of the desiderata they engage with and emphasize. A few
7 Fodor’s main argument for the innateness of all concepts expressed by monomorphemic words is that it is impossible to specify most word meanings in terms of phrasal definitions (Fodor, Garrett, Walker, and Parkes, 1980). However, definitions fail not because meanings are innate, but because many crucial features of word-internal semantic structure (to be discussed in Sections 7.3–7.6) are simply not available in phrasal composition. Since definitions are by their nature phrasal, they cannot completely mirror the internal semantic structure of lexical items. 8 The primitives need not be binary features, though this is one option. There are also dimensions of continuous variation, for instance the three-dimensional color space. And there must also be the possibility of function/argument structure. See Sections 7.3–7.6.
Lexical semantics 131 important approaches bear mention here, at the risk of gross oversimplification of rich traditions of inquiry. By far the most influential is formal or truth-conditional semantics (Heim and Kratzer, 1998), growing out of philosophy of language and formal logic. Based on arguments of, for instance, Frege (1892), Lewis (1972), Montague (1974), and Putnam (1975), the foundations of this approach are explicitly non-psychological, grounding inference and reference either in the real world or in a set-theoretic model (which may include possible worlds). Hence, the character of the mental lexicon is rarely addressed, and neither is the problem of learnability. A principal concern of this approach is how word meanings are composed into phrase and sentence meanings; and lexical semantics tends to be characterized predominantly in terms of these compositional properties (their type structure). But aside from quantifiers, intensional predicates like believe, and functional categories such as determiners and tenses, little attention is paid to the internal semantic structure of lexical items. In principle, one could construct a mentalist version of model-theoretic semantics, by taking the model to be not sets of possible worlds, but rather the model of the world that is constructed by the speaker’s cognitive systems. The character of such a model would be an empirical issue. Truth in such a model would amount to conformance to the speaker’s construal of the world, as advocated in Section 7.1.3. Bach (1986) begins to go in this direction, and some subareas of formal semantics adopt a mentalist approach, for instance work on scalar implicature (e.g., Chierchia, Fox, and Spector, 2012; de Carvalho et al., 2016; Schwarz and Zehr, this volume). For another perspective, mainstream generative grammar (Chomsky, 1965, 1981, 1995) has had virtually nothing systematic to say about lexical semantics. Berwick and Chomsky (2017, p. 90) characterize lexical items as “atomic elements,” “the minimal meaning- bearing elements of human languages— wordlike, but not words”; these elements “pose deep mysteries.... Their origin is entirely obscure, posing a very serious problem for the evolution of human cognitive capacities.” In this approach, compositionality is yoked to syntax, lexical items have no (discernible) internal structure, and there is no approach to inference, reference, or acquisition. Quite a different approach is based on Wittgenstein’s (1953) dictum that identifies meaning with use, and on Harris’s (1957) proposal that meaning is determined by co-occurrence. Latent Semantic Analysis (Landauer and Dumais, 1997) uses a large corpus to classify words in terms of the collection of other words in their context. A more recent version is vector-space semantics or Distributed Semantic Models (Lenci, 2008; Baroni, 2013), in which a word meaning is encoded as a vector in a very high- dimensional space, again based on the distribution of its contexts of use in a very large corpus. Items that are similar in meaning are located near each other in the space. While such approaches to measuring semantic similarity may be able to mimic certain human experimental results, and may be useful for search engines (Clark, 2013), analysis based solely on statistics of co-occurrence does not lend itself to an account of semantic compositionality or an account of extralinguistic reference. 
Moreover, it is questionable whether the elaborate statistical techniques for deriving semantic vectors bear any resemblance to human lexical learning.
132 Ray Jackendoff More cognitively based frameworks include Conceptual Semantics (Jackendoff, 1983, 1990, 2002; Pinker, 1989, 2007); Cognitive Grammar (Fauconnier, 1985; Langacker, 1987; Lakoff, 1987; Talmy, 2000; Geeraerts, 2010); Generative Lexicon (Pustejovsky, 1995; Pustejovsky and Batiukova, 2017); Geometry of Meaning (Gärdenfors, 2014); Natural Semantic Metalanguage (NSM: Wierzbicka, 1985), and the extensive lexical decompositions of Miller and Johnson-Laird (1976). These approaches take meaning to be situated in the mind. They tend to focus on expressiveness, analyzing large families of lexical meanings, and seeking analyses in terms of psychologically plausible primitives. With the exception of NSM, they are all concerned with compositionality—and with breakdowns of strict compositionality in phenomena such as coercion, metaphor, and meaningful idiomatic constructions (in the sense of Section 7.1.1). They differ in formalism and in the extent to which they deal explicitly with inference, reference, and learnability. Oddly enough, the phenomena they concentrate on are to some degree orthogonal to those addressed in formal semantics: they are concerned more with words like chair and cup than with words like every and only. An important outlier among the mentalist theories is Fodor’s (1975, 1987) Language of Thought Hypothesis. Fodor definitely wishes to situate lexical semantics in the mind. But he rejects at least two of the desiderata for lexical semantics. First, as remarked in Section 7.1.3, he insists that word meanings are noncompositional and innate. Second, he takes a partly nonmentalistic view of reference, insisting that word meanings are intentional, that is, that they refer to entities in “the world,” rather than to the world as construed by the language user (for discussion, see Jackendoff, 2002, pp. 275–280). Fodor’s arguments are almost entirely programmatic; there is little engagement with details of linguistic analysis.
7.3 Major components of lexical semantics 7.3.1 The contribution of spatial structure We now turn to some of the issues of semantic description that bear on the character of lexical meanings. The first of these issues is posed by Miller and Johnson-Laird (1976), Macnamara (1978), and Jackendoff (1987a): How can we talk about what we see? For a simple case, consider the utterance That is a chair, accompanied by a pointing gesture. In order to construct or comprehend this utterance, a language user has to integrate information from the visual system with information from the language system, in particular identifying a visually perceived individual as the intended referent of the linguistic expression that. Furthermore, for the language user to judge the utterance true or false, this visual information has to be compared with an internal representation of what chairs look like.
Lexical semantics 133 What is the form of this information? Putative features like [+has-legs], [+has-back- support], or [+for-sitting] are not very useful: they do not generalize to any significant class of examples. In particular, chair legs and chair backs are quite different in character from human and animal legs and backs. A better hypothesis is that these characteristics of chairs are encoded in a quasi-geometric or topological format, the highest level of the visual system, in which objects are represented in a perspective-independent fashion, in spirit following Marr’s (1982) “3D model” and Kosslyn’s (1980) “skeletal image.” This format, which might be called spatial structure, is not simply a visual image of some instance or of a prototype. It has to be able to represent objects schematically, abstracting away from details, and independent of viewpoint. It has to represent the full forms of objects, not just their visible surfaces, and it has to represent the spatial layout and motion of objects in a scene. It is not exclusively visual: object shape and spatial configuration can also be derived through hapsis (the sense of touch). Moreover, the shape and spatial configuration of one’s own body can be derived though proprioception (the body senses) and altered through the production of actions. Thus, spatial structure can be conceived of as a central level of cognition, coding the physical world in relatively modality-independent fashion. It is moreover a necessary component of the mind, even in the absence of language: presumably, our primate relatives encode the physical world in much the same way as we do. Turning back to lexical semantics, the hypothesis is that part of the lexical entry for the word chair is a schema in spatial structure that delimits the variation in shape and size of chairs. Similarly, the lexical entry for the verb sit includes a link to a spatial structure that encodes the action of a schematic individual performing this action, including as context a horizontal supporting surface. The fact that a chair is used for sitting involves composing these schematic representations. In principle, many aspects of lexical meaning can be offloaded onto spatial structure— not just object shape, but for instance color, surface pattern (striped, polka-dotted), texture (smooth, lumpy), trajectory of motion (encircle, zigzag), and manner of motion (sprint, waddle). In addition, of course, faces that one knows have to be coded spatially and associated with names (in the lexicon? where else?). To the best of my knowledge, there are no formal theories of spatial representations that even begin to approach the task of supplying the range of distinctions demanded by the semantic richness of language. (Landau and Jackendoff (1993) is an attempt in the highly restricted domain of spatial prepositions; Epstein and Baker (2019) review brain localization of various features of perceived scenes.) However, it must be stressed that such a representation is necessary in any event in order to explain vision, hapsis, proprioception, and their interactions. It overlaps (and is perhaps coextensive with) the “core knowledge” of objects explored by Spelke (2000) and Carey (2009). At the same time, for the purposes of lexical semantics, it enables language to make contact with perception of the world, and it enables the theory to eliminate a plethora of unmotivated semantic features such as [+has-legs]. 
It should go without saying that there must be further domains of perceptual representations, namely sounds, smells, and tastes, and that these too have to be
134 Ray Jackendoff included in lexical representations where appropriate. For instance, the lexical entry for laugh must include a schematic encoding of what laughing sounds like—what Lila Gleitman (p.c.) has called the “ha-ha part”—perhaps linked to a spatial schema of what laughing looks like and/or a proprioceptive representation of what it feels like to laugh, not to mention whatever sort of representation encodes things as funny.
7.3.2 The contribution of conceptual structure Spatial structure alone cannot make all the distinctions necessary for lexical meaning. Many other elements lend themselves to a more conventional algebraic or feature-based representation that I’ll call conceptual structure. These include: (a) The type-token distinction. All perceived entities are tokens. But a stored spatial representation, say of a dog, could be either a token (‘my dog Rover’) or a type (‘dogs that look a certain way’). A binary feature is necessary to distinguish them. (b) Taxonomic relations: ‘X is an instance/subtype of Y.’ A classic case is furniture, whose instances don’t look at all alike, hence cannot fall under a common schema in spatial structure. Rather, this relation has to be encoded in conceptual structure. (c) Unobservable temporal relations: ‘Event/Situation X is such-and-such a distance in the past/future.’ The time at which something unobservable takes place is not a spatial property. (d) Aspectual predicates such as ‘begin,’ ‘continue,’ and ‘end’ draw attention to the shapes of events (Piñango, this volume). An image of an event alone cannot pick out the onset of the event as opposed to the event itself. (e) Causal relations: ‘Event X causes/enables/impedes Event Y.’ As argued by Hume (1748) and demonstrated experimentally by Michotte (1963), causation is a notion cognitively imposed on visual perception, above and beyond the motion and contact of objects.9 (f) Modality: ‘X is possible’ is not part of the appearance of X. Similarly, Sherlock Holmes’s looks are not what makes him fictional. And the ambiguity of John wants to buy a car—in which a car can be specific or nonspecific—has nothing to do with how the car looks (see Hacquard, this volume). (g) Social notions: ‘X is the name of Y,’ ‘X is dominant to Y,’ ‘X is kin to/friend of/ enemy of Y’ ‘X is member of group Z,’ ‘X is obligated to perform action W.’ Many of these have counterparts in primate societies (Cheney and Seyfarth, 1990; Jackendoff, 2007) and they have little or nothing to do with how the individuals in question look. 9 However, if one of the objects involved is oneself, either in exerting force on another entity or in being the recipient of force, causation might be perceived more directly through proprioception of exertion.
Lexical semantics 135 (h) Theory of mind notions: ‘X believes Y,’ ‘X imagines Y,’ ‘X intends action Y’ are unobservable and therefore must be encoded in conceptual structure (see Landau, this volume). Thus, the work of lexical semantics has to be divided between spatial and conceptual structure, each contributing its own characteristic forms of information. In short, lexical semantics involves multiple domains of mental representation.10
7.4 Semantic decomposition, but not into necessary and sufficient conditions The tradition in philosophy of language (e.g., Tarski, 1956; Katz, 1966; Heim and Kratzer, 1998) is that sentences are to be evaluated for truth value in terms of a set of necessary and sufficient conditions. The meaning of a word, say cat, can be thought of as the necessary and sufficient conditions for the sentence X is a cat to be true, that is, X meets the conditions associated with cat. However, this criterion must be rejected. First, consider the sentences in (4). (4) a. This object is red. b. This object is orange. As we move along the spectrum from focal red to focal orange, exactly where does (4a) stop being true and (4b) start being true? There is no fact of the matter. If the question is posed experimentally, people’s reaction times to colors in the border region between the two are slower; their judgments may differ depending on the color of previously presented examples; and they don’t all agree (Murphy, 2002). It is not that there is a “true” meaning of red to be established by science, but people don’t know it (as with Putnam’s (1975) treatment of tigers, gold, and water). Nor can the problem be solved by calling the in-between shades red-orange: we then face the same border problem, this time between red and red-orange. Rather, the meaning of red simply has vague boundaries surrounding focal red, in part hemmed in by nearby colors in color space.11
10 See
Jackendoff (1996) for discussion of how to decide whether particular phenomena belong to spatial structure, conceptual structure, or some combination. 11 If one wishes to treat a word meaning as a mentally represented prototype (Rosch, 1978), it is still necessary to establish a range of variation: scarlet is far narrower than red. Note also that particular idiosyncratic combinations can be learned and stored. For instance, one learns that a redhead does not have a head that is focal red, but rather has hair of a particular range of shades approximating red. The same goes for Fodor’s example pet fish (see Stojnic and Lepore, this
136 Ray Jackendoff Another well-known case is the sorites paradox: How many hairs can one have on one’s head and still be bald? Here again there is no exact fact of the matter—and there certainly is not going to be a science of baldness that can resolve the issue. Rather, there is a focal notion of baldness—no hair at all—plus a vague upward boundary. A more consequential example is at what point during gestation a human fetus becomes a person—as though there is a sharp boundary. It is the very indeterminacy of the answer that allows it to be politically manipulated. A different class of examples does have defining conditions, but they are neither individually necessary nor collectively sufficient. A typical example is the verb climb (Fillmore, 1982a; Jackendoff, 1985). Intuition suggests that climb denotes an action that involves (a) moving upward and (b) roughly, moving along a vertical-ish surface with effort—what might be termed “clambering.” (Notice that both conditions involve spatial structure.) A sentence like (5a) conforms to both these conditions. (5) a. The bear climbed the tree.
[upward motion + clambering] b. The bear climbed down the tree/across the ledge. [only clambering] c. The plane climbed to 20,000 feet. [only upward motion] The temperature climbed to 35o C. [only upward motion] d. *The plane climbed down to 20,000 feet. [neither] o *The temperature climbed down to minus 10 C. [neither]
But consider the other examples in (5). (5b) violates the condition of upward movement; it involves only clambering. So upward movement is not necessary, and clambering is sufficient. On the other hand, (5c) violates the condition of clambering, and involves only motion upward. So clambering is not necessary either, and motion upward is sufficient. Finally, the sentences in (5d) conform to neither condition, and they are not judged to count as climbing. Hence, at least one or the other of the two conditions is necessary, but either one is sufficient. One might be tempted to claim, along the lines of Katz (1966), that climb is polysemous: one sense denotes clambering and appears in (5b), while the other denotes upward motion and appears in (5c). But that implies that (5a) is ambiguous, and thus that one can ask which sense of climb the speaker of (5a) intends. This misrepresents what is going on: (5a) partakes of both senses at once, as suggested by the intuitive analysis of climb. Hence, the two conditions together denote stereotypical climbing, and in neutral contexts, both are invoked by default. At the same time, either condition can be dropped to yield a more “extended” or less stereotypical denotation.
volume), which is neither a prototypical pet nor a prototypical fish: one learns what sorts of fish are prototypically kept as pets, thereby overriding compositional typicality. A truly novel combination, say pet marsupial, evokes the prototype, in this case a kangaroo or possibly an opossum.
Lexical semantics 137 We therefore arrive at a configuration within a lexical meaning in which conditions are not logically conjoined. A new principle of combination is necessary in the repertoire of semantic composition. Jackendoff (1983) uses the term preference rules for conditions combined in this fashion: they are both preferable, but either is sufficient on its own. The relation is more specific than logical inclusive or, in that when a referent meets both conditions, it is judged not just acceptable, but more stereotypical (Rosch and Mervis, 1975; Murphy, 2002).12 A well- known case of such a configuration, involving more conditions, is Wittgenstein’s (1953) famous example of the word game, for which he invokes a variety of conditions in a preference rule configuration, no single one of which is necessary. With multiple conditions in play, the result is Wittgenstein’s “family resemblance” relation. A more loaded example is mother (Lakoff, 1987). A stereotypical mother is (a) related genetically to the child, (b) gives birth to the child, and (c) raises the child. However, these three functions can be distributed among two or even three individuals, in which case it is not clear which one(s) to call the mother, and the judgment depends heavily on the interests of the judger and/or legal fiat. From the perspective of traditional philosophy of language, preference rule relationships look rather exotic and perhaps even out of control. But in human cognition they are nothing special. They are ubiquitous in visual perception, as observed as early as Wertheimer (1923) (see also Labov, 1973, and Jackendoff, 1983). And they appear even in seemingly elementary distinctions such as phonetic perception, where the difference between a perceived “d” and a perceived “t” may be signaled by any combination of voice onset time, interval of vocal tract closure, and length of the preceding vowel (Liberman and Studdert-Kennedy, 1977). Thus, preference rule phenomena are characteristic of mental computation, nonlinguistic as well as linguistic. Hence there should be no objection to admitting preference rule composition as a fundamental principle of combination in lexical semantics.
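To make the preference-rule idea concrete, here is a minimal Python sketch; the function name, labels, and verdict strings are purely illustrative and are not part of any formal proposal. Both conditions together yield a stereotypical instance, either one alone yields an acceptable but less stereotypical one, and neither yields no instance at all.

    # Toy sketch of preference-rule combination for 'climb' (illustrative only):
    # the two conditions are neither individually necessary nor jointly required.
    def climb_judgment(upward_motion, clambering):
        met = sum([upward_motion, clambering])
        if met == 2:
            return "stereotypical climbing"          # (5a)
        if met == 1:
            return "acceptable, less stereotypical"  # (5b), (5c)
        return "not climbing"                        # (5d)

    print(climb_judgment(True, True))    # the bear climbed the tree
    print(climb_judgment(False, True))   # the bear climbed down the tree
    print(climb_judgment(True, False))   # the plane climbed to 20,000 feet
    print(climb_judgment(False, False))  # *the plane climbed down to 20,000 feet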
7.5 Ontology: the kinds of entities we can refer to 7.5.1 Demonstratives as evidence for basic ontological categories Section 7.3.1 observed that a demonstrative (or deictic) such as that can be coupled with a spatial representation, thereby referring to an object in the (construed) environment. 12 Relation to a focal instance and characterization in terms of preference rules are not the only source of prototypicality judgments. For instance, frequency evidently plays a role: this presumably accounts for Armstrong, Gleitman, and Gleitman’s (1983) finding that people judge 4 to be a more prototypical even number than 206.
However, demonstratives also have numerous other uses, apparently referring to other sorts of entities, such as the underlined terms in (6).
(6) a. Please put your hat here and your coat there. [pointing] [location]
    b. He went thataway. [pointing] [trajectory]
    c. Please don’t do that around here. [pointing, gesturing] [action]
    d. The fish I caught was this big. [demonstrating] [distance]
    e. This many people were at the party. [holding up 4 fingers] [amount/number]
    f. Can you walk like this? [demonstrating a waddle] [manner]
In each case, the demonstrative invites the hearer to pick out some referent in the perceptual field. What kind of referent it is is determined by the linguistic expression. For instance, the locative demonstratives here and there in (6a) invite the hearer to pick out a location; the collocation do that in (6c) invites the hearer to pick out an action, and the degree use of that in (6d), preceding an adjective denoting size, invites the hearer to pick out a size or distance. These examples show that we can refer to all these sorts of entities as if they exist in the perceivable world—a much richer repertoire than is usually considered in theories of either semantics or visual perception.13 These various types of entities might be thought of, so to speak, as “semantic parts of speech” or ontological categories—the basic types of entities that inhabit the conceptualized world. The list in (6) is hardly exhaustive: there are ontological categories for other modalities of perception, such as sounds (Did you hear that?), linguistic expressions (Did he really say that?), tastes, smells, and pains, as well as nonperceivable ontological types such as information and values. However, I conjecture that there is a finite set of them that serve as primitive features in the conceptual repertoire. At the base of a lexical item’s conceptual structure, then, is its ontological category. The meanings of the demonstratives in (6) consist of little more than that; all the details of the intended referent are filled in from the visual field. But most other words do carry further structure in their lexical entries, which distinguish them from all other words in their category.
7.5.2 Polysemy that straddles ontological categories Continuing with the theme of ontological categories, one of the earliest observations in Conceptual Semantics originates with Gruber (1965): many lexical items and 13 Traditional
philosophy of language and formal semantics typically restrict the ontology to individuals, properties, truth values, and, since Davidson (1967), events and/or actions. How all of these types of entities are picked out of the visual field is largely unknown, though there has been considerable progress on event perception (e.g., Ünal, Ji, and Papafragou, 2019; Loschky, Larson, Smith, and Magliano, 2020; Zacks, 2020).
grammatical patterns that are used to describe objects in space also appear in expressions that describe nonspatial domains. Consider (7)–(10), especially the parallels among them in the use of the underlined words.
(7) Spatial location and motion
    a. The cat is on the mat. [Location]
    b. The cat went from the mat to the window. [Change of location]
    c. Fred kept the cat on the roof. [Caused stasis]
(8) Possession
    a. The money is Fred’s. [Possession]
    b. The inheritance finally went to Fred. [Change of possession]
    c. Fred kept the money. [Caused stasis]
(9) Ascription of properties
    a. The light is red. [Simple property]
    b. The light went/changed from green to red. [Change of property]
    c. The cop kept the light red. [Caused stasis]
(10) Scheduling activities a. The meeting is on Monday. [Simple schedule] b. The meeting was changed from Monday to Tuesday. [Change of schedule] c. The chairman kept the meeting on Monday. [Caused stasis] The lexical and grammatical parallelisms in these examples (and many more, cross- linguistically) suggest that various semantic domains (or “semantic fields”) have partially parallel structure. They differ in the basic semantic relation on which each field is built; this relation is expressed in the (a) sentences. In (7a), the basic relation is between an object and a spatial location; in (8a), it is between an object and who it belongs to— a sophisticated social notion involving rights and obligations (Snare, 1972; Miller and Johnson-Laird, 1976; Jackendoff, 2007). In (9a), the relation is between an object and its “location” in a “property space” such as size, color, value, or even emotional affect (such as angry and placid). In (10a), the relation is between an action and, as it were, its assigned “location” on the time-line. These disparate relations are elaborated with parallel machinery: the (b) sentences express change over time in the relation in question; the (c) sentences express this relation remaining intact over time, by virtue of the agency of someone or something. Important to this analysis is that many verbs and prepositions occur over and over again in different semantic fields—not entirely consistently, but frequently enough not to be just a coincidence. For a realist (or “externalist”) semantics, these parallelisms make little sense. In the real world, the spatial location and motion of an object have nothing to do with who it belongs to (e.g., houses change owners without changing
140 Ray Jackendoff location) or what color or size the object is; and there is no real-world reason why future actions can be shuttled around in time the same way objects are moved around in space. In contrast, for a mentalist (or “internalist”) theory, these parallelisms help reveal the grain of human conceptualization: these semantic fields share part of their structure, and the parallelisms encourage corresponding parallelisms in linguistic expression. Going back to the issue of lexical decomposition, this analysis suggests that the meanings of verbs and prepositions are not semantic primitives: they must include a “semantic field feature” that says which domain(s) they belong to. The possible values of the field feature overlap in part with the ontological categories of the previous section, in particular including spatial configurations of objects and temporal configurations of actions. This parallelism among semantic domains is central to Conceptual Semantics (Jackendoff, 1983, 1990), and to Cognitive Grammar (Langacker, 1987; Lakoff and Johnson, 1980; Talmy, 2000). In the latter framework, the parallelism is often attributed to metaphor and/or embodied cognition (Varela, Thompson, and Rosch, 1991), with spatial location and motion as the “source domain” from which the other domains are derived by analogy. Conceptual Semantics, however, takes the position that the expressions in (8)–(10) are not metaphors, in the sense of colorful extensions of spatial language and conceptualization. Rather, they are the ordinary way that English gives us to speak about these domains, and they share part of their structure with each other. Nevertheless, as in the Cognitive Grammar view, the spatial domain has extra salience, because of its richness (e.g., more than a single dimension), its perceivability, its support from spatial structure, and especially because of its role in guiding physical action. The upshot is that the choice of semantic field has to be a feature of word meanings, and at least some values of this feature are likely semantic primitives.
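One way to picture the proposal, offered only as a rough sketch (the attribute names and the flat dictionary format below are invented for illustration and are not Conceptual Semantics notation), is to let a verb’s lexical entry carry a schematic, field-neutral relation plus a set of admissible field values:

    # Illustrative encoding of a "semantic field feature" on verb entries.
    KEEP = {
        "relation": "CAUSE(x, STAY(y, z))",   # schematic, field-neutral core
        "fields": ["spatial", "possession", "property", "scheduling"],
    }
    GO = {
        "relation": "GO(y, from z1 to z2)",
        "fields": ["spatial", "possession", "property", "scheduling"],
    }
    # "Fred kept the cat on the roof"  -> KEEP with field = spatial    (7c)
    # "Fred kept the money"            -> KEEP with field = possession (8c)
    # "The cop kept the light red"     -> KEEP with field = property   (9c)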
7.5.3. Dot-objects Multiple ontological categories not only can distinguish different senses of a polysemous word, as shown in the previous section: they can also coexist within a single sense of a word. A well-known case is the word book (Pustejovsky, 1995; Pustejovsky and Batiukova, 2017). A stereotypical book is a physical object, consisting of bound pages with writing on them. However, the book also has informational content, and in that capacity, it belongs in the conceptual domain of information. Pustejovsky notates this dual allegiance as “object • information,” hence the term dot-object. As with climb in the previous section, one might contend that book is ambiguous between the two senses. But both senses can be attributed to the same token of the word, as in That book is over 400 pages long, but it has a great ending, with no sense of ambiguity. Instead, these two conditions stand in a preference rule relationship: an e-book is not a physical object but carries an appropriate amount of informational content; a blank notebook is a physical object but has no informational content (yet); but a stereotypical book is both physical and informational.
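Schematically, and again only as a toy rendering with invented labels, the dot-object behaves like a two-condition preference-rule concept whose conditions come from different ontological domains:

    # 'book' as a dot-object: a physical and an informational condition,
    # each defeasible, both together stereotypical.
    def book_status(physical_volume, informational_content):
        if physical_volume and informational_content:
            return "stereotypical book"
        if physical_volume or informational_content:
            return "book, less stereotypical"   # e-book / blank notebook
        return "not a book"

    print(book_status(True, True))    # bound, printed novel
    print(book_status(False, True))   # e-book
    print(book_status(True, False))   # blank notebook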
Lexical semantics 141 Dot-objects, like preference rules, are ubiquitous. Reading is a dot-activity—physically scanning a page and receiving informational content from it. A university is a dot-object, consisting of a collection of physical objects but also performing a complex social function: Tufts University, located in beautiful Medford, Massachusetts, offers degrees in 43 subjects. In this example, the physical existence has recently become defeasible, given the ascendance of online universities, but the social function is necessary. A more complex dot-action is stealing: moving an object (physical) that does not belong to one (social/legal) from one place to another (physical) so as to conceal it (physical) from the rightful owner (social/ legal); and this action has an associated negative moral value (moral). An extremely important dot-object in human conceptualization is the concept of a person, crucial to the understanding of all social and moral predicates. All cultures (that I have ever heard of) conceptualize a person as a linkage of an animate body in the physical domain with what is variously called a mind, a self, a spirit, or a soul (Bloom, 2004; Jackendoff, 2007), in the personal or social domain. Faces, hands, livers, and so on belong to the physical, while theory of mind, moral responsibility, rights, and obligations belong to the personal/social domain. To elaborate a little: Whatever cognitive neuroscience may tell us (e.g., Dennett, 1991; Damasio, 1994; Crick, 1994), these two aspects of a person are conceptualized as separable. A zombie is a body bereft of its soul. Ghosts, angels, gods, ancestors, and souls ascending to heaven are often conceived of as humanlike but without human bodies. Souls can come to inhabit different bodies through reincarnation, metamorphosis (as in Kafka and the frog prince), body-switching (as in Freaky Friday), and spirit possession. Multiple personality disorder is reportedly experienced as multiple persons competing for control of the same body (Humphrey and Dennett, 1989). An individual suffering from Capgras Syndrome experiences a loved one as an impostor who is physically indistinguishable from the person in question but with the wrong personal identity (McKay, Langdon, and Coltheart, 2005). In each case, personal identity (and hence moral responsibility) goes with the mind/soul. Kafka’s Gregor Samsa is still himself, trapped in the body of a giant cockroach; the mother and daughter in Freaky Friday trade bodies, not minds. We have no trouble understanding such situations, bizarre as they are. They are commonplace in fiction, legend, and religion. This suggests that this very complex dot- object concept of a person is either innate or a remarkably persistent cultural invention; I would vote for the former. In any event, the notion of personhood is central not only to lexical semantics but to social and cultural cognition in general.
7.6 Combinatoriality 7.6.1 Argument structure After ontological category, perhaps the most fundamental feature of word meanings, and the one addressed by virtually every theory of lexical semantics, is argument
142 Ray Jackendoff structure. This is a basic component of semantic compositionality and a crucial part of the interface between semantics and syntax. For instance, the meaning of a verb is a conceptualized situation, event, or action involving a certain number of entities, each of which is specified by an open typed variable—the verb’s semantic arguments. These variables are instantiated in language by the meanings of the verb’s syntactic arguments. Consider the sentence The lion chased the bear. The action of chasing requires two semantic arguments—the individual doing the chasing and the one being chased—and these are expressed as the syntactic subject and object of the sentence respectively. The “chaser” argument must furthermore be typed as animate and as having the intention of catching the “chasee.” The latter is also animate, but defeasibly so—one can chase a runaway vehicle, for instance. Thus, the meaning of chase can be thought of as a schematic event in which these two individuals are characters. Some approaches classify semantic arguments in terms of thematic roles (also called theta-roles), following Gruber (1965) and Jackendoff (1972). An entity being located or in motion is termed a Theme; an entity performing an action is an Agent; an entity being acted upon is a Patient; an endpoint of motion is a Goal, and so on. Depending on the theory, thematic roles are treated as features that a predicate assigns to its semantic (or even syntactic) arguments (Chomsky, 1981; Dowty, 1991; Tenny, 1994; Levin and Rappaport Hovav, 2005), or as organized in a separate argument structure “tier” that regulates the correspondence between syntax and semantics (Grimshaw, 1990). Alternatively, as argued in Jackendoff (1987b, 1990), thematic roles can be regarded as informal terms for structural configurations in semantic structure. For instance, Theme is the informal term for the first argument of the schematic event GO; Agent is the term for the first argument of the schematic event CAUSE, and so on. The relation between the semantic and the syntactic argument structure of a verb is often assumed to be one-to-one, so one can speak simply of a verb’s argument structure. This assumption surfaces in syntactic theory as Chomsky’s (1981) theta-criterion and Baker’s (1988) Uniform Theta Assignment Hypothesis (UTAH). However, exploration of the terrain reveals that this relation is actually quite intricate; I will only sketch it here. First, a verb can stipulate anywhere from zero to four semantic and syntactic arguments: (11) a. Zero arguments It’s drizzling. [where it is a meaningless syntactic argument, necessary (in English but not in, e.g., Spanish) to fill the subject position] b. One argument Dan is sleeping. The door opened. Sue sneezed. c. Two arguments The lion chased the bear. Bill fears snakes. d. Three arguments Amy gave Tom a present. Beth named the baby Bayla. e. Four arguments Henry traded Judy his candy bar for a balloon.
Lexical semantics 143 However, some semantic arguments of some two-to four-argument verbs need not be expressed syntactically (12). For instance, one cannot eat without eating something, so eat has two semantic arguments. However, the entity being eaten can be left implicit, so that Peter is eating has only a single syntactic argument. (12) a. Two semantic arguments; one or two syntactic arguments Peter is eating (a pizza). b. Three semantic arguments; one, two, or three syntactic arguments Diane served (us) (lunch). c. Four semantic arguments; two, three, or four syntactic arguments Henry sold (Judy) a bike (for $50). Some semantic arguments must be expressed syntactically as adjectives, prepositional phrases, or clauses. (13)
a. Beth considers Bayla awesome.
b. The bird flew out the window.
c. Tom put the book on the table.
d. Sam believes that the sky is falling.
e. Ezra managed to confuse everyone.
A few verbs have more syntactic arguments than semantic arguments. For instance, the actions of perjuring oneself and behaving oneself involve only one character. The reflexive direct object is a syntactic argument, but adds nothing to the semantics. You can’t perjure or behave someone else. Many verbs appear in multiple syntactic frames, sometimes with different semantic argument structure (14a, b), sometimes the same (14c, d) (see Levin, 1993 for hundreds of examples).
(14) a. The water boiled. Levi boiled the water. [≈ ‘Levi made the water boil’]
     b. The tank filled. Judy filled the tank. [≈ ‘Judy made the tank get full’]
     c. Amy gave Tom a present. Amy gave a present to Tom.
     d. Henry sold Judy his bike. Henry sold his bike to Judy.
Verbs are not the only words with argument structure. The meaning of a noun can also involve semantic arguments. Semantically, one cannot be a friend without being a friend of someone, mentioned or not; something cannot be a part or an end without being a part or an end of something. Nouns that are morphologically related to verbs typically inherit the verb’s semantic argument structure, whether expressed syntactically or not. For instance, the construction of a building semantically implies both an entity doing the constructing and an entity being constructed; a donation is something that one entity is donating to another.
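A compressed illustration of the point (the lexicon format, role labels, and the obligatory/omissible flags below are invented for exposition): a verb’s semantic arguments can be listed with a flag saying whether syntax must express them, so that a verb like sell has four semantic arguments but anywhere from two to four syntactic ones, as in (12c).

    # Toy lexicon: (role, type, syntactic status) triples per verb.
    LEXICON = {
        "eat":  [("eater", "animate", "obligatory"),
                 ("food",  "edible",  "omissible")],     # Peter is eating (a pizza).
        "sell": [("seller", "animate", "obligatory"),
                 ("goods",  "entity",  "obligatory"),
                 ("buyer",  "animate", "omissible"),
                 ("price",  "money",   "omissible")],    # Henry sold (Judy) a bike (for $50).
    }

    def syntactic_argument_range(verb):
        args = LEXICON[verb]
        obligatory = sum(1 for _, _, status in args if status == "obligatory")
        return obligatory, len(args)

    print(syntactic_argument_range("eat"))    # (1, 2)
    print(syntactic_argument_range("sell"))   # (2, 4)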
Adjectives too have semantic argument structure. One semantic argument is the entity of which the adjective is predicated, for instance the house in The house is big. But some adjectives also have further semantic arguments, whether syntactically expressed or not. For instance, one cannot be polite without being polite to someone or to people in general; a hypothesis cannot be interesting without someone or people in general being (assumed to be) interested in it. An important complication is the light verb construction, illustrated in (15).
(15) a. Joe took a walk. [cf. Joe walked]
     b. Kay gave Harry a hug. [cf. Kay hugged Harry]
     c. Don put the blame on Nancy. [cf. Don blamed Nancy]
In these cases, the verb has its ordinary syntactic argument structure, but the type of action being denoted and the semantic argument structure of the sentence are determined by the nominal, as seen in the paraphrases. Finally, meaningful constructions too have argument structure. For instance, the N of an N construction mentioned above (a gem of a result) has two noun positions to be filled in in syntax. Each corresponds to a semantic argument: the first to an evaluative term, and the second to the entity being evaluated.
7.6.2 Word decomposition into more primitive components
A prominent theme in theories of the mental lexicon, going back at least to Gruber (1965), is the analysis of word meanings in terms of what are purported to be more primitive components. For example, (16a) can be paraphrased fairly closely as (16b), and the same for (16c, d).
(16) a. John entered the room.
     b. John went into the room.
     c. John climbed the mountain.
     d. John went to the top of the mountain (in a clambering manner).
That is, enter X means approximately ‘go into X,’ and climb X means approximately ‘go upward, clambering along the surface of X, to the top of X.’ These paraphrase relations can be captured more formally by lexical decomposition, such that the word and its paraphrase come to have the same analysis. (17) illustrates the case of enter. GOspatial is a schematic event in which an individual X traverses a Path (17a). TO is a schematic Path that terminates at an object or location Y (17b); INTERIOR is a schematic location that consists of the space subtended by the interior of an object Z (17c). These three pieces of structure are combined in enter: it is a schematic event in which an individual X traverses a path that terminates in the interior of an object Z (17d).
(17) a. ‘go’ = [Event GOSpatial (X, PATH)]
     b. ‘to’ = [Path TO ([Object/Location Y])]
     c. ‘in’ = [Location INTERIOR (Z)]
     d. ‘enter’ = [Event GOSpatial (X, [Path TO ([Location INTERIOR (Z)])])]
This is perhaps clearer in a tree notation, in which double lines mark schematic entities and single lines mark their arguments.
(18) Event
     ╞═ GO
     ├─ X
     └─ Path
          ╞═ TO
          └─ Location
               ╞═ INTERIOR
               └─ Z
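As a rough gloss on this notation (nested tuples standing in for the labeled brackets; the rendering is illustrative only and makes no claim about the actual mental encoding), the decomposition in (17d)/(18) can be written as a single nested structure whose open slots correspond to the verb’s syntactic arguments:

    # 'enter' as a nested structure mirroring (17d): GO(X, TO(INTERIOR(Z))).
    ENTER = ("Event", "GO_spatial",
             "X",                                  # open slot: the mover (subject)
             ("Path", "TO",
              ("Location", "INTERIOR",
               "Z")))                              # open slot: the container (object)
    # 'John entered the room' fills X with John and Z with the room;
    # the path (TO) and the location function (INTERIOR) come from the verb.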
One can therefore think of enter as “incorporating” the meaning of the prepositions to and in. The upshot in syntax is that enter ends up as a syntactically transitive verb.14 A more complicated case is illustrated in (19), for which approximate paraphrases appear in (20): (19) a. John buttered the toast. b. John saddled the horse. (20) a. butter X ≈ ‘put butter on X (in the fashion that butter is meant to be used)’ b. saddle X ≈ ‘put a saddle on X (in the fashion that saddles are meant to be used)’ Such relationships have been claimed to be captured by a syntactic operation of “noun incorporation” (e.g., McCawley, 1968; Baker, 1987; Hale and Keyser, 1993), which moves the noun butter from direct object position up to the verb node. But the full semantics does not follow from ‘put Z on X.’ One cannot butter bread by simply putting a stick of butter on it, and one cannot butter a table at all. Rather, buttering involves using butter in the fashion it is meant to be used, that is, spreading it on bread or a pan—what Millikan (1984) calls the proper function of butter and Pustejovsky (1995) calls the telic quale of the word butter. In other words, the meaning of the verb is
14 Note that GO, TO, and INTERIOR all have counterparts in spatial structure. Note also that even if they are not themselves primitives, they nonetheless allow us to eliminate a putative primitive ENTER.
constructed in part from the noun’s internal semantic structure, which is nowhere reflected in its syntax.15 The intuition behind noun incorporation can be captured instead through the structure of lexical semantics. Put decomposes as ‘X causes Y to go to location Z’ (21a). The verb butter specifies the variable Y as ‘butter’ and location Z as ‘onto object W.’ In addition, it spells out a Manner in which the butter moves, not formalized here, but likely made explicit in spatial structure. The result is (21b), or, in tree notation, (22).
(21) a. ‘put’ = [Event CAUSE (X, [Event GO (Y, [Path TO ([Location Z])])])]
     b. ‘butterV’ = [Event CAUSE (X, [Event GO (BUTTER, [Path TO ([Location ON (W)])]; Manner . . .)])]
(22) Event
     ╞═ CAUSE
     ├─ X
     └─ Event
          ╞═ GO
          ├─ BUTTER
          ├─ Path
          │    ╞═ TO
          │    └─ Location
          │         ╞═ ON
          │         └─ W
          └─ Manner . . .
The result is that the verb butter ends up as a syntactically transitive verb whose direct object denotes the object on which the butter is placed—the endpoint of motion. The entity in motion, the butter, is incorporated into the meaning of the verb and plays no role in the syntax.
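In the same toy notation as before (again purely illustrative, not the theory’s own formalism), the contrast between put and butter is that butter pre-fills the Theme slot of GO with BUTTER and fixes the location function as ON, leaving only X and W open; this is why only the toast surfaces as a syntactic argument.

    # (21a)/(21b) rendered as nested structures; BUTTER is lexically filled.
    PUT      = ("Event", "CAUSE", "X",
                ("Event", "GO", "Y",
                 ("Path", "TO", "Z_location")))
    BUTTER_V = ("Event", "CAUSE", "X",
                ("Event", "GO", "BUTTER",                  # Theme fixed by the verb
                 ("Path", "TO", ("Location", "ON", "W")),  # endpoint W stays open
                 ("Manner", "...")))                       # spread in the proper fashion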
7.6.3 Other I have discussed here only one combinatorial property of lexical items, namely argument structure. Many others can be mentioned, without any intended prejudice against still others I have not mentioned: 15
Fodor (1981) similarly argues that the verb paint cannot be defined as ‘put paint on X,’ because when Michelangelo dipped his brush in the paint, he was putting paint on the brush, but he was not painting the brush. Adding proper function to the conceptual structure takes care of this problem: the proper function of paint, roughly, is to intentionally apply it as a liquid to cover a surface in a certain manner, with the intention of having the paint dry and thereby covering the surface. In turn, this splits into two senses: painting a wall or painting a picture.
Lexical semantics 147 (a) Modification adds information to a semantic head that is not dictated by argument structure, such as prenominal adjectives (brown cow, terrible theory), measure phrases (five-foot fence), relative clauses (the man who came to dinner), manner and time adjuncts to verbs (run quickly, eat often), degree markers on adjectives (extremely funny), conditional and purpose modifiers (if he comes, in order to leave), among many heterogeneous possibilities. (b) Anaphoric items require an antecedent in order to specify their full interpretation in context. These include not just definite pronouns but also the identity-of- sense pronoun one, non-identity pronouns such as someone else, the reciprocal each other, pro-VPs such as do so, and elliptical constructions such as vice versa. (c) Quantifiers such as each, some, and many create scope relations over the syntactic/semantic configurations in which they are embedded. (d) Negative polarity items such as any, yet, and the collocation lift a finger can appear only in a particular set of modal contexts. (e) Restrictors such as only and even are sensitive to information structure (focus) and create presuppositions against which the focus is judged. (f) Meaningful constructions such as cleft and pseudocleft—which are lexical items in the extended lexicon of Section 7.1.1—designate their free argument as focus in information structure. (g) Discourse connectives such as however, specifically, and the collocations by the way and for example specify a semantic relation between the clause they are in and the surrounding discourse. (h) Intonation contours convey meaning, usually of information structure but also attitudes such as sarcasm. These pairings belong in the extended lexicon (where else?). (i) Words and meaningful constructions can convey speech register, which may or may not be considered part of lexical meaning but has to be encoded in memory somehow.
7.7 Word knowledge vs. world knowledge? Returning to a point hinted at in Section 7.1.1: All theories of lexical decomposition (including my own) encounter a serious problem. Their analyses typically leave a residue of unanalyzed material, such as BUTTER and the manner of using it in (21b). Many researchers have sidestepped this problem by proposing that such bits of meaning are not relevant to linguistic semantics per se; rather, they belong in “world knowledge” or “encyclopedic knowledge.” For instance, Katz and Fodor (1963) make a distinction between “semantic markers” (the linguistically relevant material) and “distinguishers”
148 Ray Jackendoff (the residue); Katz (1972) distinguishes “dictionary” from “encyclopedic” information; Bierwisch and Lang (1989) and Levinson (1997) separate “semantic form” from “conceptual structure”; and Lieber (2019) separates a structured “semantic skeleton” from an unstructured “semantic body.” Other approaches (e.g., Bolinger, 1965; Jackendoff, 1983, 2002; Lakoff, 1987; and Langacker, 1987) argue that these residues cannot be disregarded. They are part of lexical meaning; they are intimately involved in inference and reference. These approaches therefore take the position that there can be no principled boundary between word meanings and world knowledge. In some cases, the residue can be attributed to spatial structure. That might be the case with butter, where both the substance butter and the action of buttering have spatial components. However, this is not always possible. Here are two representative cases. First consider that staple of linguistic semantics, bachelor = ‘unmarried adult male human.’ Lakoff (1987) points out that this analysis does not account for presupposed sociocultural context, for instance that the Pope would not be characterized as a bachelor. Either this context has to be part of the word’s lexical entry, or, if it is “world knowledge,” it must somehow be connected directly or indirectly to the lexical entry. Moreover, the “core” of the meaning is itself problematic: the feature ‘unmarried’ has to tie into understanding of the complex social institution of marriage. Is this information part of the lexical entry of bachelor, or is it world knowledge? How would we decide, and what difference would it make? An important consideration is that the very same information must also be part of the lexical entries for the words marry and marriage. Hence, lexical semantics cannot forgo responsibility for analyzing it. For another case, return to the verb sell and its four arguments, as in (23) (= (12c)). (23) Henry sold Judy a bike for 50 dollars. At the very least, the lexical entry has to include two subevents: Henry giving a bike to Judy (the “transfer”), and Judy giving Henry 50 dollars (the “counter-transfer”). It wouldn’t be selling otherwise. The lexical entry further has to stipulate that the counter- transferred entity is money (by contrast with trade, whose counter-transferred entity is likely not money). Should the lexical entry include all the properties of money? These properties certainly have to be specified anyway in the lexical entries for money and dollar. Should the lexical entry include all the logic of possession? In any event, this logic has to be in the lexical entry of possession verbs such as own and give away. One’s understanding of sell further has to say that the two subevents are not independent—it isn’t as though Henry and Judy, unpremeditated, happen to give each other nice presents. Rather, the two subevents constitute a joint action to which they have both agreed and from which each of them expects to benefit. In turn, the logic of joint actions (Gilbert, 1989; Searle, 1995; Bratman, 1999; Jackendoff, 2007) stipulates that each participant is obligated to the other to carry through his or her side of the deal. Defecting on an obligation is furthermore morally bad, and if one participant defects,
the other has the right to punish him or her in some way. In turn, punishing someone is justifiably doing something bad to them in return for something bad they have done. All this material has to be listed somewhere in the entries of lexical items like obligation, collaborate, in return for, and punish. Should it also be in the lexical entry of sell, or should it be “world knowledge” somehow associated with sell? And if the latter, how is this association encoded? Finally, other verbs can be used to convey the same event, with different syntactic argument structures, and from different perspectives.
(24) a. Judy bought a bike from Henry for $50.
     b. Judy paid Henry $50 for a bike.
     c. The bike cost Judy $50.
     d. Judy spent $50 on the bike.
     e. Henry got $50 for the bike.
This suggests that one’s repertoire of concepts includes an abstract “transaction frame” in the sense of Fillmore (1982b) (or a “strip of behavior” in the sense of Goffman, 1974), which encodes all the details of money, joint action, obligation, and so on. Such a frame is not restricted to the lexical entry of an individual transaction verb; rather, it is invoked by all of them. Again, it is not clear whether this frame is to be considered “lexical” or “world” knowledge. The upshot of these cases is that what might be considered one word’s “world knowledge” is often part of another word’s “lexical entry.” Hence, knowledge of words and knowledge of the world are apparently built out of a shared repertoire of basic units. Not only is there no principled line between them, it is not clear that one should want one, other than as a convenient limit on how deep one wants one’s analysis to go. Even if one manages to achieve plausible decompositions of word and world knowledge, there remains an intuitive distinction between “dictionary” and “encyclopedic” knowledge. What is listed in the lexical entry of cat? Presumably, no one would doubt that it includes what cats look like, encoded in spatial structure, and that cats are kept as pets, encoded in conceptual structure. But what about the (defeasible) fact that cats hunt mice and the (nondefeasible) fact that cats have livers? These seem to lie somewhere in between. And if the fact that cats hunt mice is listed, does the lexical entry for mouse include the fact that cats hunt them? I have no clear intuitions about this problem (though Dor (2015) may be on the right track).
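The transaction frame invoked above can be pictured schematically as follows (a sketch only: the role names, the format, and the mapping from verbs to perspectives are simplifications introduced for illustration): one shared frame, with the verbs in (23)–(24) differing mainly in which of its roles they foreground.

    # One shared "transaction" frame, invoked by several verbs.
    TRANSACTION = {
        "roles": ["seller", "buyer", "goods", "money"],
        "subevents": ["seller transfers goods to buyer",
                      "buyer transfers money to seller"],
        "constraints": ["joint action", "mutual obligation"],
    }
    PERSPECTIVE = {            # subject role, then the other foregrounded role
        "sell":  ("seller", "buyer"),   # Henry sold Judy a bike for $50.
        "buy":   ("buyer",  "seller"),  # Judy bought a bike from Henry for $50.
        "pay":   ("buyer",  "seller"),  # Judy paid Henry $50 for a bike.
        "cost":  ("goods",  "buyer"),   # The bike cost Judy $50.
        "spend": ("buyer",  "goods"),   # Judy spent $50 on the bike.
        "get":   ("seller", "goods"),   # Henry got $50 for the bike.
    }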
7.8 Closing remarks The goal of this chapter has been to motivate a theory of lexical semantics that is expressive enough to account for a significant range of phenomena. We have sketched a considerable collection of machinery: geometric spatial structure, featural and algebraic
150 Ray Jackendoff conceptual structure, preference rules, ontological categories, cross-field parallels, dot- objects, argument structure, and decomposition into more primitive functions. Each of these has been motivated on the basis of rather simple examples, and each has been shown to be broadly applicable. Along the way, we have touched on the desiderata of compositionality, polysemy, reference, and the problem of primitives. Alas, we have had nothing to say here about inference, translatability, or learnability. Most important, though, has been our emphasis on lexical semantics as a mental phenomenon, deeply connected to and supported by the human conceptualization of the world. Indeed, we have argued that many of the components discussed here are necessary to mental phenomena outside of language. One might feel that the repertoire of components discussed here is an embarrassment of riches, that it is too expressive. I would respond that the main challenge semantic theory faces at present is to be expressive enough. The full range of human conceptualization is far richer and more complex than syntax and phonology. It is amazing that anything as complex as human thought can be squeezed through the relatively limited interface of syntax, phonology, and phonetics, and still be understood. Evidently, much depends on pragmatics (Schwarz and Zehr, this volume). On the other hand, vision is much the same: it’s equally amazing that the limited degrees of representational freedom in the retina can give rise to such rich spatial understanding. Nevertheless, a full decomposition of lexical meaning and associated world knowledge into cognitively plausible primitives is for the present elusive. One might therefore be inclined to reject the position that the mind encodes concepts in terms of a finite underlying system. I would regard this as a mistake. Even the most empiricist theory has to claim that concepts are mentally encoded in some format or another. This pertains even to immediate experience: one cannot experience a sunset (or an experimental stimulus) unless one’s brain represents it somehow. The task for semantics and for cognitive science as a whole is to discover the properties of these formats of mental representation, whatever they are. And all we can do is continue to pick away at the phenomena, in the hope that we are getting closer to the bottom. I have tried here to illustrate a small portion of the process.
Chapter 8
Logic and the Lexicon
Insights from modality
Valentine Hacquard
8.1 Introduction Formal semantics is the study of linguistic meaning using tools from modern logic. Logic is the study of inference, not language, and abstracting away from natural languages was critical to its development: “work in logic just is, to a large extent, a struggle with the logical defects of language,” wrote Frege in 1915. Yet the struggle produced tools that proved useful to linguists. This was due in part to the realization that the perceived “logical defects” arise not only from language itself but also from the complexity of its use. Speakers do more than assert premises, and in various ways, they exploit context to mean more than their words do (Frege, 1918; Austin, 1962; Grice, 1989). If these “pragmatic” aspects of meaning are stripped away, the result is generally more amenable to logical treatment. For instance, if I say, “Jo got angry and left,” I probably mean that Jo got angry before leaving. But this implication of chronology need not come from “and” itself. It might arise instead from a shared presumption that speakers will often narrate events in their natural order (Grice, 1975). If that is right, it might be that the best semantics for “and” treats it as the Boolean conjunction, a meaning drawn from the logical toolkit. In this way, the logician’s struggle has yielded not only formal models for certain lexical meanings (those that participate in general patterns of inference) but also important lessons on their tortuous relation to meaning in language use. My focus in this chapter is the modal vocabulary: words about possibility and necessity, like “might,” “can,” and “must.” These offer an especially rich study in the linguistic representation of “logical vocabulary.”1 On the one hand, logic provides a highly 1
This chapter focuses on modals, as an instance of logical vocabulary. See Jackendoff, this volume, for an overview of lexical semantics.
152 Valentine Hacquard developed formalization of modal inferences, involving operators that correspond somehow to our modal words (Carnap, 1947; Prior, 1957; Kripke, 1959; Hintikka, 1963). On the other, what speakers do in using modal words seems to go beyond what the words themselves can contribute. What lexical meaning do speakers actually assign to these words, and what guides the child to this result? Similar questions arise throughout the lexicon (e.g., quantifiers, content-word analyticities, conditionals), so we can apply the lessons of this discussion widely. The modal vocabulary instantiates a remarkable human property. We can talk about states of affairs beyond the here and now: how the world would be if dinosaurs had not mostly died out or if all of our needs or desires were to be realized. We can distinguish the merely possible from the necessary, the impossible from the unlikely. Modality is the category of linguistic meaning that enables such talk. Many linguistic expressions encode notions of possibility or necessity, including nouns (possibility), adjectives (possible), adverbs (maybe), verbs (require), modal auxiliaries (must), and semi-modals (have to). This chapter focuses on modal auxiliaries and semi-modals, around which theories of modality have been developed, but contrasts their behavior with modal expressions from other lexical categories. By using modals, speakers can express certainty or uncertainty about different possibilities (as with (1a) and (1b)), or orders and permissions (as with (2a) and (2b)). But what exactly do words like must and may contribute, and what is instead contributed by their grammatical context, or by speaker intentions and the circumstances of their use? As we will see, our modal statements involve a complex interplay of morphology, syntax, semantics, and pragmatics, which make it challenging to pinpoint the exact lexical contributions of the modal words themselves. (1) a. Jo must have done it. b. Jo may have done it. (2) a. You must go. b. You may go. First, it is not always easy to tease apart the semantic and pragmatic contributions involved in a modal statement. A natural use of (2a), for instance, seems to issue a command, similar to the one issued with the imperative “Go!”. But is the command part of the literal meaning of (2a), or might it instead arise indirectly, in virtue of the speaker merely describing a necessity? The other complicating factor is the fact that modals can be used to express different kinds of possibilities and necessities. The sentences with must and may in (1) most naturally express “epistemic” modality: what is required or allowed by the available evidence; those in (2) most naturally express obligations and permissions (so-called “deontic” modality). How many musts and mays do we have in our lexicon? Are there distinct epistemic and deontic versions of the modals, or is there just one must, and just one may, as Angelika Kratzer has influentially argued (1977, 1981)?
Logic and The Lexicon 153 This chapter starts with a survey of how possibilities and necessities are encoded in natural language, with an eye toward cross-linguistic similarity and variation. Section 8.3 introduces the framework that formal semantics inherited from modal logic to analyze modal statements. Section 8.4 turns to the division of labor between semantics and pragmatics for modal statements, and Section 8.5 zooms in on the lexical contribution of the modals themselves.
8.2 Expressing possibility and necessity in natural language 8.2.1 Notional vs. grammatical modality I will use the term grammatical modality (Traugott, 2011) to refer to words or morphemes from a dedicated lexical category that express possibility or necessity: in English, modal auxiliaries (can, must, may, might, could, should, will, would) and semi-modals (have to, ought to). This is in contrast with notional modality (Kratzer, 1981), a term that applies to words or morphemes from any lexical category that encode notions of possibility or necessity. Syntactically, modal auxiliaries belong to the functional domain of the clause; they are in complementary distribution with tense and other auxiliaries like do; they do not bear agreement morphology (4c), nor do they allow do-support (4a–b), unlike regular verbs (3a–c). Semi-modals are often grouped with modal auxiliaries because they can be used to express the same range of meanings, but they behave syntactically more like verbs: they require do-support (5a–b), and bear tense/ agreement morphology (5c). (3) a. Did you leave? b. You did not leave. c. He leaves.
    a’. *Left you? b’. *You left not. c’. *He leave.
(4) a. *Did you can leave? b. *You did not can leave. c. *He cans leave.
    a’. Can you leave? b’. You can’t leave. c’. He can leave.
(5) a. Did you have to leave? b. You did not have to leave. c. He has to leave.
    a’. *Have you to leave? b’. *You haven’t to leave. c’. *He have to leave.
Notional modality, however, can be found across all lexical categories, including nouns (possibility, necessity . . . ), adjectives (possible, necessary . . . ), adverbs (maybe,
154 Valentine Hacquard necessarily . . . ), and verbs (require, allow . . . ). Grammatical modals have received the most attention in the literature and form the empirical basis for formal accounts of modality. However, these modals sometimes show quirks; it is thus useful to contrast their behavior to that of adjectives or verbs that express similar meanings, to tease apart peculiarities that stem from their lexical category from properties essential to notional modality more broadly.
8.2.2 Modal force and modal flavor The meanings of modals vary along two main dimensions: “force” and “flavor.” We will consider each dimension in turn, and briefly survey variations that we find across languages. Modal logic distinguishes two main modal forces, possibility, and necessity: possibilities leave other possibilities open, necessities do not. The modal auxiliaries of English can be split into these two categories, as shown in (6). (6) a. Possibility modals: can, could, may, might b. Necessity modals: must, should, have to, ought to Modal expressions in natural language, however, can express finer- grained distinctions than this dichotomy suggests. The example in (7), for instance, illustrates a difference between the modals must and should, which both express necessity, but a seemingly “weaker” one for should: must is associated with a mandatory requirement, should with more of a recommendation. (7) Employees must wash their hands. Everyone else should. (Fintel and Iatridou, 2008) And while grammatical modals seem restricted to possibility and (weak and strong) necessity, other lexical categories, like nouns or adjectives, can encode even finer shades of possibility: (8) a. It’s more likely that Jo did it than Al. b. There’s a slight possibility that Jo did it. Modal statements express possibilities and necessities allowed by various sorts of consideration, leading to different “flavors” of modality. Epistemic modality (from Greek episteme ‘knowledge’) expresses what is possible or necessary given what is known (the available evidence), and circumstantial modality, given certain circumstances, with abilitive modality a special subcase focused on the subject’s physical abilities. Modals can further express what is possible or necessary given different priorities (Portner, 2009), such as rules for deontic modality (Greek deon ‘obligation’), desires for bouletic modality (Greek boule ‘wish’), or goals for teleological modality (Greek telos ‘goal’).
Finally, metaphysical modality expresses what is possible given a certain history; this is the modality involved in counterfactuality. The following examples illustrate:
(9) a. Jo {might/must} be the murderer (given what we know). [epistemic]
    b. Jo can lift 200lbs (she is very strong). [ability]
    c. I have to sneeze (my nose is tickling). [circumstantial]
    d. Participants {may/have to} register online (according to the rules). [deontic]
    e. You {could/should} try the bisque! (I’d love it if you did!) [bouletic]
    f. You {can/have to} take a cab (to get to the conference). [teleological]
    g. I {could/would} have won, if I hadn’t twisted my ankle. [metaphysical]
As we will see, epistemic modality tends to pattern differently in its interactions with elements like tense from the other flavors, which themselves tend to pattern together, and are often subsumed under the label “root” modality (Hoffmann, 1966). Modality is often distinguished from two other categories that express related notions: attitude predicates (think, want . . . ), which express mental states (belief, desire . . . ), and evidentials, which encode source of evidence in languages like Korean or Quechua. It can be difficult to know in what category a particular lexical item falls (certain words or morphemes encode more than one category) and some analyses even merge these categories. Here we will assume that modality is distinct from both: attitudes express mental states, while modals express possibilities and necessities relative to such mental states. Evidentials indicate the source of evidence, while epistemic modality expresses certainty based on this evidence. A striking fact about modals in a language like English is that they can be used to express different modal flavors. Must and have to, for instance, can be used to express epistemic, circumstantial, deontic, bouletic, and teleological necessity. This feature tends to be restricted to grammatical modals (e.g., English modal auxiliaries), and seems rather common across the world’s languages: about half have at least one modal that can be used to express epistemic and deontic flavors (van der Auwera and Ammann, 2011). In some languages, the same word can be used in situations where English speakers would use either a possibility modal or a necessity modal. Such “variable force” modals have been documented in Nez Perce (Deal, 2011), illustrated in (10), St’at’imc’ets (Rullmann, Matthewson, and Davis, 2008), Washo (Bochnak, 2015), and even Old English (Yanovich, 2016).
(10) ’inehne-no’qa’  ee   kii    lepit  ciickan.
     take-MOD        you  these  two    blankets
     ‘you can take these two blankets.’
     ‘you must take these two blankets.’          Nez Perce (Deal, 2011)
Finally, some languages only have modals that can express a single flavor in a single force (e.g., Javanese; Vander Klok, 2013).
156 Valentine Hacquard The rest of this chapter largely focuses on languages with a modal system like English. Summing up its key features, we find a dedicated grammatical category for modals, though modal meanings can be found throughout the lexicon. Grammatical modals are restricted to possibility and necessity in terms of force; however, they can be used to express different flavors. Words from other lexical categories can express finer shades of possibility, but they typically express a single flavor.2
8.2.3 Possibility and necessity modals: logical inferences Under all flavors, possibility and necessity modals show the same patterns of entailments and logical equivalences as the quantifiers some and every in the nominal domain. This is nicely illustrated in Fintel and Heim (2011) with a pair of antonyms like leave and stay (leave=not stay, stay=not leave): (11) a. You must leave. b. You may leave.
c. It’s not the case that you may stay. d. It’s not the case that you must stay.
(12) a. Everyone left. b. Someone left.
c. It’s not the case that someone stayed.
d. It’s not the case that everyone stayed.
(11a) entails (11b), just like (12a) entails (12b):3 if the (a) sentence is true, the (b) sentence has to be true as well. Thus, following (a) with the negation of (b) results in a contradiction, indicated with “#”: #You must leave, but you may not leave. (11a) and (11c) are logically equivalent (they entail each other), just like (12a) and (12c). (11b) and (11d) are too, as are (12b) and (12d). Standard semantic approaches to modals derive these equivalences via a quantificational analysis: while every and some involve quantification over individuals, must and may involve quantification over “possible worlds.” Necessity modals are universal quantifiers: they quantify over all (relevant) worlds, just like everyone quantifies over all (relevant) individuals. Possibility modals are existential quantifiers: they quantify over some (relevant) world, just like someone quantifies over some (relevant) individual. We go over this analysis in the next section.
2 As a potential counterexample, the adjective possible can be used to express epistemic and root possibility. Note, however, that the flavor depends on the finiteness of its complement: it’s possible that Jo is the murderer (epistemic) vs. it’s possible for Jo to register online (root). 3 Assuming the domain is not empty, that is, that there is at least one individual.
8.3 Capturing force and flavor in modal logic The most standard analyses of modals in formal semantics derive from modal logic, where they are treated as quantifiers over possible worlds (Carnap, 1947; von Wright, 1951; Kanger, 1957; Hintikka, 1961; Kripke, 1963). Possible worlds can be viewed as possible “ways things could have been” (Lewis, 1973). There are countless ways things could have been, each of which represents a different possible world. We can imagine a world just like ours, but where my doctor’s appointment started on time. We can imagine another one where it started one minute late, another where it started two minutes late, or one where the doctor didn’t show up at all, and so on. Each of these possibilities can be viewed as a different world. Possible worlds raise fundamental issues as to their metaphysical and psychological plausibility, which semanticists acknowledge, but typically put aside. They assume that human languages have the capacity to represent alternative states of affairs and use possible worlds as a mere formal tool to represent such alternative states of affairs in the language of the semantic theory. Possible worlds allow us to formally model the displacing role of modals into states of affairs beyond the here and now. The truth of any statement is evaluated relative to our world, which serves as the world of evaluation. For instance, the sentence “Jo is home” is true in our world if Jo is home in our world. Modals (and other so- called intensional operators, such as attitude verbs) are special in shifting the world of evaluation: the truth of a modal statement in our world depends on the truth of the proposition expressed by the modal’s complement (its “prejacent”) in some other world(s). For instance, regardless of whether Jo actually is home, the sentence “Jo might be home” is true in our world, call it w, just as long as there is a relevant world, call it w’, where Jo is home. Necessity modals are treated as universal quantifiers: they quantify over all relevant worlds. In all of these worlds, the proposition expressed by the modal’s prejacent is true (logicians use the symbol ▫ “box” for necessity). Possibility modals are treated as existential quantifiers: they quantify over some relevant world; in some such world, the proposition expressed by the prejacent is true (logicians use the symbol ⬨ “diamond” for possibility). (13) illustrates this with truth conditions for the possibility and necessity statements in (11): (13) a. “You must leave” is true in w if in all (relevant) worlds w’ you leave b. “You may leave” is true in w if in some (relevant) world w’ you leave This quantificational treatment captures the logical inferences above. If you leave in all worlds, then there is some world in which you leave: (11a) entails (11b). If you leave in all worlds, then there is no world in which you stay: (11a) entails (11c) and
158 Valentine Hacquard vice versa. If you leave in some world, then you do not stay in all worlds: (11b) entails (11d) and vice versa. This analysis thus derives the relevant relations between possibility and necessity modal statements. However, it doesn’t allow for further graded notions of possibility or necessity, as quantification is either over some or all relevant worlds. The quantificational analysis captures the flavor dimension of modality by restricting the domain of quantification to various sets of worlds. With deontic modality, the relevant worlds are those compatible with relevant rules in our world; with bouletic modality, those compatible with relevant desires; with epistemic modality, those compatible with what is known. What does it mean for a world to be compatible with relevant rules? Imagine that the rules in our household are that children go to bed before 8pm, that they brush their teeth twice a day, that they do not watch TV, and do not whine. A deontically ideal world from our world’s perspective then is one where all children go to bed before 8pm, brush their teeth twice a day, and never watch TV, or whine. The deontic necessity statement in (14a) is true in our world w if in all such deontically ideal worlds, Jo goes to bed. The deontic possibility statement in (14b) is true if Jo reads a story in at least one of these worlds, even if she doesn’t necessarily do so in all of them. Epistemic modality involves worlds in which all of the known facts in our world hold. These may include the fact that Jo is not in the living room, that it’s 8pm, that Jo likes to go to bed after dinner, and so on. The epistemic necessity statement in (14c) is true in our world w if in all worlds in which all of these facts hold, Jo is in bed. (14) a. “Jo must go to bed.”deontic is true in w if in all w’ compatible with the rules in w, Jo goes to bed. b. “Jo may read a story.”deontic is true in w if in some w’ compatible with the rules in w, Jo reads a story. c. “Jo must be in bed.”epistemic is true in w if in all w’ compatible with what is known in w, Jo is in bed. Note that (14) gives truth conditions for the entire modal statements, without addressing yet the exact contribution of the modal words themselves. This question will be the focus of Section 8.5. Note further that truth conditions is all that is provided in (14): what these truth conditions amount to are descriptions of worlds that are ideal from our world’s perspective. But a speaker uttering a sentence like (11a) or (14a) often seems to do more than merely describe ideal states of affairs: she seems to demand that you leave or that Jo go to bed. Similarly, the use of (11b) or (14b) seem to grant a permission for you to leave or Jo to read a story. The next section addresses how such demands and permissions come about: are they part of the conventional meaning of a must or may statement, or a pragmatic by-product of an act of describing a necessity or possibility?
8.4 Modal statements: sentence meaning vs. utterance meaning Speakers often mean more than the words they utter. We thus need to distinguish the literal meaning of sentences from the speaker meaning that speakers convey in using them. The former falls within the purview of semantics, the latter that of pragmatics (see Schwarz and Zehr, this volume). Modals are often used in textbook examples of speaker meaning going beyond literal meaning. Here we discuss two cases: the first involves so- called “scalar implicatures,” the second “indirect speech acts.”
8.4.1 Scalar implicatures Recall the logical equivalences in (11) and reconsidered in (15) and (16) below. As we saw, the (a) and (c) sentences entail each other. These entailments are a semantic matter: they follow from the literal meaning of the sentences. The (a) sentences also seem to imply the (d) sentences (which correspond to the negation of the (b) sentences). However, these inferences are not entailments: (a) can be true, but (d) false. “Someone left, in fact everyone did” is not a contradiction, and neither is “You may leave, in fact you have to.” (15) a. You may leave. b. You must leave.
c. It’s not the case that you must stay. d. It’s not the case that you must leave.
(16) a. Someone left. b. Everyone left.
c. It’s not the case that everyone stayed. d. It’s not the case that everyone left.
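In the box and diamond notation introduced above, the semantic relations behind the (a)/(c) pairs are the familiar dualities of necessity and possibility, and the quantifiers in (16) pattern in exactly the same way; the step from (a) to (d), by contrast, is not a theorem of the logic, which is why it calls for a pragmatic explanation. A compact restatement (not the chapter's own formulation):

\[ \Diamond p \equiv \lnot\Box\lnot p, \qquad \Box p \equiv \lnot\Diamond\lnot p, \qquad \exists x\,P(x) \equiv \lnot\forall x\,\lnot P(x), \qquad \forall x\,P(x) \equiv \lnot\exists x\,\lnot P(x) \]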
Grice (1975) coined the term "implicature" for speaker meanings that give rise to such cancelable inferences. Hearers grasp a speaker's implicatures by considering what they literally said and explaining this choice in relation to what they could have said but chose not to. A speaker utters (15a). She could have said something stronger, namely, the necessity statement in (15b). By using the weaker statement in a context where the stronger statement would have been relevant, she might intend to convey that the stronger statement does not hold: if it had, she would have said it instead. The same reasoning holds for (16): by uttering (16a) in a context where (16b) would have been relevant, the speaker can imply that that stronger statement doesn't hold. These types of implicatures are called "scalar,"4 because they involve "scales" (Horn, 1972), which are conventionalized associations of lexical items, ordered in terms of informativity (<some, all>, <may, must>, . . . ). Because of their conventional nature, scalar implicatures tend to be fairly routinely triggered.
4 Scalar implicatures are a subtype of "quantity implicatures," where Grice's maxim of quantity ("Make your contribution as informative as is required") comes in apparent conflict with the maxim of quality ("Don't say what you believe to be false or lack adequate evidence for"). The implicature is understood by reasoning that the speaker is not in a position to assert a more informative statement, and thus, given compliance with quality, must either not have enough evidence for it, or believe it to be false.
8.4.2 Indirect speech acts
Modals are often featured in indirect speech acts, with which a speaker performs an illocutionary act by way of performing another, direct, one (Searle, 1969). A classic example is shown in (17). The direct act performed with (17)—the one that aligns with its literal meaning—is a question about whether the addressee has a certain ability. But (17) is most naturally used as an (indirect) request to pass the salt (Searle, 1975). This request arises by reasoning that the ability in question is so trivial that the speaker couldn't possibly just want to know about it; instead she must want that ability instantiated. Speakers often use such circumlocutory ways to soften requests and avoid issuing direct commands.
(17) Can you pass the salt?
In examples like (17), it is rather clear that the two illocutionary acts (the question and the request) are distinct. But with some modal statements, the line between direct and indirect act is often blurred. Because modal statements typically express possibilities and necessities relative to the speaker's beliefs or desires, it is sometimes difficult to tell whether a modal statement merely describes an epistemic or deontic possibility consistent with her beliefs or desires, or whether the speaker directly proffers some degree of confidence or issues a command. Both positions are actively debated in the literature.
A speaker can use a deontic modal claim to grant a permission or issue an obligation. Are these acts part of the conventional meaning of the modal statement, or do they arise indirectly? A natural use of (18a), for instance, issues an order not unlike the imperative "Stay!". Is this order part of the literal meaning of (18a), or does (18a) merely describe an obligation? To get a feel for the difference, consider (18b). (18b) is not a direct order: it describes an ideal state of affairs for me, the speaker. However, by using it, I can indirectly urge my addressee to stay, by letting her know that it would bring me happiness. We know that (18b) does not directly encode an order to stay because it can be followed by an imperative ordering the addressee to go ("It'd make me happy if you stayed, but go! I know you need to"). The situation is, however, less clear with (18a), as following it with the same imperative seems infelicitous (#"You must stay, but go!"). This kind of infelicity has led to proposals where modals like must have a "performative dimension," that is, that part of their conventional meaning is the issue of a command to the addressee (see Ninan, 2005; Portner, 2009). Alternatively, under the view that (18a) merely describes a necessity, the infelicity could be due to the described necessity being relative to rules endorsed by the speaker.
(18) a. You must stay.
b. It'd make me happy if you stayed.
A speaker can use an epistemic modal claim to express her certainty about some state of affairs. Because of this, epistemic modals have sometimes been treated as not directly contributing to the literal, truth conditional, content of the sentence in which they appear, but as mere indicators of the speaker's degree of certainty (Halliday, 1970; Palmer, 2001; Swanson, 2006, a.o.). Alternatively, under the view that modal statements merely describe possibilities or necessities, the expression of certainty could arise indirectly as a pragmatic side effect. For instance, by uttering (19a), the speaker would describe a necessity relative to facts that she believes, and in virtue of this, she would indicate her certainty about Jo's guilt.
(19) a. Jo must be guilty.
b. Jo is guilty.
A complication arises with epistemic necessity statements. A sentence like (19a) seems to make a weaker claim than its unmodalized counterpart (19b), that is, the use of (19b) seems to require more confidence or better evidence than the use of (19a) (Karttunen, 1972). This fact is puzzling given a standard account of epistemic necessity like the one sketched in Section 8.2. (19a) should entail (19b): if Jo is guilty in all worlds compatible with the relevant known facts, then she should be guilty in our world, which is consistent with these facts. This apparent weakness of must statements requires amendments to the standard view (see Section 8.5.2.2), but it can be straightforwardly captured under a performative account, if epistemic modals encode as part of their conventional meaning a weaker commitment from the speaker than a bare assertion.
8.5 Decomposing modal sentences: modals in the lexicon So far, we have focused on the meaning contribution made by entire modal statements. This section focuses on the contribution of the modal words themselves. Recall that modal sentences vary along two dimensions: force and flavor. In a language like English, force remains constant, but the same sentence can be used to express different modal flavors. How does flavor multiplicity arise? Is this a case of generality (one general possibility meaning unspecified for flavor) or ambiguity (the same string of words corresponds to distinct senses)? And what do the modal words themselves contribute?
Before we turn to these questions, let's briefly discuss languages in which the same modal can be used in situations where English speakers would use either a possibility or a necessity modal. Are such modals indeterminate between a possibility and a necessity meaning, or do they represent genuine cases of ambiguity? Neither option seems likely. While proposals for such "variable force" modals differ for various languages, all converge on providing the underlying modal with a stable meaning (either possibility or necessity, or something more complicated), with speakers making do by using their modal(s) in a wider range of situations than speakers of languages with modals of both forces.
8.5.1 Meaning indeterminacy Some linguistic expressions can be used in very different situations. This is due sometimes to genuine ambiguity, where two distinct thoughts are expressed by the same sounds, and sometimes to generality, where a single, general, thought is expressed, which is compatible with multiple situations. The words pen (writing instrument) and pen (animal enclosure) present a classic example of lexical ambiguity (two distinct lexical entries), the word teacher one of generality (one single lexical entry). Evidence for this distinction comes from Zwicky and Sadock’s (1975) Identity of Sense tests, which rely on conjoined structures, as in (20a) and (21a), and elided structures, as in (20b) and (21b). (20) a. Jo and Al have beautiful pens. b. Jo has a beautiful pen. Al does too. (21) a. Jo and Al are teachers. b. Jo is a teacher. Al is too. The sentences in (20) require that Jo and Al have the same sorts of things: either they both have beautiful writing implements or they both have beautiful animal enclosures. A “mixed reading” is impossible. This indicates that pen does not express a single general sense that applies equally to Bic pens and pig pens but is rather ambiguous between two senses. Meanwhile, while the sentences in (21) require that both Jo and Al be teachers, they permit them to teach different subjects. Thus, teacher is not ambiguous, but expresses a single sense that is simply neutral as to the subject of teaching. Polysemy shares characteristics with both generality and ambiguity. A polysemous term has several senses, somehow related to each other. For instance, we can use “book” to refer to a physical object or to its content. Is this because “book” has a single general sense that applies equally to both? If it were, why wouldn’t two copies of Moby Dick count for three books, two paperbacks plus their shared content? Perhaps we should instead say that “book” has multiple senses, differing in their extensions. Unfortunately,
Logic and The Lexicon 163 it is tricky to demonstrate that a term is polysemous, rather than ambiguous or general, using Zwicky and Sadock’s tests. Often a plausibly polysemous term, like an ambiguous term, will forbid mixed readings: “Jo and Al have beautiful books” can’t seem to describe a situation where Al has a beautiful physical object, and Jo one with beautiful content. But sometimes a mixed reading is permitted, even with the very same word: “The book is heavy but informative” (Liebesman and Magidor, 2017).5 When we apply these tests to a modal statement, we see that a mixed reading seems disallowed, suggesting that its meaning is not general. In (22) for instance, the modality is either epistemic or deontic for both individuals: either Jo and Al are both required to eat meat, or they are both likely meat-eaters. The sentences cannot be used to mean that one is a likely meat-eater and the other is required to eat meat. (22) a. b.
Jo and Al must eat meat. Jo must eat meat. Al too.
This suggests that must statements are ambiguous between modal flavors, and do not merely describe a general, unspecified, necessity. But what is the source of the ambiguity? Is the word must itself ambiguous, is it polysemous, or is the source of the ambiguity external to the word? Linguists from a functionalist tradition6 tend to assume that modals are polysemous, and thus come in multiple senses (e.g., Sweetser, 1990). But in her seminal account of modality, the formal semanticist Angelika Kratzer argues that there is just one must and just one may, and that the ambiguity arises from additional elements involved in modal statements, either other parts of the sentence, or interactions with the context of speech.
8.5.2 Lexical ambiguity? 8.5.2.1 Against lexical ambiguity A modal like must can be used to express different flavors of modality, including epistemic and deontic ones. Does this mean that we should have as many lexical entries (or senses) for must as there are flavors? Kratzer (1977) argues that the problem is made worse by the fact that each flavor comes in many subflavors. The sentence in (23), for instance, can describe obligations relative to different rules: family rules, rules governing school, cities, or entire countries... How many distinct musts should we have? (23) Children must wear seatbelts. Kratzer argues that we further need an extra neutral must, for cases like (24), where the flavor seems to be determined by an overt phrase. But once we have a neutral must, 5 For different views on polysemy, see Carston (2002), Pietroski (2005), Asher (2011), and Vicente (2018). 6 Functional linguistics contrast with formal approaches in aiming to explain grammatical facts from the way language is used.
164 Valentine Hacquard Kratzer contends, why not assume that there is just this one neutral must, which gets its flavor from another expression in its grammatical context, either explicitly, as in (24), or implicitly, as in (23)? (24) According to DC law, children must wear seatbelts. A further argument against lexical ambiguity is the fact that multiplicity of flavors is not just a quirk of English, or even Indo-European, as it can be found in many unrelated languages. If this is a lexical accident, why should it occur in language after language? A polysemist might retort that unlike in the case of pen, the different senses a modal can express are related: all express some kind of necessity. There may thus be natural reasons for this accident to keep on occurring (functionalists invoke notions like metaphorical extension, for instance). All else equal, however, the Kratzerian view is more parsimonious. But languages may not abide by the Kratzerian ideal, and parsimony may not be maintainable against the full empirical picture. We turn to the Kratzerian account and empirical arguments that threaten its viability next.
8.5.2.2 Kratzerian theory According to the classical Kratzerian theory (Kratzer, 1981, 1991), there is just one must, unspecified for flavor. Flavor gets determined by a restriction, called “conversational background,” which provides the set of worlds the modal quantifies over, with different restrictions giving rise to different flavors. This conversational background can be supplied by context when not overt. An epistemic conversational background, for instance, provides a set of worlds compatible with what is known in our world. The slightly simplified lexical entries in (25) illustrate. Here p is the proposition expressed by the modal’s prejacent, and f(w) picks out the set of worlds determined by a conversational background f at a world w. The meanings differ just in the force of quantification over this set: universal quantification for must, and existential quantification for may. We can see how in this system, there is no lexical ambiguity: we have just one must and just one may, neither of which specified for flavor. (25) a. At a world w, “must” names a relation between propositions p and backgrounds f, true just when p is true in all worlds w’ compatible with f(w) b. At a world w, “may” names a relation between propositions p and backgrounds f, true just when p is true in some world w’ compatible with f(w) Kratzer’s theory not only explains flavor multiplicity without invoking ambiguity,7 it also overcomes empirical challenges that the original quantificational analysis faced.
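In the usual semantic notation, and keeping to the simplified entries in (25) where f(w) is treated directly as the set of worlds the background makes available, the two entries can be restated as follows (a notational paraphrase of (25), not an additional claim):

\[ [\![\text{must}]\!]^{w}(f)(p) = 1 \ \text{ iff } \ \forall w' \in f(w):\ p(w') = 1 \]
\[ [\![\text{may}]\!]^{w}(f)(p) = 1 \ \text{ iff } \ \exists w' \in f(w):\ p(w') = 1 \]

On this formulation the lexical entries differ only in quantificational force; everything that distinguishes epistemic from deontic uses is carried by the argument f.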
7 Formally, a conversational background is a function from worlds to sets of propositions (e.g., propositions that describe some known facts).
Logic and The Lexicon 165 Seeing this requires introducing a further complication in Kratzer’s system. In the full system, modals are relative not to just one, but two conversational backgrounds. This double relativity is motivated in part by puzzles of deontic modality. Imagine that Jo has committed a crime, for which the law requires that she go to jail. We can report this with the deontic necessity statement in (26), which should be true if Jo goes to jail in all worlds compatible with the law. But how could Jo have committed a crime in these worlds? Such worlds are supposed to be crime free! Kratzer argues that we can solve this conundrum by separating facts from ideals. Jo has committed a crime: this is an irrevocable fact. We can nonetheless invoke worlds that best fit the law among those imperfect worlds in which this crime has occurred: in all such worlds, Jo goes to jail. (26) Jo must go to jail. To implement this separation of facts and ideals, Kratzer proposes to make modals relative to two conversational backgrounds. The first, called the modal base, is based on facts: it picks out an initial set of worlds in which certain facts hold (for instance, the fact that Jo committed a crime). The second, called the ordering source, is based on various ideals, such as laws, needs, or desires. The modal ends up quantifying over a subset of the initial set of worlds, namely those that best fit the ideal (for instance, those that best obey the law). This double relativity provides a solution to Karttunen’s puzzle, mentioned in Section 8.4.2. Recall that an epistemic necessity statement like (19a) intuitively feels weaker than its unmodalized counterpart (19b). However, a necessity modal should quantify over all worlds compatible with the known facts, including our world, thus (19a) should entail (19b). Kratzer argues that the entailment fails to go through because the modal doesn’t necessarily quantify over all epistemic worlds. The modal takes an epistemic modal base, which does pick out all of the worlds compatible with the known facts, including our world, but it also involves a stereotypical ordering source, which pares down this set of worlds to only those that best fit stereotypical expectations. Our world may not be stereotypical, thus (19a) need not entail (19b). Double relativity also allows for some graded notions of possibility. Recall that in our initial quantificational analysis, modals either quantified over some or all relevant worlds. We thus couldn’t distinguish between slight and regular possibility, or strong and weak necessity. Double relativity allows worlds to be ranked according to some ideal, which in turn allows for talk about better or worse possibilities.8
One problem with this purely contextual account of flavor determination is that nothing about a set of propositions makes it inherently epistemic vs. deontic. Consequently, whatever meaning differences we find between a modal like might (which is typically only associated with epistemic and metaphysical uses) vs. one like can (which is typically only associated with root uses) can't be attributed to a difference in value of the conversational background as determined by context (Nauze, 2008; Kratzer, 2012; Harr, 2019).
8 Some of the finer-grained gradability that modal adjectives in particular can express may require more than double relativity; for an overview, see Lassiter (to appear).
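The division of labor between modal base and ordering source described above can be made concrete with a small computational sketch. This is an illustration under simplifying assumptions, not Kratzer's own definition: in particular, worlds are ranked here by counting how many ideal propositions they satisfy, whereas Kratzer's ordering is a partial order, and the toy worlds are invented.

```python
# Illustrative sketch of double relativity: the modal base supplies the worlds
# compatible with certain facts; the ordering source ranks them by how well they
# fit a set of ideal propositions; the modal quantifies only over the best ones.

def best(worlds, ideals):
    # Simplification: score each world by the number of ideals it satisfies.
    score = lambda w: sum(1 for p in ideals if p(w))
    top = max(score(w) for w in worlds)
    return [w for w in worlds if score(w) == top]

def must(prejacent, modal_base, ordering_source):
    return all(prejacent(w) for w in best(modal_base, ordering_source))

# Toy version of (26): every world in the modal base records Jo's crime.
w1 = {"crime": True, "jail": True}
w2 = {"crime": True, "jail": False}
modal_base = [w1, w2]
law = [lambda w: not w["crime"],                   # ideally, no crime occurs
       lambda w: (not w["crime"]) or w["jail"]]    # if there is a crime, there is jail

# No world in the base is crime-free, but w1 best fits the law, so (26) comes out true.
print(must(lambda w: w["jail"], modal_base, law))  # True
```

The same separation is what allows an epistemic modal base to combine with a stereotypical ordering source in the response to Karttunen's puzzle sketched above: the modal then quantifies over the most stereotypical of the epistemically accessible worlds, which need not include our own.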
8.5.2.3 Lexical ambiguity after all? In the last few decades, new empirical data has been brought to light, arguing that the different flavors expressed by a modal like must cannot share the same lexical entry. First, as discussed in Section 8.4.2, modals can be used to perform different speech acts: deontics to give orders or permissions, epistemics to indicate a degree of certainty. The various accounts that derive these speech acts by encoding a performative dimension to the modal words themselves necessarily assume lexical ambiguity, given that the speech act performed differs by flavor. However, this challenge may be easy to skirt, if these speech acts can be derived in a more pragmatic way, as discussed earlier. A speaker using a deontic must sentence, for instance, would literally merely describe a deontic necessity, but could indirectly issue a command in virtue of counting on her audience to understand why she is describing what is necessary: namely, that she endorses the rules that underwrite it and wants her audience to comply with them. A second and perhaps more worrisome concern for the Kratzerian view is that cross- linguistically, modals tend to interact differently with elements like tense, depending on the flavor they express.9 To see this in English, we need to turn to the semi-modal have to, which, unlike modal auxiliaries, can be tensed. Consider (27). With a deontic interpretation, (27) expresses a past obligation: Jo was required to be home. With an epistemic interpretation, however, the sentence can express a current necessity about a past state of affairs: it is necessary, in virtue of what we now know, that Jo was home. (27) Jo had to be home at the time of the crime. Not only can had to report a present epistemic necessity, it may have to, unlike a past tensed verb like seemed. Consider the scenario in (28). (28a) reports a past seeming state, which no longer holds. But if we try to make roughly the same point with (28b), it sounds distinctly odd. This suggests that it can only report a current epistemic state, which conflicts with the continuation denying that Jo was home.10 (28) Earlier this week, the available evidence pointed us in the wrong direction. For instance, a. while Jo seemed to be home at the time of the crime, we now know that she wasn’t. b. ??while Jo had to be home at the time of the crime, we now know that she wasn’t.
9 This is part of a more general pattern: epistemic and root modals differ in their interactions with aspect, negation, quantifiers, and other modals. For an overview, see Hacquard (2011).
10 Putting the modal in an adjunct helps sharpen the contrast (A. Williams, p.c.), see van Dooren (2020) for experimental support.
Logic and The Lexicon 167 These facts are usually captured as a matter of scope, that is, the position where the modal has to be interpreted relative to elements like tense.11 Epistemic modals scope above tense; hence, their time of evaluation is not affected by a past tense that appears in the same clause. Deontic (and other root) modals, on the other hand, scope below tense, hence their evaluation time is shifted by a past tense (Picallo, 1990; Stowell, 2004; Hacquard, 2006; a.o.). This scope difference goes against the Kratzerian view: if deontic and epistemic modals scope in different positions how can they share a lexical entry? We may need (at least) two separate musts, each specified for flavor and scope. There are ways to maintain the Kratzerian view by complicating how the modal combines with its restriction: there would still just be one neutral must, which could freely appear above or below tense, but different modal bases would be available in different positions (Hacquard, 2006, 2010; Kratzer, 2012). But now, parsimony might argue for ambiguity. This apparent advantage however seems to diminish when we consider issues of learnability in the next section. On any analysis of modals, the child needs to recognize the association between flavor and scope. Since this association is not an idiosyncratic fact about each individual modal, it seems wrong to encode it via lexical stipulations. Thus, in the end, the Kratzerian view might on balance be the simplest.
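The scope generalization can be schematized for the two readings of (27) as follows (a simplified rendering for orientation, not the chapter's own notation):

\[ \text{epistemic: } \text{MODAL}_{\text{epis}} > \text{TENSE}_{\text{past}} > [\text{Jo be home}] \qquad \text{root: } \text{TENSE}_{\text{past}} > \text{MODAL}_{\text{root}} > [\text{Jo be home}] \]

The first configuration yields a present necessity about a past state of affairs; the second, a past obligation.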
8.6 Learnability issues Children acquiring the modals of their language presumably face the same challenges as semanticists trying to figure out their meanings. The first challenge we discussed, namely the difficulty in untangling the semantic and pragmatic contributions of modal statements is further amplified for children, who are only ever exposed to speaker meanings. How do they extract the literal content of speakers’ modal claims? Might they have certain expectations about how modal meanings are packaged in natural language? The second challenge comes from the fact that the same modal word can express different flavors, but that these flavors interact differently with tense. Children not only need to figure out that a word like have to can express epistemic and deontic necessity but also that with an epistemic flavor (but not a deontic one), it outscopes tense. What prevents them from assuming that have to uniformly scopes below tense, just like any 11 To
be precise, the generalization is not strictly speaking that had to can’t report a past epistemic necessity. “I thought that Jo had to be home” clearly reports a necessity at a past thinking time. In that sentence, the past tense on the main verb, “thought,” introduces the relevant past time, and the necessity is simply concurrent with the thought. The question is whether the modal can scope under the tense in its own clause (not the tense of a higher clause, like the past tense on thought). And this is a matter of debate. Some argue that it must, and question data like (27) and (28) (e.g., Rullmann and Matthewson, 2018). Others argue that it cannot, and that counterexamples involve special contexts, which introduce higher operators responsible for shifting the modal’s evaluation time (e.g., Stowell, 2004; Boogart, 2007; Hacquard, 2006).
tensed verb? Positive evidence for epistemics scoping over tense is virtually absent in speech to children: epistemic modals are infrequent, and almost always occur in the present tense.12 It can't just be a matter of notional meaning (e.g., an incompatibility between epistemic meanings and pastness), since predicates with epistemic meanings like know, seem, or be likely all happily scope below tense. These scope facts create problems for both the classical Kratzerian account and for a pure lexical ambiguity account, the former because it cannot tie particular flavors to particular scopal positions, the latter, because if scope is a lexical idiosyncrasy of two homonyms, it is one impossibly hard to detect. There seems to be something special about grammatical modals that goes beyond the meanings they express, which dictates their scopal behavior, and which learners have to somehow be privy to. Cinque (1999) proposes that functional elements (which is roughly to say closed class or grammatical elements), like tense and modals, are rigidly and universally organized in a particular order, with epistemic modals above, and root modals below tense. Because Cinque's hierarchy proposes that different flavors of modals occupy different positions, it fits well with ambiguity proposals where different lexical entries for different flavors can occupy different positions, and a learner equipped with something like Cinque's hierarchy might expect a functional (i.e., grammatical) epistemic modal to scope above tense, unlike a verb with an epistemic meaning. Another possibility, alluded to at the end of Section 8.5, is to tie modal bases to particular positions, rather than the modals themselves. If that is right, the spirit of the Kratzerian view might be maintainable. Either way, there is something particular to grammatical modals that allows them to express different flavors of modality but constrains their scopal behavior based on the flavor expressed, and which both children and semanticists need to figure out. (For more on the acquisition of logical vocabulary, see Crain, this volume.)
8.7 Conclusion Languages have various means of expressing notions of possibility and necessity. A language like English has a dedicated grammatical category of modals—words like can or must, alongside verbs, adverbs, nouns, and adjectives that express similar notions. Because modal statements involve a complex interplay of morphology, syntax, semantics, and pragmatics, isolating the exact lexical contributions of the modal words themselves presents challenges for both semanticists and children acquiring their language. The first is that it isn’t always easy to untangle the semantic and the pragmatic contributions of modal statements, given that modals are routinely used to perform
12 Only one epistemic had to was found out of 2,400 occurrences of have to in maternal speech in the Manchester corpus (339,795 utterances) (Theakston, Lieven, Pine, and Rowland, 2001), analyzed in van Dooren, Dieuleveut, Cournane, and Hacquard (2017).
illocutionary acts that go beyond mere descriptions of possibilities and necessities. The second is that the same words can be used to express different flavors of modality, raising questions as to whether each of these words comes in different lexical entries in our mental lexicon. However these questions ultimately get resolved, modals provide a rich terrain to explore how meaning gets packaged in natural language.
Acknowledgments
Many thanks to Alexander Williams, Anouk Dieuleveut, Anna Papafragou, and one anonymous reviewer for helpful feedback.
Part IC
INTERFACES AND BOUNDARIES
Chapter 9
Pragmatics and the Lexicon
Florian Schwarz and Jérémy Zehr
9.1 Introduction Pragmatics, as the study of meaning in language use, is concerned with the ways that contextual information affects the overall conveyed meaning of an utterance. For example, whether (1) is uttered out of the blue or in response to a preceding claim that Nina swims influences whether one is more or less likely to interpret it as communicating that Lola does not swim. (1)
Lola runs.
While to some extent the term run itself may seem to be associated with an inherent opposition to swim, this opposition clearly is not part of its core meaning—after all, one can perfectly well claim that Lola runs and swims without contradiction. Furthermore, as we just saw, the presence and salience of this opposition is clearly modulated by context (here, a preceding utterance—or its absence). The present chapter reviews how contextually driven inferences, broadly construed, interact with the encoding of meaning at the lexical level. Our starting point is a fairly standard view of divvying up meaning into semantic and pragmatic components: semantics is primarily concerned with what we alluded to as "core" meaning above, while pragmatics is concerned with contextual reasoning. In light of this view on the semantics-pragmatics interface, a central research question is whether (and to what extent) a given ingredient of meaning associated with the use of a particular expression in context should be seen as directly encoded in the lexicon, or whether (and to what extent) it is derived via general reasoning in context. Perhaps unsurprisingly, the answer to this question is oftentimes far from clear-cut, and the details of just which bits are conventionally encoded and which are
174 Florian Schwarz and Jérémy Zehr pragmatically inferred are delicate and controversial. Our discussion will focus on key phenomena that serve as case studies to illustrate the task of spelling out the division of labor between semantics and pragmatics in full detail. First, we turn to phenomena that crucially are construed as relating to scales, namely scalar implicatures and scalar adjectives. Underlying the notion of scale is the generic process of comparison; as we will see in our discussion of these two phenomena, context and lexical encoding both fundamentally contribute to the identification of the objects being compared, and to the structures they define in that process. The next case study turns to presuppositions, which introduce another set of issues with regard to the source of ingredients of meaning. On the one hand, presuppositions are inherently contextual, as they integrate the conversation’s participants’ beliefs in the discourse; besides, while traditional theoretical accounts of meaning model with a certain accuracy how conventionally encoded meaning is computed, they fail to model characteristic behaviors of presuppositions. On the other hand, presuppositions seem to be directly tied with particular lexical items, suggesting a conventional association. Presuppositions thus constitute a perfect example of the interaction between context and conventionally encoded content at the lexical level. But before turning to these detailed case studies, the following section reviews the theoretical background and assumptions that form the basis for the underlying view of the semantics-pragmatics interface in more detail.
9.2 Background: constructing meanings A core notion that is the basis of our discussion is that the overall meaning intuitively conveyed by uttering a given expression is a conglomerate of distinct ingredients that can be separated out based on their properties, and which may have different sources. As sketched for (1) above, some aspects of meaning are thought to be conventionally associated with particular lexical items (e.g., that Lola runs), while others (e.g., that Lola doesn’t swim) can enter the picture via contextual reasoning that enriches the conventionally encoded meaning. Putting aside further complications for a moment (of which there are many), the former constitute the realm of semantics, whereas the latter fall into the purview of pragmatics. Let us explicate the key terms here in a bit more detail: conventional encoding of meaning refers to the arbitrary links any language relies on between expressions in the lexicon and their meaning, which have to be learned. The notion of context includes both the physical utterance context as well as the linguistic and discourse context, including assumptions about the communicative purposes and issues at stake. Reasoning about utterances in context encompasses considering their “core” meaning in relation to aspects of the context, which then can lead to inferring additional meaning beyond the core parts to enrich what is taken to be the speaker’s intended message.
Pragmatics and the lexicon 175 As we noted above, a—and perhaps THE—central theoretical question for this perspective on the semantics-pragmatics interface is to determine which aspects of meaning have to be seen as conventionally rooted, and thus encoded in the lexical entries of particular expressions, and which can be accounted for without such encoding. Two guiding principles that commonly underlie deliberations in this regard, though often just implicitly, are the following: First, meanings should only be conventionally encoded in lexical entries if accounting for related phenomena forces us to, that is, positing conventional encoding is subject to Occam’s razor. Second, if a given bit of meaning is present in some contexts but not in others, that provides an argument against encoding it conventionally. In other words, conventional aspects of meaning are those that are stably present across contexts, and which cannot be explained in terms of general reasoning. In much of this chapter, we will be concerned with phenomena that seem prima facie pragmatic, but which have been argued to require conventional encoding of at least some related ingredients to adequately capture the empirical facts. We will complement our theoretical discussions with insights from empirical investigations, which systematically reveal a more gradual picture than expected under a radical convention- vs.-context split approach. We will see that many ingredients of meaning, while failing to display a stable presence across contexts, cannot seem to be fully accounted for without assuming further conventional encoding at the lexical level after all.
9.2.1 The interplay of content and context: basic illustrations The need for integration of conventional content and contextual information is pervasive. In many instances, both are needed, even in just accounting for the interpretation of individual expressions. Consider indexicals such as first and second person pronouns: what is the meaning of, say, ‘I’? In one sense, it crucially has a stable meaning that systematically contributes a referent to the interpretation of utterances it occurs in. But just who the contributed referent is, of course, not at all constant across utterances, as ‘I’ will pick out the speaker of whatever utterance we are considering. The standard solution is to assume a conventionally encoded meaning that crucially incorporates contextual information: The referent introduced by ‘I’ is conventionally fixed to be the speaker of the utterance, where information about interlocutor-roles, such as speaker (as well as addressee, for second person), is assumed to be directly supplied by (a suitable representation of) the context.1 Another class of expressions that can, on certain interpretations, draw on the context to fill in their lexically encoded content are third person pronouns, such as ‘she.’ These
1 Parallel
effects can, of course, be observed for various expressions relating to other aspects of the context, for example, the time (‘now’) and place (‘here’) of utterance.
differ from purely indexical pronouns in that their semantic values can depend on another expression in the linguistic context, as (2) illustrates:
(2) a. Lola said that she runs.
b. Every girl said that she runs.
Identifying what ‘she’ stands for here requires consideration of the semantic value(s) of another expression in the sentence. Besides the “co-referential” and “bound” readings in (2), ‘she’ also has an indexical reading, where it refers to a single contextually salient individual.2 Various other expressions display parallel types of context-sensitivity, including a wide range of phenomena, such as domain restriction for quantificational expressions (where, say, ‘every student’ can be contextually understood to effectively mean, for example, ‘every student in this class’; Westerstahl, 1984; von Fintel, 1994) and standards of comparison for gradable adjectives like ‘tall’ (which can differ when we’re talking about preschoolers vs. basketball players; see Section 9.4.2.2 below), and even the semantics of tense (e.g., when ‘I didn’t turn off the stove’ seems to refer to some particular contextually salient time where this happened; Partee, 1973). While the types of cases considered so far come with many intricacies of their own, it is fairly clear what aspects of their interpretation come from context and what aspects have to be conventionally encoded, at least at a basic level (there’s an abundance of technical issues for any full-fledged formal theoretical account integrating contextual information into the interpretation process). Our main focus in the following will be phenomena that are best discussed in the context of the meanings of entire sentences, and the contributions that individual words make to them.
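As a purely illustrative rendering of the point that an indexical's conventional meaning is stable while its referent is supplied by the context, consider the following sketch; the Context record and the names in it are invented for the example.

```python
# Illustrative sketch: the conventional rule for 'I' is fixed, but it consumes a
# contextual parameter, so the referent varies with the utterance context.

from dataclasses import dataclass

@dataclass
class Context:
    speaker: str
    addressee: str
    time: str
    place: str

def interpret_I(c: Context) -> str:
    """'I' conventionally denotes the speaker of the context of utterance."""
    return c.speaker

def interpret_you(c: Context) -> str:
    """'you' conventionally denotes the addressee of the context of utterance."""
    return c.addressee

c1 = Context(speaker="Lola", addressee="Al", time="noon", place="Philadelphia")
c2 = Context(speaker="Al", addressee="Lola", time="noon", place="Philadelphia")

print(interpret_I(c1), interpret_I(c2))  # Lola Al -- one rule, different referents
```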
9.2.2 Compositional semantics for sentence meanings In order to talk about pragmatic aspects of sentence meanings, we need at least a rough sketch of the general framework for deriving sentence meanings from the lexical meanings assumed for individual expressions (for a related overview on the lexicon and meaning composition, see Piñango, this volume). In the framework we’re adopting, sentence meanings are construed in terms of truth conditions. As the first sentence of an influential formal semantics textbook states: “To know the meaning of a sentence is to know its truth conditions” (Heim and Kratzer, 1998, p. 1). We hasten to note that this does not preclude the existence of other dimensions of “meaning,” which we focus on below. It simply acknowledges as basic the capacity of speakers to differentiate between
2 Some languages distinguish these readings by using different forms. Note that there also are languages that allow certain bound readings of their first-or second-person pronouns, making the distinctions we’re drawing here for illustration more subtle.
Pragmatics and the lexicon 177 situations in the world that fit the description provided by a sentence and those that do not. The truth conditions of a sentence derive from the semantic contributions of the words in the sentence: ‘Lola runs’ and ‘Lola swims’ are true in different situations because ‘run’ and ‘swim’ describe different activities. The core task of formal semantics is to provide a theory of how the meanings of sentences (and more generally, any complex, multi-word expression) can be derived using the fundamental principle of compositionality (Frege, 1892): the meanings of complex expressions are a function of the meanings of their parts and the syntactic structure they occur in. For example, the semantics of predicates like ‘run’ is modeled as a function from individuals to truth- values, which will return true if the individual of whom it is predicated is a runner (more sophisticated variants will instead characterize this in terms of the individual having to be the agent of a running event).
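In the standard notation of this framework (a textbook-style restatement rather than anything specific to this chapter), the predicate denotes a function from individuals to truth values, and the sentence meaning results from applying that function to the subject's denotation:

\[ [\![\text{runs}]\!] = \lambda x.\ 1 \text{ iff } x \text{ runs} \qquad [\![\text{Lola runs}]\!] = [\![\text{runs}]\!]\big([\![\text{Lola}]\!]\big) = 1 \text{ iff Lola runs} \]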
9.2.3 Gricean reasoning With such a basic semantic framework in place, we can now consider what is conveyed by utterances of sentences in context more broadly, and turn to our central question: which aspects of the overall conveyed meaning have their source in the conventionally encoded content of the expressions involved, and which are derived from contextual information and domain general reasoning about the utterance? To start with a simple case where it’s easy to discern and illustrate the sources of different aspects of the meaning that’s conveyed, consider the following (adapted from Grice): (3) Context: A driver stops to talk to a pedestrian. A: Excuse me, I’m out of gas. B: There’s a gas station around the corner. Among the things conveyed by B are the literal meaning that a gas station is located around the corner, as well as additional notions about it being functional and open for business (or at least B not being aware of any facts to the contrary). Intuitively, this seems to ride on the idea that if B thought the gas station was not in operation or closed at this time, they should have said so or simply not mentioned that gas station at all. The most influential account of these types of inferences is due to seminal work by Grice (1975). It is based on a view of conversation as a cooperative endeavor among rational interlocutors. His overarching Cooperative Principle, which assumes discourse participants to be cooperating in pursuing a common cause, is more concretely reflected in the Gricean Maxims: the maxim of quality, requiring speakers to be truthful; the maxim of quantity, requiring speakers to provide appropriate amounts of information; the maxim of relevance, requiring utterances to relate to the current discourse goals; and the maxim of manner, requiring appropriately brief, transparent, and well-ordered utterances. These maxims, which are intended as descriptive characterizations of principles that actually
178 Florian Schwarz and Jérémy Zehr guide interlocutors (rather than prescriptive rules), commonly interact with one another, so that pragmatic inferences often arise by striving for a balance between them. For example, if B in (3) knew that the local gas station was closed, the answer would still have been true (and thus aligned with the maxim of quality), but the mere existence of a gas station is not enough to make B’s contribution relevant. The pragmatic interpretation of the existential utterance can be explained in terms of Relevance and Manner: B takes the existence of a gas station to be relevant to the discourse situation, so it must be that the gas station is open and serves gas, but chose not to make that explicit for the sake of brevity. In addition to these highly contextual inferences that are not very closely tied to particular lexical entries, a host of other inferences are quite generally associated with connectives and other logical expressions. For illustration, take the common temporal inference associated with ‘and’: (4)
John stepped out of the house and changed out of his pajamas.
The described circumstances seem unusual, given standard societal conventions, because what seems to be conveyed is that the two events occurred in the order described, and John thus changed clothes out on the street. This effect would be puzzling if all we have to work with is the conjunction of statement logic, which is fully symmetric, so that ‘p and q’ cannot possibly have a different meaning than ‘q and p.’ Rather than giving up on the logical analysis of ‘and’ for purposes of meaning composition, Grice’s solution supplements it with a system for deriving additional pragmatic inferences that account for its actual use. In particular, it attributes the order effect to reasoning about the maxims, as adherence to the maxim of manner will generally give rise to the conclusion that a narrated sequence of events occurred in the order they were introduced in (barring any indications to the contrary). In the context of our discussion, a Gricean view allows for a minimal conventionally encoded content for “and” (the logical conjunction), while reasoning about its use in context can add to its overall meaning contribution. For historical context, it is worth noting that Grice’s account of cases like these played a pivotal role in reconciling two opposite approaches in the philosophy of language at the time: one analyzing meaning in natural language using logic, the other, ordinary language philosophy, focusing on language use. Thereby, Grice defended logical approaches to semantic meaning, while independently accounting for systematic aspects of language use. This general approach created the basic framework of encoding certain aspects of meaning in the lexical entries of particular expressions (such as the logical core of conjunction), while deriving other aspects (such as the order effect above) in independent ways, typically in terms of reasoning about language use in context. In the next section, we turn to the perhaps most influential, and most studied, phenomenon in this realm, namely that of scalar implicatures. These are of particular interest to our current discussion for at least two reasons. First, scalar implicatures embody the Gricean idea that the conventional content of a lexical entry can include less than what might seem to be its overall contribution to meaning at first sight. Second, scalar implicatures fundamentally involve comparing linguistic expressions; as we will see, this may require conventional association between lexical entries in some cases.
9.3 Scalar implicatures Both the first case study on scalar implicatures in the present section and the second one in the next section on scalar adjectives are fundamentally concerned with the notion of scales and the extent to which they may require lexical association of particular expressions with such scales. Formally speaking, a scale integrates several abstract objects (e.g., linguistic expressions or degrees) into an ordered set. Association with a scale allows for comparison with other elements of the scale; when these elements themselves are lexical entries, this brings into play the notion of an alternative. While there is a fair amount of—perhaps under-appreciated—overlap in issues related to the role of scales in scalar implicatures and scalar adjectives, and potentially fruitful connections can be drawn between them, we will largely present each set of phenomena in its own right, while highlighting potential parallels when discussing scalar adjectives.
9.3.1 The basic phenomenon Scalar implicatures have particular relevance for the central concerns of this chapter, as they are standardly seen as a pragmatic phenomenon involving competition between lexical entries, and thus invoking so-called alternatives. As the name suggests, these are taken to be ordered on a scale; we will elaborate this notion below. Following much of the discussion in the literature, let us illustrate scalar implicatures by looking at the quantifier ‘some’: (5)
Some of the guests already left.
Uttering this sentence intuitively conveys that there also are guests that are still there, that is, that not all guests left. This is commonly seen as the result of reasoning about the alternative possibility of making the statement in (6): (6)
All guests already left.
In principle, one could also consider the lexical entry for ‘some’ to encode a meaning equivalent to that expressed by ‘some but not all.’ However, there are many environments where these are not intuitively equivalent: (7)
a. If some of the guests already left, the party was a failure.
b. If some but not all of the guests already left, the party was a failure.
Clearly, the possibility considered in the ‘if ’-clause of the ‘some’ version includes the case where all of the guests already left (in which case the party would have been an
180 Florian Schwarz and Jérémy Zehr even greater failure, if anything). In contrast, the ‘some but not all’ version allows for the (contextually implausible) possibility that if all of the guests left, the party somehow wasn’t a failure. Another indication that the ‘but not all’ part of the meaning conveyed by (5) is not encoded in the literal meaning derived from its lexical parts is that it can be explicitly cancelled without giving rise to a sense of contradiction (8a)—in contrast to cancelling literal meaning, as in (8b). (8) Some of the guests already left. a. ✓In fact, everybody’s gone. b. # In fact, nobody’s gone. This contrast suggests that some inherently negates none as part of its lexical meaning, but the negation of all cannot be encoded on an equal footing, for otherwise (8a) should give rise to the same sense of contradiction as (8b). Rather, as we already alluded to, the negation of all obtains as a scalar implicature. Scalar implicatures arise when speakers do not make a more informative claim using an alternative sentence. Let us walk through our current example. Both (5) and its all alternative (6) are true if all of the guests left, but only (5) is still true if only a subset of the guests left. In other words, the situations that make (6) true are a proper subset of those that make (5) true. So from the point of view of a speaker who knows that, in fact, all of the guests left, using either (5) or its all alternative would count as a (technically) true statement, but from the point of view of a hearer ignorant of the facts, (5) would not be as informative, because its truth is not enough to conclude with certainty that all of the guests left. Therefore, an informed and cooperative speaker is expected to use (5), and not the more informative all alternative, only if they do not think that all of the guests left. That explains why, upon interpreting (5), we typically conclude that not all of the guests left.
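The informativity relation just described can also be made concrete with a small sketch (purely illustrative; the three-guest domain is invented): the situations verifying the all statement form a proper subset of those verifying the some statement, and the enriched reading keeps just the situations where the stronger alternative would have been false.

```python
# Illustrative sketch: 'some' vs. 'all' over a toy domain of guests. A situation
# records which guests left; the literal 'some' is 'at least one, possibly all'.

from itertools import chain, combinations

guests = {"Ann", "Bo", "Cy"}
situations = [set(s) for s in chain.from_iterable(
    combinations(sorted(guests), k) for k in range(len(guests) + 1))]

some_left = [s for s in situations if len(s) >= 1]   # literal meaning of (5)
all_left = [s for s in situations if s == guests]    # meaning of the alternative (6)

# (6) is strictly more informative: it is true in strictly fewer situations.
assert all(s in some_left for s in all_left) and len(all_left) < len(some_left)

# Scalar enrichment: exclude the situations where the stronger alternative holds.
enriched = [s for s in some_left if s not in all_left]   # 'some but not all'
print(len(some_left), len(all_left), len(enriched))      # 7 1 6
```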
9.3.2 Constraining alternatives
9.3.2.1 The symmetry problem
In light of the key role of reasoning over potential alternative utterances in the discussion above, it comes as no surprise that one of the central issues in the literature is just what alternatives are to be considered in the process. That this has to be constrained has been clear from early on, due to what has become known as the "symmetry problem": if not only (6) but also (9) were alternatives to (5), the desired derivation of the standard pragmatic interpretation of (5) would be lost.
(6) All guests already left.
(9) Some but not all guests already left.
After all, (9), too, is a more informative statement than (5) (assuming, as before, a ‘some and possibly all’ interpretation of ‘some’). So the hearer should apply the same reasoning
Pragmatics and the lexicon 181 to this statement as to (6), and conclude that (9), too, is false, that is, that either all guests already left, or none left yet. But in conjunction with the literal meaning of (5) (that some and possibly all guests left) this is inconsistent with the parallel inference based on (6) (namely, that not all guests left). Since there is no obvious way to choose between these inferences, this at best leaves us with a very weak overall interpretation, namely that some, and possibly all, guests left, while the speaker lacks evidence to assert either that all guests left or that only some guests left. But the standard observed interpretation is that the negation of (6) is inferred, whereas (9) does not seem to enter the reasoning at all. The conclusion is that we need a systematic way of excluding (9) from being considered as an alternative when reasoning about (5). And more generally, we need a systematic account of which alternatives are taken into consideration when evaluating a given sentence.
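The problem can be illustrated computationally as well (again with an invented three-guest domain): if both (6) and (9) counted as alternatives to (5), negating both on top of the literal meaning of (5) would leave no situation at all, which is why the two candidate inferences cannot both be drawn and only the much weaker interpretation described above would remain.

```python
# Illustrative sketch of the symmetry problem: negating both putative alternatives
# of 'some' ('all' and 'some but not all') is inconsistent with its literal meaning.

from itertools import chain, combinations

guests = {"Ann", "Bo", "Cy"}
situations = [set(s) for s in chain.from_iterable(
    combinations(sorted(guests), k) for k in range(len(guests) + 1))]

literal_some = [s for s in situations if len(s) >= 1]
alt_all = [s for s in situations if s == guests]
alt_some_not_all = [s for s in situations if 0 < len(s) < len(guests)]

remaining = [s for s in literal_some
             if s not in alt_all and s not in alt_some_not_all]
print(remaining)  # [] -- no situation survives if both alternatives are negated
```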
9.3.2.2 Alternatives and the lexicon
Seminal work by Horn (1972, and subsequent work) proposed that expressions like 'some' are associated with scales that provide the relevant alternatives. The perhaps most basic version of this proposal is that it is part of speakers' lexical knowledge that certain words are associated with scales, like <some, all>, and that the alternatives needed for the reasoning above are to be found on those scales. On this view, truth conditional meaning is not the only type of information that is stored in lexical entries—certain properties that play an essential role at the pragmatic level can also be encoded lexically. This notion of "Horn-scales" forms the basis of much discussion in the literature, although the theoretical underpinnings are far from clear, as highlighted by recent work homing in on the question of how precisely alternatives for a given expression are selected. While some early work (Gazdar, 1979, a.o.) put forward the notion that Horn scales are essentially "just given to us," this can hardly be the whole story: claiming that alternatives are retrieved from memorized scales that are lexically associated with a given expression immediately raises the question of how those associations came to form, and why those but not others. Thus, more systematic considerations on why certain expressions but not others are to be included on a scale for a given lexical entry seem to be called for. One necessary condition on alternatives was already implicit in the above: they must be related by strength in terms of an entailment relation. The reasoning based on the maxim of quantity spelled out above crucially considers logically stronger statements. But 'some but not all' is logically stronger than 'some,' as well. So there must be more to the sufficient conditions for being an alternative. One straightforward option to consider is that only lexical entries are candidates for being alternatives. This would rule out all more complex expressions, including 'some but not all.' However, there are cases where more complex expressions do seem to enter the process of scalar reasoning, for example, when the alternatives are explicitly present in the local discourse context and in so-called downward entailing environments (such as the restrictor of every; Ladusaw, 1980):
(10)
It was warm yesterday, and it is a little bit more than warm today. (Matsumoto 1995, p. 44)
(11)
Every day on which it was a little bit more than warm we went swimming. (based on parallel examples in Katzir, 2007)
In both cases, ‘warm’ and ‘a little bit more than warm’ seem to be considered as alternatives, as reflected in the inferences that yesterday was not a little bit more than warm (10) and that we did not (necessarily) go swimming every day that was (merely) warm, respectively (11). Katzir (2007) argues that parallel examples do not exist for ‘some but not all,’3 and that we thus need to differentiate the two cases. Matsumoto (1995), expanding on Horn (1972, 1989), formulates a monotonicity constraint on scales, such that all scalemates have to be positively or negatively scalar (for a formal definition of the relevant notions, see Sevi, 2005). (Note that this differentiates between (9) and (10).) However, as Katzir (2007) points out, the monotonicity constraint need not be met for cases like (13), where the non-monotonic expressions exactly three semanticists and John but not Mary seem to give rise to alternative-based reasoning: (12)
I doubt that exactly three semanticists will sit in the audience.
(13)
Everyone who loves John but not Mary is an idiot.
No doubt is conveyed here about the possibility of there being more than three semanticists (as encoded in the commonly assumed meaning of ‘three’ that is semantically equivalent to ‘at least three’); and loving John is not claimed to make one an idiot, as long as one also loves Mary. This suggests that three and John can indeed serve as alternatives to exactly three and John but not Mary. Based on these examples and some further considerations, Katzir (2007) proposes a formal definition of alternatives that, in addition to lexical substitutions, allows substitutions with constituents in the discourse context (making ‘a little bit more than warm’ an alternative for ‘warm’ in (10)), as well as with less complex substitutions (bringing ‘warm’ into play as an alternative in (11)). Several other authors have further extended complexity-based proposals for constructing structural alternatives (Katzir, 2007; Fox and Katzir, 2011; Trinh and Haida, 2015, a.o.). However, in any variant, issues of both under-and overgeneration remain, as reviewed by Breheny, Klinedinst, Romoli, and Sudo (2018). For example, as noted by Swanson (2010), not even all lexical alternatives seem to be considered in the relevant reasoning processes, in particular where meanings of the ‘some but not all’ kind seem to be lexically encoded, as is arguably the case for ‘intermittently’: (14)
a. The heater sometimes squeaks.
b. The heater intermittently squeaks.

3 The resulting sentences are strange, as Katzir notes as well, so one may want to be cautious in drawing too firm a conclusion here; see Katzir’s paper for discussion and his take on this.
Complexity-based approaches (without anything like the monotonicity constraint added in) predict (14b) to be an alternative to (14a), but there is no inference that the negation of (14b) holds—to the contrary, the observed inference (based on the negation of alternatives such as ‘constantly’) is precisely that (14b) DOES hold. Such lexical instances of the symmetry problem thus pose a serious challenge to these structural approaches to alternatives.
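To make the quantity-based strengthening mechanism discussed above concrete, the following minimal sketch (in Python) derives the implicated negations of stronger scalemates from a stipulated Horn scale. The scales and sentence frames in it are invented for illustration, and it implements only the simplest lexical-scale view, not the structural or contextual alternative-selection mechanisms just reviewed.

```python
# Toy sketch of quantity-based strengthening over lexically given Horn scales.
# The scales and sentence frames below are illustrative assumptions, not data
# from the chapter; the point is only to show the shape of the computation.

HORN_SCALES = {
    "some": ["some", "most", "all"],        # ordered weak -> strong
    "warm": ["warm", "hot", "scalding"],
    "or":   ["or", "and"],
}

def stronger_alternatives(word):
    """Return the scalemates that are logically stronger than `word`."""
    scale = HORN_SCALES.get(word, [word])
    i = scale.index(word)
    return scale[i + 1:]

def scalar_implicatures(frame, word):
    """Given a sentence frame like 'Lola ate {} of the cookies' and the
    asserted scalar item, return the implicated negations of the stronger
    alternative statements (assuming a cooperative, opinionated speaker)."""
    return [f"NOT({frame.format(alt)})" for alt in stronger_alternatives(word)]

if __name__ == "__main__":
    print(scalar_implicatures("Lola ate {} of the cookies", "some"))
    # ['NOT(Lola ate most of the cookies)', 'NOT(Lola ate all of the cookies)']
    print(scalar_implicatures("It is {} today", "warm"))
    # ['NOT(It is hot today)', 'NOT(It is scalding today)']
```

The sketch also makes the stipulative character of this view tangible: everything hinges on which scales are listed in the lexicon-like table, which is precisely the question raised in this subsection.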
9.3.2.3 Alternatives and context Another possible response to examples involving more complex alternatives is to maintain a notion of scales as involving only lexical alternatives and explain other examples in a different way, that is, consider lexical and contextual alternatives as distinct phenomena with distinct explanations. Whatever the best explanation for these may be, it is clear that the range of contextual alternatives is much broader than what we have considered so far. Consider the following examples of what has traditionally been considered a particularized (as opposed to generalized, or more specifically scalar) implicature (see Hirschberg, 1991, for extensive discussion on this). (15)
A: Bill is a swimmer. What does Lola do?
B: She’s a runner.
(16)
It was warm yesterday and it is warm and sunny with gusts of wind today. (Katzir, 2007, p. 687)
As discussed in the introduction, B seems to convey both that Lola (identified as the referent of ‘she’) is a runner and that she’s not a swimmer in (15). But of course, this is completely driven by the specific context at hand, and in no way generally associated with the lexical entry for ‘swimmer.’ Similarly (16) suggests that yesterday it was not sunny with gusts of wind, which again is not part of the lexical meaning of warm. Examples like these have often been characterized as a distinct case of their own, given that they depend on very specific aspects of the context. But note that (at least certain variants of) structural approaches will analyze these examples in the same way as more standard “scalar” implicatures, as the mechanism for generating alternative structures can draw on contextually present expressions, making warm and sunny with gusts of wind an alternative of warm in the context of (16). (Also note that they may need additional constraints to rule out the lexical alternative ‘swimmer’ when it is not present in the context.) Hirschberg (1991) put forth an alternative approach extending a more lexical, scale-based approach, by allowing for contextually constructed ad hoc scales. Another illustration of what can easily be seen as an ad hoc scale based on context and world knowledge is the following: (17)
Context: A left Los Angeles and is on their way to Seattle.
A: I passed Sacramento.
In the given context, passing Portland entails passing Sacramento in a way that passing Phoenix does not, and thus Portland qualifies as an alternative to Sacramento on the ad hoc scale <Los Angeles, Sacramento, Portland, Seattle>. Accordingly, A’s statement conveys that they have not yet passed Portland (while not conveying anything about passing Phoenix). Examples like these suggest that the alternatives considered in implicature reasoning need not stand in a semantic entailment relationship—purely contextual entailment (resulting from the combination of the semantic meanings at play with contextual assumptions) suffices for triggering the relevant comparisons with regard to the strength of the statements.
9.3.2.4 Taking stock Our discussion here has focused on perspectives that all share the assumption that the lexical entries of expressions like ‘some’ do not directly encode the stronger ‘some but not all’ meaning, but rather have a semantics amounting to the more general ‘some and possibly all’ meaning.4 Instead, they assume that the strengthened meaning comes about by considering alternative statements.5 The main dimension of divergence that we focused on here concerns the different ways in which these alternatives are selected, and how they relate to the lexicon. On traditional lexical approaches, the alternatives are directly hard-wired in the lexicon in the form of Horn-scales. Other approaches have a looser connection to the lexicon, as they merely see it as one source of alternative expressions for a more general mechanism, which also can take into account contextually salient alternatives. This extends the coverage to more context-based cases of particularized implicatures. Extensions of lexical approaches that try to capture the more contextual cases as well also loosen the connection to the lexicon by introducing the notion of ad hoc scales. Taking a step back, a key question that is at play in this area is precisely whether there are any lexically associated alternatives that have a special status distinct from those that are in play due to salience in the context. Most current accounts try to work with broad enough notions to accommodate both in one system, but it is also perfectly possible that there are more fundamental differences between the two types of cases meriting a separate treatment. The answer to that question more or less determines the extent to which this key ingredient for a central type of pragmatic reasoning requires specific lexical encoding of information that goes beyond core truth-conditional meaning. As things stand, this remains a genuinely open question that will need to be resolved in future work.
4 Other accounts, that is, the defaultist view (Levinson, 2000), don’t necessarily share this assumption, as they assume a more direct conventional association between the lexical entry and the ‘some but not all’ meaning, while still allowing for a ‘some and possibly all’ interpretation, perhaps via ambiguity. We are unable to go into further details on these for reasons of space.
5 Note that this is also true of so-called grammatical approaches, which don’t see the implicature-generating process itself as pragmatic, but rather as coming about through an exhaustivity operator akin to ‘only’ in the structure. Again, space reasons preclude us from discussing these in more detail; see Sauerland (2012) for an introductory overview.
9.3.3 Connections to experimental work On standard theoretical accounts, implicatures have two properties that have historically put them at the forefront of the emerging subfield on the experimental study of meaning: (i) they are a secondary type of meaning, taken to build on conventional meaning (the primary meaning); (ii) they are optional, in that contradicting them does not come with the red flag associated with contradicting conventional meaning. This can easily be mapped onto a cognitive model where conventional meaning is computed first and then, in an optional secondary step taking place in real time, contextual information is integrated into the interpretation process to derive the implicature interpretation (where warranted). There is a by now rich and extensive body of literature investigating the processing and acquisition of implicatures (see Chemla and Singh, 2014a, b, for an overview, as well as Crain, this volume, and Grigoroglou and Papafragou, this volume, for related discussion of issues concerning the role of logic and pragmatics in acquisition). Consistent with the simple cognitive model just laid out, early results were argued to point towards delays for implicature interpretation in both domains, with relatively slower access to implicature meanings in online processing and later availability in acquisition (Bott and Noveck, 2004). These results have been commonly interpreted as supporting the view that literal (e.g., ‘some and possibly all’) meanings have a primary status, presumably due to being lexically rooted, while implicature (e.g., ‘some but not all’) meanings have to be derived through additional pragmatic reasoning that takes time in online processing, and which has to be mastered separately by children in the acquisition process. The empirical picture has since grown substantially more complex, with various studies showing rapid access to implicature meanings in other contextual setups (e.g., Grodner, Klein, Carbary, and Tanenhaus, 2010) and earlier diagnoses of acquisition ages based on richer contextual support (e.g., Papafragou, Friedberg, and Cohen, 2018). While we cannot review this literature in greater depth here, it is worth noting that one key factor that has been discussed for variation in both processing speed and acquisition age is the contextual accessibility of the alternatives. For example, Katsos and Bishop (2011) argue that the apparent delays in acquisition are not necessarily due to an inability to carry out the relevant implicature reasoning, but rather should be attributed to issues with accessing the (correct) set of alternatives (also see Barner, Brooks, and Bale, 2011; Skordos and Papafragou, 2016). This interpretation moves the focus from mastering an independent high-level reasoning module to a mechanism that young language learners need to leverage when structuring the lexicon in order to arrive at mature in-context meanings. The experimental work referenced so far, however, has usually focused on a very limited set of scales (e.g., <some, all>). Another recent line of work has been looking at variation in the implicature effects across a wider range of expressions. For example, van Tiel, van Miltenburg, Zevakhina, and Geurts (2014) drew up a list of 43 pairs of alternatives and proceeded to experimentally measure how likely an implicature is to obtain for each pair. They identified two factors presumably affecting adults’ access
186 Florian Schwarz and Jérémy Zehr to alternatives: semantic distance (e.g., how much more intense is hot as compared to warm) and boundedness (e.g., certain representing an end-point as compared to likely). Those factors however do not define clear categories, as van Tiel et al. observed a very gradual variation in inference rates across the 43 pairs of alternatives they tested. This failure to observe clear-cut categories does not immediately support the idea of a radical split between lexically-given and contextually-determined alternatives. At the same time, it suggests the possibility of a model that integrates contextual factors at the lexical level: it could be that speakers conventionally encode lexical entries on a conceptual space that includes semantic distance and boundedness, and that these properties are leveraged, and maybe further contextually weighed, in the process of computing scalar implicatures. Finally, some work has begun to investigate the extent to which the availability of alternatives can be modulated through manipulation of the presence of potential alternatives in the experimental context. Kim (2016) shows that varying the frequency of occurrences of all within an experiment has a substantial impact on the frequency of implicature interpretations of some, suggesting that the salience of an alternative in context directly modulates the salience of the implicature. With regard to symmetric alternatives, on the other hand, Uppili (2018) finds that including, for example, only some in an experiment leads to higher rates of implicature interpretations of some, that is, participants assimilate some to only some, rather than considering it as an alternative that some might contrast with. As the empirical picture becomes more fleshed out, it is important to integrate insights from the experimental and theoretical literatures further. This will not only help determine which theoretical take on alternatives best aligns with empirical findings, but it will also help test new hypotheses that arise from more refined theoretical proposals. Ultimately, this will get us closer to answering the question of how much information needs to be lexically encoded to account for the range of observed implicatures.
9.4 Scale structures and adjective meanings The previous section highlighted how the notion of alternatives raises the question of whether accounting for scalar implicatures requires positing certain ingredients to be lexically encoded. We saw that while Horn scales have traditionally been thought of as “given to us,” recent proposals have worked towards deriving rather than stipulating alternative sets, with some success but also with some limitations. At the end of the day, it remains a genuinely open question at this point whether at least some lexical entries are associated with (something like) a conventionally encoded Horn scale. In this section, we will focus on another notion of scale traditionally taken to be lexically given, which in part (but only in part) can be directly related to the notion of Horn scales.
Pragmatics and the lexicon 187 As detailed below, so-called gradable adjectives, like warm and hot, crucially invoke degree scales as their key semantic ingredient. The case of warm and hot in particular indicates a connection between the two notions of scales: both adjectives refer to the same degree scale (of temperature) and differ only in their respective thresholds on that scale; and they also are standardly seen (on the relevant accounts) as forming a Horn scale, < warm,hot >, to account for the not hot implicature of warm. The central role of the very same degree scale in the lexical entries with the different thresholds opens up the possibility that all the information that is needed to encode this Horn scale is already independently present in the lexicon. This specific deflationist option, of course, can only apply to alternatives that consist of gradable adjectives; the issues in the previous section, however, are mostly concerned with expressions of a different nature, and considerations about what content gradable adjectives lexically encode will therefore not settle all those issues by any means. In light of space constraints, we are unable to discuss the relation between the two notions of scales in more detail, even though the topic promises further insight into the lexical roots of some pragmatic phenomena.6 We will, however, critically review the traditional lexicalist view whereby gradable adjectives are mentally represented in the form of degree scales. While the ultimate conclusions here may be different from those warranted for scalar implicatures in general, our discussion highlights parallels between issues in the two domains, specifically concerning the tension between the (at least apparent) need for lexical encoding and the desire for general mechanisms that can be applied across the board.
9.4.1 Gradable vs. non-gradable adjectives It has long been noted that two classes of adjectives can be distinguished, namely the gradable vs. non-gradable ones, based on their distributional properties (Sapir, 1944). Gradable adjectives naturally appear in a number of so-called degree constructions whereas non-gradable adjectives generally cannot be used in these. For example, the adjective cheap is usually classified as gradable (18) whereas the adjective free (in the free of cost sense) is not (19) (question marks indicate that the sentences are degraded). (18)
a. Attending college is cheaper in France than in the United States.
b. How cheap is attending college in France?
c. Attending college is very cheap in France.
(19)
a. ?? Attending college is freer in France than in the United States.
b. ?? How free is attending college in France?
c. ?? Attending college is very free in France.
6 We refer the interested readers to Chanchaochai and Zehr (2019) who argue for an alternative-based analysis of certain Thai degree constructions, thus fleshing out the connection between the two types of scales further to account for a specific new set of linguistic data. On a similar note, Beltrama (2018) offers an alternative-based account of emphatic simply for extreme adjectives.
188 Florian Schwarz and Jérémy Zehr In light of the main theme of the present chapter, we can now ask whether gradability is encoded lexically and, if so, what type of information must be stored in the lexicon in order to capture the semantic properties of adjectives with regard to gradability. Intuitively, the meaning of comparative uses of gradable adjectives, as in cheaper, amounts to comparing the degree to which two entities have a certain property. The comparative morpheme (more/-er) introduces the element of comparison. But according to one prominent approach, the degree element is already present in the adjective meaning itself, so that cheap on its own denotes a relation that maps individuals onto degrees on the relevant scale, here involving cost (Cresswell, 1976; Kennedy and McNally, 2005, a.o.). This is fundamentally different from the meaning of non-gradable adjectives, which have standard predicate denotations (corresponding to sets of individuals). On this type of view, the gradability properties of a given adjective are fundamentally lexical, as they are part and parcel of its core semantic make-up. In other words, the minimal contrast between (18) and (19) is taken to indicate that the gradability of cheap and the non-gradability of free inherently follow from their meanings, leaving little room for contextual factors to play a role in an adjective’s distributional pattern in this regard. At the same time, as we saw with implicatures, there is a general inclination to account for aspects of meaning as deriving from more general principles whenever possible, rather than simply stipulating them in the lexicon. And so one could also entertain the notion that speakers need not encode the (non-)gradability of cheap vs. free in their mental lexicon, and that the contrastive patterns in (18) vs. (19) obtain from general pragmatic considerations rather than a difference in the type of meaning involved in the two cases. For example, speakers could store both cheap and free as simple predicates relating to cost, but reason that the very concept of something being free leaves no room for further ordering of cost, as otherwise required by degree constructions like those exemplified in (18). A potential line of reasoning supporting such a pragmatic view of gradability could point to the apparent fluidity of the gradability split. While the adjective pregnant is standardly considered a non-gradable adjective, it can still naturally appear in degree constructions in everyday conversations, as in (20).7 (20)
a. Do you think I could be more pregnant than I thought?
b. We just found out we’re pregnant, find out how pregnant I am!
c. This is what it’s like to be very pregnant.
If one were to strictly maintain that each adjective is stored in the mental lexicon as either a relation involving degrees on a scale or a binary property that can be modeled as a set, these data introduce some uncertainty as to whether pregnant should be encoded as non-gradable, following the tradition in the literature and its fundamental either-or
7 The examples in (20) were found online at, respectively, https://www.mumsnet.com/Talk/pregnancy/710893-Do-you-think-I-could-be-more-pregnant-than-I; https://www.youtube.com/watch?v=e2DkWskS61U; and https://www.scarymommy.com/what-its-like-to-be-very-pregnant/.
Pragmatics and the lexicon 189 semantics, or as gradable, in an effort to account for data like (20). A possible resolution of this issue from the perspective of the traditional analysis of pregnant as non-gradable is to enrich the interpretative machinery with an optional mechanism that can coerce non-gradable expressions into gradable ones by introducing contextually salient and plausible scales (e.g., of time passed pregnant in the cases above).8 One can then maintain the split in lexical representations, where gradable adjectives involve degrees on lexically defined scales, whereas non-gradable adjectives require a modification via coercion of their core semantics in context to account for (more or less exceptional) gradable uses.9 By defending a lexicalist approach, coercion-based proposals are able to address the tension in how gradability manifests itself: as a linguistic phenomenon giving rise to specific grammatical constructions (recall the contrast between (18) and (19)) but, at the same time, with a determining role of extra-linguistic factors (recall how (20) came with somewhat atypical conceptions of pregnancy). Those proposals render the linguistic aspect of gradability as clear-cut lexical categories that appear in specific grammatical constructions, and its extra-linguistic aspect as a coercion operation that provides leeway to redraw those category boundaries when contextually motivated.
9.4.2 Relative vs. absolute adjectives
9.4.2.1 Bounded vs. open scales Among gradable adjectives, we can further distinguish between so-called relative vs. absolute gradable adjectives (Unger, 1975; Kennedy and McNally, 2005). For example, dry is a gradable adjective allowing comparatives, and so forth (21), patterning with cheap (and contrasting with non-gradable free), but it also patterns with the non-gradable adjective free in some respects (22). (21)
a. Californian summers are drier than Arizonian summers.
b. How dry are Californian summers?
c. Californian summers are very dry.
(22)
a. ?? Attending college in France is completely cheap.
b. Attending college in France is completely free.
c. The soil is completely dry.
This is because unlike cheap, which is a relative adjective, dry can target an endpoint, that is, it is an absolute adjective. This can be modeled in terms of the structure of the
8 Coercion is a type of operation often introduced to account for a range of linguistic phenomena, of which gradability is only one. See Lauwers and Willems (2011) for an overview. 9 Note that this is not dissimilar to the discussion of ad hoc scales for scalar implicatures, a potential parallel that is worth exploring in future research.
190 Florian Schwarz and Jérémy Zehr scales involved: relative adjectives are associated with open scales, whereas absolute adjectives are associated with bounded scales.10 Importantly, there are no clear general conceptual grounds that would require the association of cheap with an open scale and that of dry with a bounded scale. To see this, imagine the adjective freap, defined as follows: (i) what is free is completely freap, and (ii) if not free, then the cheaper, the freaper. Formally put, freap is associated with a bounded scale: the degrees on the scale are ordered following the costs they formalize, and null costs are mapped to the endpoint degree of the scale. There is no a priori conceptual reason why cheap should not mean what freap means, for the scale of freap has the exact same structure as that of the attested adjective dry and it actually maps rather directly to the (non-linguistic) measure of cost. The apparent absence of a general pragmatic principle that could rule out the possibility of an absolute counterpart of cheap (i.e., freap) suggests that the relative nature of cheap is merely a conventionally fixed property, which accordingly has to be part of its lexical entry.
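To see what encoding scale structure lexically amounts to, the following minimal sketch (our own illustrative construction in Python, not a formalization from the literature cited here) represents a gradable adjective as a measure function paired with a scale: relative adjectives rely on a contextually supplied threshold on an open scale, absolute adjectives can use a scale endpoint, and the hypothetical freap shows that a bounded counterpart of cheap is formally unproblematic.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Minimal degree-semantics sketch: a gradable adjective pairs a measure
# function (individuals -> degrees) with a scale. Relative adjectives need a
# contextually supplied threshold; absolute adjectives can use a scale
# endpoint as a non-arbitrary threshold. All details are illustrative.

@dataclass
class GradableAdj:
    measure: Callable[[dict], float]      # maps an entity to a degree
    max_degree: Optional[float] = None    # None = open scale (relative adjective)

    def holds(self, entity, context_threshold=None):
        degree = self.measure(entity)
        if self.max_degree is not None:           # absolute: bounded scale
            return degree >= self.max_degree
        if context_threshold is None:
            raise ValueError("relative adjective needs a contextual threshold")
        return degree >= context_threshold        # relative: open scale

# 'cheap': the cheaper, the higher the degree; no maximal degree (open scale).
cheap = GradableAdj(measure=lambda x: -x["cost"])

# 'freap': same ordering by cost, but null cost is mapped to the endpoint 0,
# so the scale is bounded and the adjective behaves like absolute 'dry'.
freap = GradableAdj(measure=lambda x: -x["cost"], max_degree=0.0)

college_fr = {"cost": 500}
college_us = {"cost": 30000}

print(cheap.holds(college_fr, context_threshold=-2000))   # True in this context
print(cheap.holds(college_us, context_threshold=-2000))   # False
print(freap.holds({"cost": 0}))                            # True: completely freap
print(freap.holds(college_fr))                             # False: not free
```

Nothing in the representation itself rules freap out; whatever excludes it must be a matter of convention, which is the point made above about the relative nature of cheap being lexically fixed.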
9.4.2.2 Relative adjectives and context The relative vs. absolute split is relevant to the focus of the present chapter in yet another way, in that relative adjectives seem to exhibit a lexically encoded dependence on context that absolute ones do not. Compare contexts where we are talking about either a one-room apartment in a small town or about houses in Manhattan. The sentence A rent of $2,000 is cheap will likely be false in the former but true in the latter case. Note that speakers can also offer a class of comparison explicitly, using for-phrases (23a). However, as Siegel (1979) notes, this does not generally seem to be available for absolute adjectives (23b), suggesting that their semantic contribution is not context-dependent in the same way. (23)
a. That’s a cheap rent for a full house in Manhattan!
b. ?? That’s a dry towel for a bath towel!
While such data suggest that some, and only some, gradable adjectives should encode context-dependence in their lexical entry (at least insofar as classes of comparison modulate their threshold), other data once again question the sharpness of the division (Sassoon and Toledo, 2011; McNally, 2011). (24)
a. The soil is dry for this usually very green land.
b. The towel is very dry for a towel stored in a sauna.
Two approaches to this offer themselves: (i) speakers store semantic representations for relative and absolute adjectives that formally differ in whether they require
10 See Kennedy (2007) for an explicit formalization of expensive as denoting an open scale—excluding null costs—and Lassiter (2010) and Lassiter and Goodman (2013) who take issue with it.
Pragmatics and the lexicon 191 contextual parameters. (23b) would then be degraded because the semantic entry for dry is incompatible with contextual modification, though an additional coercion mechanism can override this. Alternatively, (ii) speakers store semantic representations of the same formal type for relative and absolute adjectives (thus explaining their co-occurrence in degree constructions) and context-sensitivity emerges as a split that is not lexically stipulated. The second solution has been favored both by supporters and critics of degree semantics. Defending a degree semantics approach, Kennedy (2007) assumes the open vs. bounded scale structures we mentioned earlier and derives context sensitivity from there: all gradable adjectives convey the surpassing of a degree point on their scale, but only bounded scales define a non-arbitrary degree of reference (their endpoint), whereas speakers need to recruit contextual factors to establish a degree of reference on open scales. From another perspective, Burnett (2014), who assumes that gradable and non-gradable adjectives alike define binary properties at their core, proposes to derive gradability from a context-sensitive semantics for both relative and absolute adjectives, but argues that the two types of adjectives are dependent on context in crucially different ways. Lassiter and Goodman propose a more radical approach where all gradable adjectives define the exact same type of semantics at their core and where “the relative/absolute distinction is not a binary distinction, but a matter of degree” (Lassiter and Goodman 2013, pp. 601–602). At the end of the day, the question remains relatively open as to what principles underlie the categorization of different adjectives as relative vs. absolute in the mental lexicon, specifically with respect to how much of this categorization is derived with every single interpretation and how much has to be learned and stored in the lexicon. The semantics-pragmatics division we posit makes this categorization question pervasive, for usage often does not perfectly align with ontologies posited by semanticists. As we have seen with gradable adjectives in particular, one standard solution is an analysis in terms of coercion. Results from recent experimental work focusing on the relative-absolute distinction have elicited signals of additional processing associated with cross-category interpretations (Frazier, Clifton, and Stolterfoht, 2008; Bogal- Allbritten, 2012; Aparicio, Xiang, and Kennedy, 2015). While those results are consistent with coercion-based views that assume a sharp lexical distinction between relative and absolute adjectives, more experimental work is needed before one can discard analyses that attribute more uniform semantics to categories of gradable, or even non-gradable, adjectives.
9.5 Presuppositions The notion of presupposition constitutes the first instance in the history of the study of natural language meaning where a distinction between different layers of meaning was invoked. Frege (1892) not only laid the foundations for the compositional approach
of meaning alluded to in Section 9.2, but also discussed issues that arise because speakers’ truth-value judgments with regard to sentences like (25) tend to be unclear or inconsistent. (25)
The Queen of France is not bald.
Following the Frege-Strawson analysis (Strawson, 1964), the source of confusion is the meaning of the, which contributes a presupposition that there is a unique referent fitting the noun phrase description. If this presupposition (here that there is a unique Queen of France) fails to be satisfied, then semantic composition cannot proceed and the sentence fails to receive a truth value. Expressions that exhibit properties parallel to those of the here are standardly referred to as presupposition triggers. Given the association of certain lexical items with presuppositions, one may posit an additional layer of presuppositional information as part of (certain) lexical entries. Unsurprisingly, given the overall thread of this chapter, there are ongoing debates whether this is indeed warranted, or whether a more explanatory account, where presuppositional properties are derived in a more general way, is called for. In light of the multi-faceted set of issues in accounting for presuppositions theoretically, variations of this question arise concretely for separate aspects of the phenomenon. First, there is the question of where presuppositions come from, and how they attain their status—what is referred to as the “triggering problem.” Second, the apparent variation in more fine-grained properties across (classes of) presupposition triggers has given rise to the suggestion that triggers differ in how and to what extent the presuppositional nature of certain bits of information is lexically encoded. Finally, the special behavior of presupposition triggers in embedded environments—what is known as the “projection problem” (introduced in more detail in the following section)—raises questions about the extent to which embedding operators and connectives have to lexically encode specific aspects of how they pass on presuppositions to the larger structure they appear in. We turn to these respective issues in turn in the following sections.
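A minimal way to make the Frege-Strawson idea concrete is to treat sentence meanings as partial: when the existence-and-uniqueness presupposition of the fails, the sentence is assigned no truth value at all. The sketch below (with an invented toy domain) illustrates this schematically in Python; it is not an implementation of any particular formal system.

```python
# Schematic Frege-Strawson treatment of 'the': if there is no unique entity
# satisfying the description, the sentence receives no truth value (None)
# rather than True or False. Domain and predicates are invented for illustration.

domain = [
    {"name": "Elizabeth", "queen_of": "England", "bald": False},
    {"name": "Margrethe", "queen_of": "Denmark", "bald": False},
]

def the(description):
    """Return the unique satisfier of `description`, or None if the
    existence-and-uniqueness presupposition fails."""
    satisfiers = [x for x in domain if description(x)]
    return satisfiers[0] if len(satisfiers) == 1 else None

def is_not_bald(subject):
    if subject is None:          # presupposition failure: no truth value
        return None
    return not subject["bald"]

print(is_not_bald(the(lambda x: x["queen_of"] == "England")))  # True
print(is_not_bald(the(lambda x: x["queen_of"] == "France")))   # None (no Queen of France)
```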
9.5.1 The triggering problem A full account of presuppositional phenomena requires answering the following questions: (i) Which expressions are presupposition triggers? (ii) What do they presuppose (potentially in contrast to other, non-presupposed content)? (iii) What makes the relevant content have presuppositional status? The standard take on the first two questions relies on how presuppositions interact with the compositional process, in particular with regard to embedding under certain types of operators: presuppositions ‘escape’ (i.e., remain unaffected by) operators such as negation, modals, and if, all of which are “entailment-canceling” operators (cf. the family of sentences tests in Chierchia and
McConnell-Ginet, 1990).11 This phenomenon is known as “presupposition projection.” To illustrate, start by considering the meaning of (26), which can be seen as involving the components below. (26)
Lola stopped running.
a. First Lola was running, . . .
b. . . . and then she was not running.
These two components of the meaning contributed by stop to the sentence behave differently under embedding, for example, only the second part appears to be affected by sentential negation (paraphrased below in bold): (27)
Lola didn’t stop running.
a. First Lola was running, . . .
b. . . . and then it-was-not-the-case-that she was not running. (i.e., she was (still) running)
In other words, the first part, which is unaffected by negation, is presupposed and projects, and it is diagnostics like these that allow us to identify presupposition triggers and the content they presuppose. An additional aspect of presuppositions is that they tend to be pragmatically backgrounded, and thus not constitute the main point of the sentence. The third question above constitutes the triggering problem, namely how it comes to be that the content in (26a) has presuppositional status. One answer, long dominant in the literature at least since Heim (1983, building on Karttunen, 1973 and Stalnaker, 1973, but in a new formal framework), is that presuppositional information is separately encoded as such in lexical entries of the relevant expressions, as a condition on which contexts a presuppositional sentence can be uttered in.12 However, much recent work, reviving a perspective first raised in early work by Stalnaker (1974), has raised what amounts to an explanatory challenge questioning why it should be lexically stipulated that certain parts of the meaning of an expression like stop, but not others, have presuppositional status. After all, it doesn’t seem like a coincidence that the split in (26) is as it is, as it can be found systematically across languages (and there don’t seem to be any attested variants of stop that do this the other way around). The alternative pursued in this line of work is to derive the presuppositional status from more general principles. The key assumption that is generally shared across variants of such accounts is that the content that winds up being presupposed is lexically 11 Two caveats: first, there are other expressions with similar, though arguably distinguishable, behavior under embedding, for example, conventional implicatures in the sense of Potts (2005); second, presuppositions are also taken to have a second defining property, of being backgrounded or taken for granted, which we’ll come back to momentarily. 12 In Heim’s context change semantics, this is technically implemented by utilizing partial functions.
encoded as part of the simply entailed content, that is, that (26a) and (26b) are entirely on a par as far as the lexical semantics is concerned. The special status of (26a) and its behavior under embedding operators is then derived pragmatically in one way or another. To sketch the general gist of the central idea, it’s useful to start by considering what would happen if sentential negation applied to the content of the sentence in (27) wholesale: (28)
It-is-not-the-case-that . . .
a. . . . first, Lola was running, . . .
b. . . . and then she was not running.
Note, first, that the negation of a conjunction as a whole (¬(p&q)) is strictly weaker than the projection reading with only one conjunct negated (p&¬q), meaning that this is compatible with situations where the projection reading is false, for example, if Lola was not running to begin with. The negation of stop actually seems to permit such interpretations when called for in context: (29)
A: I think Lola may have stopped running.
B: No—She didn’t stop running, since she wasn’t running in the first place.
Thus, in contrast to the general case where we find projection, the content in (26a) does not necessarily seem to escape negation, and any theoretical approach will have to account for this option. On lexicalist views, an additional operator such as that of “Local Accommodation” in Heim (1983) has to be invoked.13 On the other approaches we’re now considering, the availability of this interpretation falls out automatically, since this content is lexically encoded as regular content. But what do these accounts say about the interpretation involving projection above? One family of approaches invokes parallels to implicatures, by positing alternatives for presupposition triggers that form the basis for reasoning leading to the projection interpretation. For example, Abusch (2002) assumes that continue is an alternative to stop, along with a principle that one of the alternatives of a sentence has to be true in a given context. The fact that continue shares the ingredient in (26a) then is used to account for the presuppositional status of this content (though the account has to say more to explain projection; see Abusch, 2010 for more details). In a similar vein, Romoli (2015) accounts for presupposition projection by alluding to reasoning that is even more closely parallel to that involved in implicatures. He proposes that stop is associated with used to as an alternative, that is, the lexical encoding of just the first ingredient in (26) 13 The
phrase “(Global) Accommodation” was initially reserved for situations where interlocutors accept some information as backgrounded that is yet new to them, for questioning the presupposition would otherwise disrupt the flow of the conversation (Stalnaker, 1973). Local Accommodation also avoids such a presupposition crash, but it does so in linguistic environments that end up cancelling the information in some form at the global level, for example negation as in (29).
Pragmatics and the lexicon 195 above. Under negation, this yields a strictly stronger reading (i.e., John didn’t use to run entails John didn’t stop running), and the idea is that just as in the case of implicatures, the projection interpretation with the ‘ran before’ content not affected by negation comes about via strengthening through the negation of the stronger alternative: It-is- not-the-case that Lola didn’t used to run amounts to Lola used to run—precisely the content of the projected presupposition. While we have to refer the reader to the original proposals for further details, an obvious issue that arises for these accounts is just why the relevant presupposition triggers are associated with the alternatives posited on the respective accounts (and not others)—this is entirely parallel to the issue of the lexical status of alternatives for implicatures discussed in Section 9.3. Another interesting aspect of Romoli’s proposal in particular is its empirical prediction that speakers derive so- called presuppositions, as in (26), in the same way that they derive scalar implicatures. A number of experimental studies (Romoli and Schwarz, 2015; Bill, Romoli, Schwarz, and Crain, 2016; Kennedy, Bill, Schwarz et al., 2015; Bill, Romoli, and Schwarz, 2018) have tested precisely this prediction, and while there are some parallels, the overall evidence of different behavioral results for presuppositions and implicatures is more in line with lexicalist accounts of presuppositions (or at any rate, accounts that predict different behavioral patterns for the two types of inferences). Another influential line of work, starting with Simons (2001), has tied presuppositional status to the discourse structure, modeled explicitly in later work (Beaver, Roberts, Simons, and Tonhauser, 2017) in terms of the Question Under Discussion (QUD: Roberts, 1996), the main idea being that whatever follows from the QUD will be not at-issue, and thus presupposed (in contrast to at-issue information that contributes to resolving the QUD). A key property, and potential problem, for this type of approach is that it attributes a central role to context in determining whether a piece of information ends up being presupposed, with no hard-wired lexical split between presuppositional vs. non-presuppositional lexical units. While proponents of this perspective have put forth experimental results suggesting a correspondingly gradient empirical picture of projection (Tonhauser, Beaver, and Degen, 2018), other recent experimental work (Djärv and Bacovcin, 2017; Djärv, 2019) finds evidence for clustering of presuppositional vs. non-presuppositional lexical items, with Mandelkern, Zehr, Romoli, and Schwarz (2019) adding further methodological variants of the relevant paradigms that strengthen the evidence for a more categorical split in what expressions introduce projecting content. A final line of attack to tackle the explanatory challenge under consideration ties the backgroundedness of presuppositions to whether or not they are at the attentional focus in terms of the temporal unfolding of events (Abrusán, 2011, 2016).14 For stop, for example, the idea is that the main reference time introduced by tense concerns the information in (26b) (of not running), whereas the part in (26a) is about a preceding time, which is not part of the main events attended to relative to the sentence. The
14 Also see Qing, Goodman, and Daniel (2016) for a broadly similar approach.
backgroundedness of this information is then seen as the basis for it projecting, though projection itself is delegated to other mechanisms operating on backgrounded content. In sum, the question of how presuppositional content winds up having its special status is subject to a lively and ongoing debate, with some relevant experimental work and potential for much more. One aspect that is shared by most of the approaches considered here is that they intend to only deal with certain types of presuppositions, but not others. We turn to some related questions about differences in the lexical nature across presupposition triggers in the next section.
9.5.2 Differentiating classes of presupposition triggers One central topic in the recent literature has been the apparent need to distinguish different sub-classes of presupposition triggers (and many of the pragmatic approaches to the triggering problem above only are intended to account for one of these). A particularly influential line of argument from Abusch (2002) draws on contrasts such as the following (adapted from Abusch): (30)
John will either attend the first meeting, or miss it.
a. ?? And he will either attend the second meeting too, or miss the second meeting too.
b. And he will either stop attending meetings, or stop missing them.
The version with the presupposition trigger too seems infelicitous, apparently due to the conflict between the two presuppositions (of attending/missing the first meeting), both with one another and the global context. In contrast, the very much parallel presupposition of stop does not give rise to the same issue, because it allows for a non-projecting interpretation (where the notion that he previously attended/ missed meetings is interpreted within the respective disjuncts). Abusch terms triggers of the latter sort, which allow for non-projecting readings relatively easily, “soft” triggers, and the former, which resist non-projecting readings, “hard” triggers. These and similar observations have led various authors to attempt to construe differences in how their presuppositional content is encoded in the lexical structure of the different types of triggers. While some authors (Kripke, 2009; Zeevat, 1992) distinguish between triggers that have an anaphoric element (such as too) from those that don’t, others—such as the various pragmatic approaches discussed in the previous section—are based on the idea that the presuppositional status of the relevant information contributed by soft triggers is pragmatically derived, and not lexically encoded, whereas the lexical entries of hard triggers may encode it more directly. In another line of work, Klinedinst (2012, 2016) and Sudo (2012) independently present an account of the soft-hard split that assumes separate layers of encoding for entailed and presupposed content. They propose that while hard triggers lexically encode totally independent information on their presuppositional and conventional layers, soft triggers
redundantly encode the information from their presuppositional layer on their conventional layer as well (e.g., the information of having done X before presupposed by stop is both presupposed and entailed). This redundant encoding then is called upon to explain the easy availability of non-projecting readings for soft triggers, which by assumption already contribute the presupposition to their local context, even if one ignores the presuppositional layer. In the case of hard triggers, ignoring their presuppositional layer would result in a net loss of information, unless additional steps were taken to add it to the entailed layer (e.g., via local accommodation). Notably, this approach predicts that even though soft triggers normally make a contribution at the presuppositional level, they also necessarily contribute the same piece of information at the conventional level as well.15
9.5.3 Connectives and projection The question of how presupposition-related information ought to be represented in the lexicon, and whether it should be represented at the lexical level at all, is not only limited to presupposition triggers but also arises when considering how presuppositions interact with embedding operators to yield (or not yield) projection. In addition to the projection data considered above, a key aspect of presuppositions is that they can be “filtered” (in the terminology of Karttunen, 1973) by other parts of the sentence: (31)
a. If Lola is taking night classes now, then she stopped running.
b. If Lola used to run Marathons, then she stopped running.
(32)
a. Lola is taking night classes now and she stopped running.
b. Lola used to run Marathons and she stopped running.
The respective first parts of these conditionals and conjunctions affect whether the sentences as a whole presuppose that Lola used to run (based on the trigger stop)— as the (a)-versions do but the (b)-versions do not. Intuitively, this is because at the point the trigger is considered, the notion of her having been a runner in the past is already introduced by the other clause. While we cannot go into the details of the formal attempts to capture these patterns—which have concerned a substantial literature for several decades—a key tenet present from the start is that at least for certain connectives, linear order crucially interacts with the projection pattern, for example, the reverse order of conjuncts yields a relatively odd sentence:16 (33)
?? Lola stopped running and she used to run Marathons.
15 Corresponding predictions based on examples by Sudo (2012) for environments involving non-monotonic quantifiers have been experimentally examined by Zehr and Schwarz (2018), with some support for a split between triggers, though the results leave open a number of currently unsolved questions and issues.
16 Changing the order in conditionals gives rise to even more complications; see Mandelkern and Romoli (2017), Romoli and Mandelkern (2017) for discussion.
198 Florian Schwarz and Jérémy Zehr Since and is usually modeled in terms of conjunction of statement logic, which is symmetric, this asymmetry has to be accounted for in some way. The dynamic context change semantics proposed by Heim (1983) builds the asymmetry directly into the lexical entry for and, by stipulating that the presuppositions introduced in the second conjunct get evaluated relative to the first conjunct (in combination with the preceding discourse context), whereas those in the first conjunct get evaluated relative to the discourse context only. This has been criticized as lacking in explanatory adequacy, and an influential line of recent work (most notably, Schlenker 2009) has revived a notion first introduced by Stalnaker (1973), namely that the asymmetry results from the linear precedence of the first conjunct. It generalizes this approach further to other relevant connectives by providing a classical logical, and symmetric, account that is combined with a pragmatic notion of local context that unfolds linearly and thus can account for the observed asymmetry. An important difference between these lexical and pragmatic accounts of the projection properties of connectives such as conjunction is that while the former take the asymmetry to be a hard-wired property, it is natural to treat them as an overridable default on the latter. In other words, pragmatic accounts in principle allow for the content of a linearly second conjunct to be taken into account when evaluating the presupposition of the linearly first conjunct. Several experimental projects have aimed to address related questions, with Chemla and Schlenker (2012) arguing their data to support the pragmatic view when looking at parallel issues for disjunction, conditionals, and unless-clauses. However, more recent experimental evidence homing in on the key case of conjunction (Mandelkern et al., 2019) suggests that speakers generally treat projection from conjunction asymmetrically, and do not resort to considering linearly later information, even when that is the only way to rescue the felicity of a sentence. In particular, they judge the continuation (34a) where stop appears after and more natural than the minimally different continuation (34b) where stop instead appears before and. (Embedding the conjunction in a conditional preceded by an explicit ignorance context ensures that the presuppositional properties of stop are tested here.) (34)
I never see Lola run. I don’t know if she ever was a runner, but...
a. if she used to run the Marathon and she stopped running, she needs to find another sports activity.
b. if she stopped running and she used to run the Marathon, she needs to find another sports activity.
While further work is needed to reconcile the seemingly conflicting results in the literature by extending this paradigm to other connectives, there clearly is a tension between the drive for explanatory adequacy on the theoretical side and the apparent rigidity of projection asymmetry found here empirically for conjunction.
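For readers who find it helpful to see the filtering pattern spelled out procedurally, here is a schematic sketch of a Heim-style dynamic update, in which a context is a set of possible worlds, a presuppositional clause is only defined in local contexts that entail its presupposition, and conjunction updates with the first conjunct before the second. The representation of worlds and clauses is a toy construction of our own, intended only to show how the left-to-right asymmetry yields filtering in (32b) but failure in (33).

```python
# Schematic Heim-style dynamic update. A context is a set of possible worlds;
# a clause is a (presupposition, assertion) pair of functions over worlds;
# update is only defined when the incoming (local) context entails the
# presupposition. Worlds are toy pairs (lola_ran_before, lola_runs_now).

TRIVIAL = lambda w: True

def update(context, presup, assertion):
    if not all(presup(w) for w in context):
        raise ValueError("presupposition failure (accommodation would be needed)")
    return {w for w in context if assertion(w)}

def conjoin(context, first, second):
    # Asymmetric: the second conjunct's presupposition is checked against the
    # context already updated with the first conjunct (its local context).
    return update(update(context, *first), *second)

# All four combinations of (ran before?, runs now?) are live options.
context = {(True, True), (True, False), (False, True), (False, False)}

used_to_run = (TRIVIAL, lambda w: w[0])                 # no presupposition
stopped_running = (lambda w: w[0], lambda w: not w[1])  # presupposes w[0]

# (32b) 'Lola used to run Marathons and she stopped running': the first
# conjunct filters the presupposition of 'stop', so the update is defined.
print(conjoin(context, used_to_run, stopped_running))   # {(True, False)}

# (33) '?? Lola stopped running and she used to run Marathons': 'stop' is
# evaluated against the unfiltered context, so its presupposition fails.
try:
    conjoin(context, stopped_running, used_to_run)
except ValueError as err:
    print(err)
```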
9.6 Conclusion and outlook We have aimed to provide a broad overview of issues relating pragmatic phenomena arising in language use to questions about what related information has to be construed as being encoded in the mental lexicon. We focused on three concrete areas for more detailed illustration, with particular emphasis on issues pertaining to potential lexical encoding. In the realm of scalar implicatures, we reviewed the debates about whether the alternatives that crucially feature in implicature reasoning have to be represented at the level of the lexicon, in the form of Horn scales. We furthermore discussed the role of scales in the interpretation of gradable adjectives and their lexical representations also with regard to different types of scales and the potential need for encoding this lexically. Finally, we reviewed presuppositional phenomena, with questions arising about whether the presuppositional status of certain information encoded by presupposition triggers has to be marked as such lexically, or whether it can be derived pragmatically. Additionally, the question of differences between triggers and corresponding differentiation in lexical encoding came up, as well as the issue of whether projection properties of connectives such as conjunction have to be lexically represented. There is, of course, a clear common thread throughout, which is at the core of the semantics- pragmatics interface, and the language-cognition interface more generally: For any given interpretive effect of a linguistic expression, we can ask whether it comes about through conventional encoding, or whether it can be explained independently, typically in domain-general (i.e., not language-specific) terms. The latter route makes for a more minimal theory but often leads to challenges in capturing empirical nuances. The former route tends to be more powerful, but risks being overly stipulative theoretically, and faces challenges in accounting for how the relevant conventional encoding is learned. There are, needless to say, many other phenomena in this realm that could have been discussed as well (e.g., implicated presuppositions, to name just one), but we hope to have given the reader a taste of the types of debates in the linguistic study of meaning, broadly construed, and how they relate to highly general questions about lexical representations and general reasoning and cognition; and to have conveyed that this is a rich area where new empirical directions combined with ever more refined theoretical considerations promise many fruitful lines of future work.
Chapter 10
Efficient Communication and the Organization of the Lexicon
Kyle Mahowald, Isabelle Dautriche, Mika Braginsky, and Edward Gibson
10.1 Introduction If you were constructing a language, how would you design a lexicon that could most effectively be used by speakers in order to efficiently communicate messages? Should all words be about equally frequent, or should some words be re-used many orders of magnitude more often than others? How long would you make the words in your lexicon: should they all be as short as possible? How would you carve up the semantic space? Would you want the meaning of a word to constrain its form, or should the relationship between form and meaning be completely arbitrary? How would you make sure that babies are able to learn the words in your lexicon?1 With natural language, of course, no one has to make these design decisions. Instead, they are sorted out through the organic process of language use and evolution, as people do the messy work of communicating with one another. There is now copious evidence that this process results in languages that are relatively efficient for the purposes of communication. Therefore, some of the observed features of language can be explained by understanding the cognitive and communicative constraints on the language system. This evidence comes from work in the functional linguistic tradition (Bybee and Hopper, 2001; Haspelmath, 2004), in linguistic evolution (Christiansen and Kirby, 2003; Cornish, 2010; Kirby, 1999; Kirby, Cornish, and Smith, 2008), and from a
1 For helpful comments and feedback on this chapter, we thank Steve Piantadosi, Ramon Ferrer-i-Cancho, and an anonymous reviewer.
Efficient Communication and The Organization of The Lexicon 201 line of work that uses ideas from computer science to model natural language as an efficient communication system (Ferrer, 2001; Fenk-Oczlon, 2001; Ferrer-i-Cancho and Solé, 2003; Gibson, Bergen, and Piantadosi, 2013; Gibson, Futrell, Piantadosi et al., 2019; Levy, 2008). There is reason to believe in pressure toward efficient linguistic structure across domains, including phonology (Boersma and Hamann, 2009; Flemming, 2004; Priva, 2008), syntax (Fedzechkina, Jaeger, and Newport, 2012; Ferrer-i-Cancho, Hernández- Fernández, Lusseau et al., 2013; Ferrer-i-Cancho and Solé, 2003; Futrell, Mahowald, and Gibson, 2015; Gibson, 1998; Gibson et al., 2013; Hawkins, 1994; Jurafsky, 1996; Kirby and Hurford, 2002; Kravtchenko, 2014; Levy, 2008; Liu, 2008), and pragmatics (Frank and Goodman, 2012; Goodman, Tenenbaum, Feldman, and Griffiths, 2008; Zaslavsky, Hu, and Levy, 2020), across spoken and signed modalities (Slonimska, Özyürek, and Capirci, 2020). Since this book is about the lexicon, we will limit our scope to lexical topics and questions of lexical design. While the lexicon is of course inextricably linked to a language’s phonemic system and its morphological system, we will leave those topics to other work and instead focus mostly on large-scale lexical structure and its relationship to communication. Specifically, in this chapter, we will use the structure of the lexicon in order to reverse- engineer the de facto design decisions that emerge through language use. Understanding these design decisions can help us understand human language processing more generally. We will review work that investigates how the lexicon is informed by—and can therefore illuminate—both cognitive and communicative constraints on natural language. Central to this project are ideas from computer science, especially information theory, about what makes for efficient communication systems. Information theory (Shannon, 1948) gives us a framework for thinking about how to design a code that most effectively conveys information from a source to a receiver over a noisy channel. Applying these ideas to natural language has been fruitful in explaining features of the lexicon. To that end, rather than attempting a comprehensive review of all work on the organization of the lexicon, we will focus in particular on work in the information-theoretic/communicative tradition. That is, while the information-theoretic framework gives a set of formal principles for defining efficient communication, it is less straightforward to define a mathematical framework for what makes a code maximally efficient for cognitive processing. This cognitive constraint marks an important distinction between the theoretical systems dealt with in information theory and natural language. Natural languages have to be learnable by babies, and therefore the lexicons must be designed in such a way that they are learnable. There has been extensive work in psycholinguistics and cognitive science on what makes lexical items hard or easy to process (Baayen, Burani, and Schreuder, 1997; Baayen, Milin, and Ramscar, 2016; Frauenfelder, Baayen, and Hellwig, 1993; Vitevitch and Luce, 1998), and in developmental psychology on what makes words easier or harder to learn (Braginsky, Yurovsky, Marchman, and Frank, 2019; Dautriche, Swingley, and Christophe, 2015; Gentner, 1982; Gleitman, Cassidy, Nappa, Papafragou,
and Trueswell, 2005; Goodman, Dale, and Li, 2008; Hills, Maouene, Maouene, Sheya, and Smith, 2009; Perry, Perlman, and Lupyan, 2015; Roy, Frank, DeCamp, Miller, and Roy, 2015; Stokes, 2010; Storkel, 2004; Swingley and Humphrey, 2018). We can use this body of knowledge to further understand features of the lexicon and how they emerge. In the rest of this chapter, we will treat the messy process of human language use as an expert designer and explore four major lexical design decisions that human languages must make: the structure of word frequency distributions, the relationship between word frequency and word form, the degree of lexical arbitrariness, and how to structure lexicons for child language learning. We will see how natural languages navigate these design questions and, in so doing, help us better understand aspects of communication and language processing.
10.2 How frequent should words in the lexicon be?
In considering the large-scale structure of the lexicon from an efficient design perspective, among the most important considerations are how many words there should be, how frequency should be distributed across words, and how words should be used to carve up the semantic space. Here, we will explore these questions from two perspectives: first by considering the large-scale organization of the lexicon at a macro level and then by examining how this plays out in the efficient structure of more specific semantic spaces, such as color words, kinship systems, and human names. We argue that the latter can be illustrative of the large-scale patterns observed in the former.
10.2.1 Large-scale structure of frequency in the lexicon
(This section draws on the discussion in Piantadosi (2014), which treats these issues in more detail.)
If a lexicon were being designed for efficient communication, would it be better to structure it such that all words are used roughly equally often? Or should a small number of words be used far more often than the rest? Natural languages all seem to answer this question in the same way: the most frequent words are used much more often than the less frequent words. Specifically, they all follow roughly the same Zipfian rank-frequency distribution (see Figure 10.1 for a rank-frequency plot based on the SUBTLEX corpus for English, where rank and frequency are estimated independently as in Piantadosi, 2014). Zipf (1935, 1949) famously observed that the distribution of words in a language obeys a power law, such that the most frequent word appears in a corpus about twice as often as the second most frequent word, about three times as often as the third most frequent word, and so on.
[Figure 10.1: hexbin density plot of log frequency (y-axis) against log rank (x-axis); see caption below.]
Figure 10.1 Following Piantadosi (2014), we independently fit the rank and frequency distributions (here using the SUBTLEX corpus of movie subtitles in English) and plot the resultant density, where darker hexagons indicate greater density. The black line shows the line of best fit for a Generalized Additive Model (GAM) fit to the data.
Another framing of this observation is that the log rank of a word’s frequency and its log frequency are linearly related. Mandelbrot (1953) modified this proposal to slightly rank-shift the distribution for a better empirical fit and so gave perhaps the best known quantitative account of Zipf ’s law:
f(r) \propto \frac{1}{(r + \beta)^{\alpha}} \qquad (1)
where α is approximately 1 and β is around 2.7. Empirical studies, across languages, show that the actual frequency distribution of words in a language is somewhat more complex than a simple power law, but that the broad strokes of Zipf ’s observation are robust. Baayen (2001) explores a variety of proposed distributions (e.g., log-normal, generalized inverse Gauss-Poisson, generalized Z-distribution) and shows that no proposed distribution perfectly captures the empirical pattern of results across all corpora and that some corpora seem to fit better with some distributions than others. Yu, Xu, and Liu (2018) show that, across 50 languages, the rank-frequency distribution can be divided into three segments of low, medium, and high frequency.
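To make Eq. (1) concrete, the sketch below—our own illustration, not code from any of the studies cited—samples a synthetic corpus from a Zipf–Mandelbrot distribution using the parameter values mentioned above (α ≈ 1, β ≈ 2.7) and checks that log frequency falls off roughly linearly with log rank, as in Figure 10.1. The vocabulary size, corpus size, and random seed are arbitrary choices.

```python
# Minimal sketch (not from the chapter): sample a corpus from the
# Zipf-Mandelbrot law in Eq. (1), f(r) ∝ 1/(r + beta)**alpha, and check
# that the recovered rank-frequency curve is near-linear in log-log space.
import math
import random
from collections import Counter

random.seed(0)
V, N = 5_000, 200_000          # vocabulary size, number of tokens (illustrative)
alpha, beta = 1.0, 2.7         # parameter values cited in the text

# Unnormalized Zipf-Mandelbrot probabilities for ranks 1..V
weights = [1.0 / (r + beta) ** alpha for r in range(1, V + 1)]
types = [f"w{r}" for r in range(1, V + 1)]

# Sample a "corpus" and count token frequencies
corpus = random.choices(types, weights=weights, k=N)
counts = Counter(corpus)

# Empirical rank-frequency relationship (rank by observed frequency)
freqs = sorted(counts.values(), reverse=True)
for rank in (1, 2, 3, 10, 100, 1000):
    if rank <= len(freqs):
        print(f"rank {rank:>5}: freq {freqs[rank - 1]:>7}")

# Slope of log frequency on log rank; it should be in the vicinity of -alpha,
# with deviations at the head (because of beta) and tail (finite sample)
xs = [math.log(r + 1) for r in range(len(freqs))]
ys = [math.log(f) for f in freqs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(f"log-log slope ≈ {slope:.2f}")
```

On real corpora, the same rank-frequency computation underlies plots like Figure 10.1, with the caveat noted above that rank and frequency should be estimated independently (as in Piantadosi, 2014) to avoid a spurious correlation.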
The emergence of Zipfian (or near-Zipfian) distributions is not limited to just large-scale lexical features. Piantadosi (2014) shows that near-Zipfian distributions emerge within a wide variety of semantic classes, such as for number words and taboo words. Moreover, in the absence of any semantic information at all, near-Zipfian distributions emerge: Piantadosi asked participants to generate novel stories using nonce words and found that even these nonce words follow a near-Zipfian distribution. In addition to discussion of the precise mathematical form of these distributions, there has been some debate as to what gives rise to these distributions of words across lexicons and whether they are particularly meaningful. The situation is complicated by a few factors. First, it can be difficult to distinguish power law behavior from other long-tailed distributions, like the various flavors of exponential distributions (Clauset, Shalizi, and Newman, 2009). Second, as we discuss below, there are many proposed mechanisms that can give rise to the observed distributions, and it is less clear which mechanisms actually cause these patterns to emerge in natural language. Finally, the unigram distribution of words is not easy to model (see Nikkarinen, Pimentel, Blasi, and Cotterell, 2021 for such a model, drawing on a model from Goldwater, Griffiths, and Johnson, 2011). To that end, there remain some puzzles as to why all natural languages show the Zipfian frequency distribution. Miller (1957) and Conrad and Mitzenmacher (2004), among others, have argued that Zipfian distributions can arise as a result of random natural processes. For instance, if one imagines a bunch of monkeys randomly typing on a keyboard (or simulates this by generating random letters), a Zipfian frequency distribution will be observed. But random-typing models are bad scientific models of how languages could actually give rise to Zipfian distributions since, unlike in random-typing models, people store and re-use words as units (Ferrer-i-Cancho and Elvevåg, 2010; Piantadosi, 2014). The scientific fact that needs to be accounted for is not that certain words occur disproportionately often by random drift but that, in the presence of rich semantic information, words are stored and re-used in a way that generates the characteristic distribution. Furthermore, in addition to not being a realistic model of word generation, random-typing models do not fully capture the distributions observed (Ferrer-i-Cancho and Elvevåg, 2010). So what does give rise to Zipfian frequency distributions in natural language? Could it just arise from the nature of the world? That is, is there something about the distribution of objects in the world that causes humans to want to talk in such a way that gives rise to a Zipfian distribution (Ferrer-i-Cancho and Solé, 2003; Manin, 2008)? For instance, even naturally constrained kinds, like the planets and the elements, seem to follow a Zipfian frequency distribution when counted up in an English language corpus (Piantadosi, 2014). But Piantadosi (2014) argues that this cannot be the whole picture since, even when people are asked to tell a story using made-up words (e.g., Blicket, Fark, etc.), the distribution of those nonce words will be Zipfian. Simon (1955), among others, proposes stochastic accounts, which claim that Zipfian distributions arise from preferential re-use. If you use a word in some text, you are more likely to use it again.
This "rich get richer" scheme is known to give rise to power law distributions in a wide variety of contexts. Ferrer-i-Cancho and others have proposed communicative accounts (Ferrer-i-Cancho, 2005; Ferrer-i-Cancho and Solé, 2003), suggesting that Zipfian distributions enable more efficient communication and can arise from pressures toward optimal communication and a minimization of effort by the speaker and listener (Ferrer-i-Cancho, 2016). Manin (2008) also argued that speaker cost is minimized when a Zipfian distribution is used. And a Zipfian distribution benefits word segmentation, of the sort required for learning language (Kurumada, Meylan, and Frank, 2013). Pagel, Beaumont, Meade, Verkerk, and Calude (2019) used data from surveys to suggest that humans have a preference to converge on the same word for a given meaning by preferring words that are most commonly used by others. Given the vast array of competing explanations offered, we will not settle the question here of why natural languages, at various scales, converge on similar Zipfian distributions of word frequencies. But, in the next section, we will explore how individual semantic spaces are carved up in efficient ways that strongly suggest a communicative aspect to the structure of lexical space.
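Before moving on, the preferential re-use ("rich get richer") account sketched above can be made concrete with a small simulation in the spirit of Simon (1955). This is our own toy illustration, not a model from the works cited; the innovation probability and corpus size are arbitrary.

```python
# Illustrative simulation (ours): a Simon-style "rich get richer" process in
# which each new token is either a brand-new word (with probability p_new) or
# a re-use of a previously produced token, sampled in proportion to how often
# each word has already been used.
import random
from collections import Counter

random.seed(0)
p_new = 0.05              # probability of coining a new word
n_tokens = 100_000

history = []              # all tokens produced so far
counts = Counter()
next_id = 0

for _ in range(n_tokens):
    if not history or random.random() < p_new:
        word = f"word{next_id}"        # coin a new word
        next_id += 1
    else:
        word = random.choice(history)  # re-use: probability ∝ past frequency
    history.append(word)
    counts[word] += 1

freqs = sorted(counts.values(), reverse=True)
print(f"{len(counts)} distinct words; top frequencies: {freqs[:5]}")
# The resulting rank-frequency curve is heavy-tailed and near-Zipfian:
# doubling the rank roughly halves the frequency near the top of the list.
for rank in (1, 2, 4, 8, 16):
    print(f"rank {rank:>2}: {freqs[rank - 1]}")
```

Note that, unlike random-typing models, this process stores and re-uses words as units, which is exactly the property identified above as essential to any realistic account of how the characteristic distribution emerges.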
10.2.2 Distribution of words in more specialized semantic spaces To get a better sense of how principles of communicative efficiency structure the distribution of words in the semantic space, it is helpful to consider a line of work exploring how semantic spaces are broken into meaningful linguistic units. In the general case, where the lexicon is studied as a whole, it is difficult to identify independent communicative goals about how often certain meanings should be conveyed. But it becomes more straightforward in these restricted domains, which has led to fruitful work on how communicative efficiency leads to lexical structure within semantic spaces, such as color words, kinship terms, and naming systems. Consider the color system. Languages vary widely in how many color words they have, that is, in how they divide the visual color space into discrete semantic units. What would an optimal lexicon do? Would it have 11 basic color words, like in English, or just five color words that everyone knows, as in Berinmo (Davidoff, Davies, and Roberson, 1999)? One answer is that it depends on the communicative needs of the language users, an insight that can be formalized using information theory (Gibson, Futrell, Jara-Ettinger et al., 2017; Regier, Kemp, and Kay, 2015). Imagine playing the following game. You and a partner are given a set of 100 colored boxes, such that each box is a different color and the colors of the boxes are evenly spread throughout the color space visible to the human visual system. You are secretly shown that one randomly selected box contains a prize, but your partner does not know which one it is. Your goal is to get your partner to correctly identify the prize-containing box, but you can say only one color word. Your partner guesses and is told whether they are right. If they are wrong, they guess again.
Imagine that three of the boxes are what you would call "red" but 20 of the boxes are what you would call "blue." First, imagine that the box is red and you say "red." If your partner guesses randomly among the red boxes, on their first guess they have a decent chance (1/3) of getting it right. But if it's a blue box and you say "blue," they have a 1/20 chance. Therefore, the task is easier in this scenario when the box in question is red. On average, the number of guesses it would take your partner to find the correct box is related to the amount of information in your language's color system. If you had a maximally informative color system such that you had a unique color word for each of the 100 boxes, you could communicate the key box to your partner in just one round. But this comes at a cost, in that it requires a significant amount of effort for a society to learn and maintain so many color words. At the other extreme, a color system that has only two words would be easy to learn. But it would require many more guesses for your partner to find the right box. Researchers have run a task not unlike the game described (where speakers are asked to name the colors of each of the chips on a Munsell grid—a grid of colors organized to be roughly equally spaced in human perceptual space) and analyzed the data in an information-theoretic framework in order to measure the information in a color system (Gibson et al., 2017; Lindsey, Brown, Brainard, and Apicella, 2015; Regier et al., 2015). The information in a color system is given by the surprisal of a particular color (here defined as a particular color chip), averaged over all color chips. To compute the surprisal of a particular chip, one can estimate the negative log of the probability of a chip given a color word, weighted by the probability of a word given a chip. In the equation below, we compute the surprisal of a single chip S(c) by summing up the weighted log probability of the chip c given a word w, where the weighting is over the probability of use of a word w given a chip c.
S(c) = -\sum_{w} P(w \mid c) \log P(c \mid w) \qquad (2)
Surprisal, averaged over chips weighted by the probability of each chip, gives the information in the color system, as in Eq. 3.
\sum_{c} P(c)\, S(c) \qquad (3)
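As a worked example of Eqs. (2) and (3), the following sketch—ours, using an invented naming matrix rather than World Color Survey data—estimates P(w|c) and P(c|w) from counts of which word speakers used for each chip (assuming, as a simplification, a uniform need probability over chips) and computes the surprisal of each chip and the average surprisal of the system.

```python
# Toy illustration of Eqs. (2)-(3) (our own example; the counts are invented).
# Rows are color chips, columns are words; cell [c][w] is how many speakers
# used word w to name chip c.
import math

words = ["red", "blue"]
chips = ["c1", "c2", "c3"]          # imagine c1 reddish, c2/c3 bluish
naming_counts = {
    "c1": {"red": 9, "blue": 1},
    "c2": {"red": 2, "blue": 8},
    "c3": {"red": 1, "blue": 9},
}

def p_word_given_chip(c, w):
    """P(w | c): how likely each word is as a name for chip c."""
    total = sum(naming_counts[c].values())
    return naming_counts[c][w] / total

def p_chip_given_word(c, w):
    """P(c | w) by Bayes' rule, assuming a uniform need probability over chips."""
    col = sum(p_word_given_chip(ci, w) for ci in chips)
    return p_word_given_chip(c, w) / col

def surprisal(c):
    """Eq. (2): surprisal of a chip under the naming system."""
    return -sum(p_word_given_chip(c, w) * math.log2(p_chip_given_word(c, w))
                for w in words if naming_counts[c][w] > 0)

# Eq. (3): average surprisal, weighting chips uniformly
p_chip = 1.0 / len(chips)
info = sum(p_chip * surprisal(c) for c in chips)

for c in chips:
    print(f"S({c}) = {surprisal(c):.2f} bits")
print(f"average surprisal of the system = {info:.2f} bits")
```

A system whose words pick out chips more precisely yields a lower average surprisal; in the terms used above, such a system carries more information (or less entropy).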
Following this procedure and using World Color Survey data, which includes color naming data from a wide variety of world cultures, it has been observed that more industrialized cultures have more information in the color system. Gibson et al. (2017), for instance, uses the Tsimane’ culture, a relatively isolated Amazon group, as a test case alongside English and Spanish. The result of this analysis, when applied to data from English, Spanish, and Tsimane’, is that the English and Spanish color systems have significantly more information (or less entropy) than the Tsimane’ color system. The
Efficient Communication and The Organization of The Lexicon 207 hypothesized reason for this asymmetry is that the Tsimane’ have less need for a highly informative color system since they are less likely to differentiate objects based on color (Gibson et al., 2017). There is also evidence that the distribution of information within the color word system is meaningful. Specifically, salient foreground objects are more likely to have warm colors (reds, yellows) whereas backgrounds are more likely to have cool colors (blues, greens). People are more likely to talk about foreground objects, so an efficient color system should have more information in warm colors. And indeed, across languages, there is more information in the warm color space (reds, yellows, browns) than in the cool color spaces (blues, greens). Both of these observations—that industrialized cultures have more information in the color system and that there tends to be more information in warm colors than cool colors—give evidence for the idea that color words are adapted to the needs of the speakers. If there is a need for more information in the color system, there are likely to be more color words, and those color words are likely to have boundaries that are more widely agreed on by speakers. As it is, languages seem to trade off optimally between the complexity of the system and its informativity (Zaslavsky, Kemp, Tishby, and Regier, 2019; Zaslavsky, Regier, Tishby, and Kemp, 2019). Moreover, while the above analysis relates to color words (part of our lexicon for describing visual experiences), Winter, Perlman, and Majid (2018) show that, at a higher level, the lexicon is optimally distributed such that the space of visual words is overrepresented in English. That is, compared to the number of words in the language for describing auditory or olfactory experience, English has a greater number of words for describing visual experience and those words have higher token frequency. This follows from the fact that we more often want to speak about visual experience than about other sensory domains. Similar ideas have been applied to other semantic domains, outside the senses. For instance, Kemp and Regier (2012) showed that kinship terms are structured along an efficient frontier. Some languages have more kinship terms than others do. For instance, Old English had distinct words for paternal uncles and aunts (as distinguished from maternal uncles and aunts)—a distinction lost in modern English. Imagine the same sort of game described above for color words but where, instead of hiding a prize in one of the colored boxes, I hide the prize with a random relative and am allowed to give a kinship term (i.e., aunt, cousin, grandmother, etc.) as a clue. For instance, if you have three aunts and I give you the clue “aunt,” you now have a 1/3 chance of guessing which family member has the prize. A language with more kinship terms will make it easier for you to guess who has the prize. That is, if you have one paternal aunt and two maternal aunts and we speak a language with separate terms for those, I could instead give the clue “paternal aunt” and you would know exactly who has the prize. The trade off is complexity: it’s more complex to have to learn more kinship terms. 
Crucially, from an information-theoretic perspective, there do not appear to be languages that have extra complexity without a corresponding increase in informativity or vice versa (Kemp and Regier, 2012), which means that languages seem to exist approximately along a Pareto-optimal frontier
(meaning that you cannot improve informativity without increasing complexity, and you cannot reduce complexity without decreasing informativity). This approach has been applied in a number of diverse domains, from numeral systems (Xu, Liu, and Regier, 2020) to animals (Zaslavsky, Regier, Tishby, and Kemp, 2020). It even appears that, according to an analysis that combines climate data and linguistic data, speakers develop optimal season naming conventions for their local environment in order to trade off the informativity of season words (e.g., given that it is "summer," how much does that tell you about today's temperature and expected rainfall) against the season system's overall complexity (Kemp, Gaby, and Regier, 2019). More recently, the approach has also been fruitfully applied to grammatical systems, including indefinite pronouns (Denić, Steinert-Threlkeld, and Szymanik, 2021), tense systems (Mollica et al., 2020), quantifiers (Steinert-Threlkeld, 2020), and person systems (Maldonado, Zaslavsky, and Culbertson, 2021). Applying information-theoretic ideas to human names, a system that (unlike color words and kinship terms) does not appear to be semantically constrained, Ramscar (2019) found that the structure of human naming systems shows patterns consistent with predictions from communication theory. In particular, in a society with relatively few people (like many small-group societies before the Industrial Revolution), people use only first names and these first names are sufficient for picking out individuals in a population. But, as populations grow, cultures have independently converged on compound naming systems. This body of work, as a whole, suggests that individual semantic spaces are largely efficiently structured along an optimal frontier (Kemp, Xu, and Regier, 2018; Zaslavsky, Regier et al., 2019). It is not a coincidence that the domains chosen (color words, kinship terms, names) are ones in which it is particularly easy to quantify the space of real-world referents. That is, work in visual perception has made it relatively straightforward to categorize the human color space and words that map onto it. Kinship terms correspond to a set of real-world relations that are relatively easy to compute. Formulating these sorts of analyses for other, less mathematically precise domains is more difficult. But the work, taken as a whole, is suggestive that we should expect the lexicon to display efficient structure across domains.
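To see how informativity and complexity trade off in a case like the kinship example above, here is a minimal sketch (our own construction, not the Kemp and Regier, 2012 analysis) comparing a system with a single word for all aunts to one that distinguishes paternal from maternal aunts, using expected surprisal of the intended relative as the cost and the number of terms as a crude complexity measure.

```python
# Toy illustration (ours) of the informativity/complexity trade-off for
# kinship terms. We compare two hypothetical systems over the same three
# relatives: one paternal aunt and two maternal aunts, each equally likely
# to be the intended referent.
import math

relatives = ["paternal_aunt", "maternal_aunt_1", "maternal_aunt_2"]

# System A: one word covers all three relatives
system_a = {"aunt": relatives}

# System B: separate terms for paternal and maternal aunts
system_b = {"paternal_aunt_term": ["paternal_aunt"],
            "maternal_aunt_term": ["maternal_aunt_1", "maternal_aunt_2"]}

def communicative_cost(system):
    """Expected surprisal of the intended relative given the best term,
    assuming relatives are equally likely to be talked about and the
    listener guesses uniformly among relatives covered by the term."""
    cost = 0.0
    for r in relatives:
        term_extension = next(ext for ext in system.values() if r in ext)
        cost += (1 / len(relatives)) * -math.log2(1 / len(term_extension))
    return cost

def complexity(system):
    return len(system)  # crude proxy: number of distinct terms to learn

for name, system in [("one 'aunt' word", system_a),
                     ("paternal/maternal split", system_b)]:
    print(f"{name}: cost = {communicative_cost(system):.2f} bits, "
          f"complexity = {complexity(system)} terms")
# The split system is more informative (lower cost) but more complex; the
# cross-linguistic claim is that attested systems sit near the optimal
# frontier of this trade-off.
```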
10.3 Which wordforms should be frequent?
In the previous section, we saw that natural languages settle on a lexical distribution such that the most frequent words are much more frequent than less frequent words, and that individual semantic spaces are largely efficiently structured. But we have not yet considered the question of which strings should be assigned to which meanings. Would the English lexicon be just as effective if the word that
corresponds to our concept of red were called "cerulean" and the concept of cerulean were called "red"? Or would it be onerous to have to say a long word like "cerulean" to refer to a frequently used concept, like red? In this section, we will focus on the relationship between frequency and other lexical properties to argue that the answer is no: reshuffling the mapping between color names and their referents (or between any set of words and their referents) would lead to sub-optimal design because the lexicon is already structured such that frequent words have a particular set of desirable properties.
10.3.1 Word length and word frequency
In addition to observing the power law distribution for word frequencies, Zipf also observed that the most frequent words tend to be short. The reason for the correlation between length and frequency is seemingly obvious: it would be onerous if you had to say something like "thessaloniki" every time you wanted to use the article "the" or if you had to say "antediluvian" every time you wanted to use the article "an." Zipf called this the Principle of Least Effort (Zipf, 1949), and the relationship between length and frequency in language is often referred to as Zipf's Law of Abbreviation. The actual distribution of word lengths in languages is not as straightforward to characterize as the distribution of word frequencies. Sigurd, Eeg-Olofsson, and van de Weijer (2004) showed that, in Swedish and English, a gamma distribution can be used to characterize the word length distributions, with a peak of tokens occurring at length 3. Extending the observation to more languages, Bentz and Ferrer-i-Cancho (2016) showed that the relationship between frequency and length holds across a wide variety of languages and linguistic contexts. Pimentel, Nikkarinen, Mahowald, Cotterell, and Blasi (2021) found that, when accounting for phonotactics and morphological composition and using a unigram model to model the distribution of words, lexicons across a diverse set of languages were shorter than would be expected by chance—but not maximally compressed. For a summary of the distribution of word lengths across languages and models used to explain them, see Grzybek (2015). But using frequency alone to predict word length ignores the important contribution of context. In fact, information theory tells us that to construct an optimal code (one that is as short as possible while still being robust), word lengths should track not frequency but predictability in context: the less predictable a word is, the longer it should be. Let's go back to the color game we described in the previous section, where you have a set of colored boxes and a prize hidden randomly inside one of them. You know the location of the prize, but your partner does not. You can use one color word to try to get your partner to identify the correct box, but you lose points for every character you use. If your lexicon is well designed, the boxes with colors that you are more likely to discuss will have shorter color names. But there is a further prediction that can be made using information theory. With our colored boxes, imagine that the color word you are transmitting to your partner is transmitted over a noisy channel, but that you get help from the context. Namely, your partner knows that if the prize is in a red box, all the red boxes will light up. In this
210 K. MAHOWALD, I. DAUTRICHE, M. BRAGINSKY, AND E. GIBSON case, you might wonder if you even need to say the word red since context (the light) has already done all the work. The same might hold for language: if context is highly informative, the next word can be short or even sometimes elided. If the preceding context is not informative or the upcoming word is surprising, you might want it to be long. Consistent with this account, it has been repeatedly observed that people will take longer to pronounce a word that is difficult or not predictable from context (Arnon and Cohen Priva, 2013; Aylett and Turk, 2004; Bell, Jurafsky, Fosler-Lussier et al., 2003; Fox Tree and Clark, 1997; Watson, Buxó-Lugo, and Simmons, 2015). Piantadosi, Tily, and Gibson (2011) used this observation to test the idea that a measure of predictability that uses local n-gram context could better explain word lengths than frequency alone. The n-gram context, in this case, means that predictability is computed for a word wn by using the n−1 words that precede it. Using Google n-gram corpora from ten languages, they use the local context to estimate the average predictability of words in context and find that, in general, there is a higher correlation between length and average surprisal (the predictability of a word in context) than between length and frequency. Indeed, the information-theoretic underpinnings of Zipf ’s Law of Abbreviation may follow from very general cognitive principles toward compression and simplicity (Chater and Vitányi, 2003). For instance, see Ferrer-i-Cancho, Betnz, and Seguin (2020) and Ferrer-i-Cancho et al. (2013) for an information-theoretic explanation of Zipf ’s Law of Abbreviation, which covers both human language and animal communication systems (such as calls from dolphins, macaques, and other animals). They show that the drive toward compression seems to be a general property of such systems, and that it can be derived by a pressure toward compression in the signal. As with the Zipfian rank-frequency distribution, there has been some question whether the observed length/frequency correlation effect is a statistical artifact. Miller (1957) argued that monkeys typing on a keyboard (with a space bar) would produce text where the shorter words are typically more frequent than longer words. The intuition is that a string like “fep” is far more likely to be resampled than a string like “fepalopolis.” This line of criticism, namely that the correlation between word length and frequency/ informativity may be spurious because it can be generated by chance, has been made in various forms since. Ferrer-i-Cancho and del Prado Martin (2011) argued that the word length and frequency/informativity effect can arise as a result of random typing. But the random-typing models rely on the idea that all meanings are equally likely to be conveyed, which is not true. Furthermore, the random-typing model does not capture the fact that words are stored and re-used (Ferrer-i-Cancho and Elvevåg, 2010; Kanwal, Smith, Culbertson, and Kirby, 2017; Piantadosi et al., 2011). See also Richie (2016) for a discussion of these issues. Caplan, Kodner, and Yang (2020) have challenged the claim that lexicons evolve efficiently by using a more sophisticated, updated model of Miller’s random-typing monkeys that account for phonotactics and, to some extent, semantics. Specifically, they argue against the claim that there is special communicative optimization in the fact that
more frequent and phonotactically probable words tend to have more homophones and meanings, as argued by Piantadosi, Tily, and Gibson (2012). Instead, they claim that, under certain phonotactic assumptions, random models generate lexicons in which common and probable words have more homophones and meanings. Similarly, Trott and Bergen (2020) find that real lexicons have fewer homophones than one might expect by chance. These criticisms of the efficient structure of the lexicon from the random-typing perspective have, in part, led to work exploring the mechanisms by which optimal word length distributions actually emerge in the lexicon. Mahowald, Fedorenko, Piantadosi, and Gibson (2013), for instance, explicitly tested the idea that words shorten after informative contexts. To do so, they selected words that had been shortened in the extant lexicon, such as chimp from chimpanzee. Using the Google Books corpus, they found that these shortenings are more likely to occur after predictive n-gram contexts. They also ran a behavioral experiment in which participants were asked to choose a short or long form of a word (chimp or chimpanzee) after either a supportive context ("Susan loves the apes at the zoo, and she even has a favorite...") or a neutral context ("During a game of charades, Susan was too embarrassed to act like a..."). People were more likely to choose a short form after a predictive context, as shown in Figure 10.2. Kanwal et al. (2017) pursued this line of inquiry by running artificial language experiments. They showed that, if people are pressured to be both accurate and efficient
[Figure 10.2: difference between long-form and short-form surprisal (y-axis) plotted against log combined corpus frequency (x-axis); see caption below.]
Figure 10.2 The y-axis shows the difference between the log surprisal of the long form and the log surprisal of the short form (estimated from Google Books) for the short word forms shown on the plot. Words above the line show the predicted effect whereby the long-form surprisal is greater than the short-form surprisal. Adapted from Mahowald et al. (2013).
212 K. MAHOWALD, I. DAUTRICHE, M. BRAGINSKY, AND E. GIBSON (but not if they are pressured to be only one or the other), the length/frequency effect will emerge spontaneously during the task. Interestingly, when there was no penalty for inefficient speaking, the effect did not emerge. Similarly, Chaabouni, Kharitonov, Dupoux, and Baroni (2019) found that, in an emergent language simulation using artificial neural agents, there was actually an anti-efficient relationship between length and frequency unless there was an explicit penalty for long utterances. In effect, without that penalty, the system optimized itself to ease the confusability burden on the listener. Indeed, besides frequency and informativity, other work has shown that word lengths (and people’s expectations about them) vary in predictable ways based on factors directly related to a word’s meaning, such that communicative cost is higher when a more complicated or weighty meaning is needed. Lewis and Frank (2016) showed that the complexity of an object is related to its length, and Bennett and Goodman (2018) showed that more extreme intensifiers tend to be longer (and therefore costlier) than less extreme intensifiers. In that sense, a speaker has to pay a cost in order to use a more extreme intensifier. Furthermore, Tauzin (2019) showed that people’s expectations about the frequency of reference objects influences how long they expect words to be. These studies, which rely on semantic representations being linked to expectations about word length, are difficult to account for under the random-typing models. Under the random-typing models, there is no notion of a word that can be stored and re- used and so there is no way for semantic effects to act systematically on word length. Therefore, taken together, all of the results that show systematic relationships between word length and frequency argue against the random typing account for the relationship between word length and word frequency. Rather, the length of words is tied to a word’s semantics and to the real-world meanings it conveys.
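The contrast between frequency-based and context-based accounts of word length can be made concrete with a small sketch of the quantities involved. The code below is our own illustration, not the original analysis: it computes each word's unigram frequency and its average surprisal under an add-one-smoothed bigram model on a stand-in corpus; Piantadosi, Tily, and Gibson (2011) computed analogous quantities from Google n-gram corpora and correlated them with word length across entire lexicons.

```python
# Sketch (ours) of the quantities behind Piantadosi, Tily, and Gibson (2011):
# for each word, estimate (i) its unigram frequency and (ii) its average
# surprisal given the preceding word under a bigram model. The tiny "corpus"
# below is just a stand-in to show the computation.
import math
from collections import Counter

corpus = ("we saw a tiny chimpanzee at the zoo and the chimpanzee saw us "
          "then we saw a gorilla and the gorilla ignored the chimpanzee").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)
N = len(corpus)

def avg_surprisal(word):
    """Token-weighted mean of -log2 P(word | previous word), add-one smoothed."""
    total, n = 0.0, 0
    for (prev, w), c in bigrams.items():
        if w == word:
            p = (c + 1) / (unigrams[prev] + V)
            total += c * -math.log2(p)
            n += c
    return total / n if n else float("nan")

print(f"{'word':<12}{'len':>4}{'-log2 p(word)':>15}{'avg surprisal':>15}")
for w in sorted(unigrams, key=lambda w: -unigrams[w]):
    print(f"{w:<12}{len(w):>4}"
          f"{-math.log2(unigrams[w] / N):>15.2f}{avg_surprisal(w):>15.2f}")
```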
10.3.2 Phonotactics and word frequency
Another degree of freedom lexicons have regarding word frequency is how to distribute the phonological characteristics of words across the frequency spectrum. As with many aspects of the functional lexicon, Zipf had something to say on this topic. Just as he said that frequent words should be short, he also stated that languages should preferentially re-use easy-to-articulate sounds. To take this one step further, we might expect that spoken languages should prefer phonotactically probable words, that is, sequences that are likely under that particular language's sound structure. In English, a word like cat is phonotactically quite probable whereas a word like tzatziki, with its unusual onset, is less probable. How does this prediction play out in lexical structure? Landauer and Streeter (1973) showed a correlation between phonotactic probability and frequency for English, and Frauenfelder et al. (1993) found a similar result for English and Dutch. Across 96 written languages tested using corpora from Wikipedia entries in those languages, the most frequent words tend to be more orthographically probable than less frequent words, where orthographic probability is measured using an n-letter model trained over types
Efficient Communication and The Organization of The Lexicon 213 and where orthographic probability is taken as a stand-in for phonotactic probability (Mahowald, Dautriche, Gibson, and Piantadosi, 2018). Meylan and Griffiths (2017) made a similar argument across 13 languages, showing that phonological surprisal is correlated with frequency. There are a few possible reasons for this pattern to emerge across languages. First, consistent with Zipf ’s Principle of Least Effort, it is preferable to be able to re-use phonotactically probable sequences, which have been shown to be easier to process and understand (Vitevitch and Luce, 1998; Vitevitch and Sommers, 2003). Second, just as it has been argued that long words are preferable in unpredictable (i.e., high information) contexts because they are more robust to being understood, so too might it be the case that it is preferable for less frequent words to have more information in their phonetic content. Arguing along these lines, King and Wedel (2020) show that words which are less probable (i.e., less frequent) contain more disambiguating sounds in general (i.e., are less phonotactically probable). Because word beginnings are particularly important for disambiguation, they also explore how this plays out over the course of words and find that information is frontloaded: the beginnings of improbable words contain more disambiguating information early on. There is also an intriguing relationship between phonemic information and word length. Words with more contrastive segments tend to be shorter (Nettle, 1995), and languages with more complex syllables have fewer syllables and more polysemy (Fenk- Oczlon and Fenk, 2008). Adopting an information-theoretic approach, Pimentel, Roark, and Cotterell (2020) use a phonotactic model to quantify the bits per phoneme across languages and find a negative correlation with word length, suggesting a complexity tradeoff such that long words have, on average, less information per phoneme. The implication is that there may be a relatively constant amount of information delivered per word. While there are compelling information-theoretic reasons for a lexicon to prefer a structure with highly frequent words that are also highly phonotactically probable, this necessarily leads to words that are closer together in phonetic space. That is, words like run, fun, and sun are all minimal pairs and frequent, whereas less frequent words tend to be longer, more phonotactically unusual, and therefore have fewer neighbors (nothing rhymes with “abecedarian”). Because of the increased confusability of short words in dense phonetic neighborhoods, there is a competing communicative pressure that says that words should be spread out in phonetic space. If we attempt to control for these phonotactic effects, do words cluster in phonetic space? Dautriche et al. (2017) asked this question and showed that, relative to various statistical baselines, real lexicons are clumpy. Specifically, across lexicons from English, Dutch, German, and French, they use statistical models (such as n-gram models over phonemes) to construct “null lexicons,” which can be compared to real lexicons on various properties such as number of minimal pairs. Compared to these baselines, real lexicons have more neighbors and minimal pairs than one would expect by chance.
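Two of the tools used in the studies just described can be sketched in a few lines: scoring each word's orthographic probability with a character n-gram model trained over types, and sampling pseudo-words from the same model to build a "null lexicon" baseline. The code below is our own illustration with a toy word list; the cited analyses used full lexicons, phonemic transcriptions where available, and more careful modeling choices.

```python
# Illustration (ours) of (i) scoring words with a character bigram model
# trained over types, and (ii) sampling a "null lexicon" of pseudo-words from
# the same model as a baseline. The word list is a toy stand-in.
import math
import random
from collections import Counter

lexicon = ["cat", "bat", "hat", "run", "fun", "sun", "dog", "pond",
           "string", "crypt", "rhythm", "tzatziki"]

def pad(w):
    return "^" + w + "$"          # word-boundary symbols

bigram_counts = Counter(b for w in lexicon for b in zip(pad(w), pad(w)[1:]))
context_counts = Counter(c for (c, _) in bigram_counts.elements())
alphabet = sorted({ch for w in lexicon for ch in pad(w)})

def log_prob(word):
    """Per-character log2 probability under the add-one-smoothed bigram model."""
    lp = sum(math.log2((bigram_counts[b] + 1) / (context_counts[b[0]] + len(alphabet)))
             for b in zip(pad(word), pad(word)[1:]))
    return lp / (len(word) + 1)

for w in sorted(lexicon, key=log_prob, reverse=True):
    print(f"{w:<10}{log_prob(w):>7.2f}")
# Scores like these were correlated with word frequency across 96 languages
# in Mahowald et al. (2018); words sampled from such a model form "null
# lexicons" (Dautriche et al., 2017) against which the clumpiness of real
# lexicons (neighbors, minimal pairs) can be compared.

def sample_word(max_len=10):
    """Sample one pseudo-word from the bigram model (a null-lexicon item)."""
    out, prev = "", "^"
    chars = [c for c in alphabet if c != "^"]
    while len(out) < max_len:
        weights = [bigram_counts[(prev, c)] + 1 for c in chars]
        nxt = random.choices(chars, weights=weights)[0]
        if nxt == "$" and out:
            break
        if nxt != "$":
            out += nxt
            prev = nxt
    return out

random.seed(1)
print("null lexicon sample:", [sample_word() for _ in range(5)])
```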
214 K. MAHOWALD, I. DAUTRICHE, M. BRAGINSKY, AND E. GIBSON Lexical ambiguity can, in a sense, be thought of as an extreme form of lexical clustering. Just as languages tend to favor lexicons with many minimal pairs (“cat,” “bat,” “hat”), another ubiquitous feature of language is the re-use of not just a neighboring wordform for a new meaning but the very same wordform. That is, in English, “bat” can be the flying mammal or the piece of baseball equipment. Piantadosi et al. (2012) argued that, although this may appear confusing, the ambiguity of individual lexical items is typically not a problem in everyday contexts and that ambiguity is actually efficient. It enables the re-use of short, probable words. Moreover, frequent words tend to have more meanings (Ferrer-i-Cancho and Vitevitch, 2018), and those meanings tend to be cognitively related (Xu, Duong, Malt, Jiang, and Srinivasan, 2020). Trott and Bergen (2020), though, suggest that, when properly accounting for phonotactics, there are actually fewer homophones than one would expect. They argue that, in effect, homophones are avoided by instead populating neighboring words and that this may explain the prevalence of dense phonological neighborhoods. Overall, there is good evidence that suggests that potential confusability is not the predominant driving force in the lexicon. That is not to say, however, that there is no pressure in the lexicon to avoid confusion. Indeed, there is some evidence that, when the confusability of phonemes is considered, there are indeed lexical dispersion effects. For instance, whereas many large-scale lexical studies treat phonemes as equidistant, King (2018) presents a lexical analysis suggesting that, when phonetic confusability is considered, words are structured to avoid confusability. This line of evidence is broadly consistent with the observation that phonological systems are structured to minimize confusion (Boersma and Hamann, 2009; Flemming, 2004; Graff, 2012; Wedel, 2011), and that these effects can be exaggerated through speech (Buxó-Lugo, Jacobs, and Watson, 2020; Gahl, 2008; Gahl and Strand, 2016; Gahl, Yao, and Johnson, 2012; Jacobs, Yiu, Watson, and Dell, 2015; Lindblom, 1990; Meinhardt, Bakovic, and Bergen, 2020). The importance of these phonemic considerations suggests that different perceptual considerations may affect the frequency of words in speech as compared to writing (Lau, Huang, Ferreira, and Vul, 2019).
10.4 Which meanings should be assigned to which wordforms?
If you were designing a lexicon from scratch, you might consider structuring it so that the form of a word told you something about its meaning. For instance, all the words having to do with the concept of a cat could be similar to each other, or "cat-like" in some way. Instead, most lexicons have largely arbitrary relationships between word forms and word meanings. The sequence of sounds /kæt/, for instance, has nothing to do with the real-world entity cat. As a matter of fact, different languages pick different sequences of
sounds to refer to these small furry pets. This arbitrariness of the sign, where the form of the word is not systematically related to its meaning, is a well-established property of the world's lexicons (Hockett, 1960; Saussure, 1916). A closer look at languages, however, reveals plenty of exceptions to the arbitrariness of the sign. One often-cited form of non-arbitrariness is iconicity (Perniss, Thompson, and Vigliocco, 2010), whereby certain acoustic properties resemble aspects of meanings within spoken languages or certain signs reflect aspects of their meanings within signed languages (Orlansky and Bonvillian, 1984; Perlman, Little, Thompson, and Thompson, 2018; Perniss, Lu, Morgan, and Vigliocco, 2018; Slonimska et al., 2020; Wilcox, 2004). For instance, onomatopoeic words such as French "miaou" or English "mew" offer straightforward examples where languages imitate some aspects of the referent (here cat vocalizations) within the constraints imposed by their phonology. Another class of iconic words are ideophones, which recruit other aspects of the signal to imitate their meaning (Hinton, Nichols, and Ohala, 2006). For instance, contrasts in vowels have been shown to correspond to contrasts in magnitude (Dingemanse, 2012) and reduplicated words often convey plurality or repetition. It has also been shown that, cross-linguistically, people prefer to give rounded, smooth objects names with labial consonants and open vowels (e.g., "bouba"), whereas spiky, sharp objects are more likely to have sounds with closed vowels (e.g., "kiki"). This is the so-called "bouba-kiki" effect that has been observed many times in a variety of contexts (Bremner, Caparos, Davidoff et al., 2013; Köhler, 1970; A. Nielsen and Rendall, 2011) and in ways that extend to an array of phonetic properties (D'Onofrio, 2014), but in a way that is heavily influenced by the phonetic properties of the linguistic system (Shang and Styles, 2017). The effect has also been observed in children (Fort, Lammertink, Peperkamp et al., 2018; Maurer, Pathman, and Mondloch, 2006). And there seems to be some ability to generalize semantic features across languages (Tzeng, Nygaard, and Namy, 2017). Another form of non-arbitrariness is systematicity, which refers to statistical regularities between forms and their usage within a specific language. For instance, grammatical classes, such as nouns vs. verbs or open- vs. closed-class words, share certain phonological and prosodic properties (Kelly, 1992; Monaghan, Chater, and Christiansen, 2005) and phonesthemes, such as word-initial "fl-" in English (e.g., flap, fly, flutter, flicker), suggest a certain meaning (in this case, verbs related to movement; Bergen, 2004; Marchand, 1959). The availability of large spoken and written corpora, and advances in natural language processing techniques and statistical analyses, have enabled large-scale analyses of lexicons to quantify the degree of non-arbitrariness present across languages.
These studies reveal a systematic positive correlation between the phonological similarity of word forms (i.e., the number of phonemes they share) and the semantic similarity of their meanings (i.e., the distance between their vector-based representations) significantly above what would be expected under random form-meaning re-assignment, in a variety of typologically unrelated languages (Blasi, Wichmann, Hammarström, Stadler, and Christiansen, 2016; Dautriche, Mahowald, Gibson, Christophe, and Piantadosi, 2017; Monaghan, Shillcock, Christiansen,
and Kirby, 2014; Pimentel, McCarthy, Blasi, Roark, and Cotterell, 2019; Pimentel, Roark, Wichmann, Cotterell, and Blasi, 2021; Tamariz, 2008). Importantly, these correlations are not driven by regions of iconic words present in the vocabulary, but rather seem to be a feature of the lexicon as a whole, even when controlling for the etymology of words or the morphological structure of the lexicon (Monaghan et al., 2014). In sum, while the lexicon is indeed arbitrary in the sense that there is no straightforward mapping from form to meaning, there is still some substantial, albeit subtle, systematicity between word forms and their meanings. What are the possible advantages of a non-arbitrary lexicon? One of the tasks faced by children learning their language is to link the form of a word to one of the many plausible meanings available to them (Quine, 1960). A fully arbitrary lexicon may be challenging for learners, as they would need to put greater resources into learning the form-meaning mappings of their language than if they could simply deduce the meaning of a word from its phonology. Non-arbitrariness may thus convey several learning advantages to help learners bootstrap their way into language (Imai and Kita, 2014; Nielsen and Dingemanse, 2018). Several studies have shown that adults and children as young as 14 months find it easier to learn iconic form-meaning mappings than arbitrary mappings (Asano, Imai, Kita et al., 2015; Imai, Kita, Nagumo, and Okada, 2008; Imai, Miyazaki, Yeung et al., 2015; Maurer et al., 2006). The literature on iconicity in signed language acquisition in children is mixed, with some suggesting that iconicity is a major factor driving acquisition of signed languages and others suggesting it is not (Konstantareas, Oxman, and Webster, 1978; Orlansky and Bonvillian, 1984). Congruent with the idea that iconicity facilitates word learning, corpus analyses have demonstrated that the words that are acquired the earliest are the most iconic (Laing, 2014; Monaghan et al., 2014; Perry et al., 2015) and that adults use more iconic words when speaking to children than to other adults (Perry, Perlman, Winter, Massaro, and Lupyan, 2018). Another important challenge for children is to learn how to use words in sentences depending on their grammatical class. It has been shown that children learn nouns and verbs better if there is a systematic correspondence between their sounds and their grammatical classes (Fitneva, Christiansen, and Monaghan, 2009; Kelly, 1992; Nygaard, Cook, and Namy, 2009). Thus, systematicity in vocabulary may support categorical generalization to novel words. Similarly, one may wonder about the possible advantages of an arbitrary lexicon. Arbitrariness is advantageous for speakers as it allows them to communicate about anything, beyond what could be referred to iconically (Clark, 1998). It is also critical for communication as it minimizes confusion between words with similar meaning, which in a systematic system would be expressed in a similar way. This is supported by experimental studies showing that arbitrary form-meaning mappings facilitate the distinction between referents, which is hindered by systematic mappings (Monaghan, Christiansen, and Fitneva, 2011). Arbitrary and non-arbitrary form-meaning mappings each bring their own selective advantages and disadvantages. Over generations of learners, such advantages and disadvantages will shape the lexicon's structure, influencing the presence and the
distribution of (non-)arbitrary form-meaning mappings within and across languages. This process can be tested in the laboratory: Iterated learning provides a framework to study the emergence of a lexicon through this repeated cycle of learning and use. Iterated learning is the process by which individuals learn a language produced by a previous individual, who learned it in the same way (Kirby et al., 2008; Smith, Kirby, and Brighton, 2003); it can be simulated using computational models or experiments with human participants in the lab. These experiments show, among other things, that arbitrary signals can be turned into systematic ones after repeated episodes of language transmission (Kirby, Tamariz, Cornish, and Smith, 2015; Silvey, Kirby, and Smith, 2015; Verhoef, Roberts, and Dingemanse, 2015). The arbitrariness of the sign, previously proposed as a "design feature" of language (Hockett, 1960), is broadly correct in the sense that the vast majority of words do not have meanings that can be straightforwardly inferred from their forms. But it does seem to be an oversimplification in the light of recent quantitative research revealing a substantial and meaningful systematicity of the vocabulary across languages.
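The large-scale systematicity analyses described in this section share a common recipe: measure pairwise form distances and pairwise meaning distances, correlate them, and compare the observed correlation with what random form-meaning re-assignment would produce. The sketch below is our own illustration of that recipe on a deliberately systematic miniature lexicon with invented forms and two-dimensional "meaning vectors"; real analyses use phoneme-level distances and high-dimensional distributional vectors.

```python
# Sketch (ours) of a systematicity test: correlate pairwise form distances
# with pairwise meaning distances, then compare the observed correlation to
# a permutation baseline in which forms and meanings are randomly re-paired.
# Forms and meaning vectors are invented for illustration.
import itertools
import math
import random

meanings = {
    "mavi": (1.0, 0.1), "mavo": (0.9, 0.2), "mave": (0.8, 0.0),   # "animal-like"
    "kito": (0.1, 1.0), "kita": (0.0, 0.8), "kitu": (0.2, 0.9),   # "action-like"
}
forms = list(meanings)

def edit_distance(a, b):
    """Standard Levenshtein distance as a crude form-distance measure."""
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i, j in itertools.product(range(1, len(a) + 1), range(1, len(b) + 1)):
        d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[-1][-1]

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return 1 - dot / (math.hypot(*u) * math.hypot(*v))

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def systematicity(assignment):
    pairs = list(itertools.combinations(forms, 2))
    form_d = [edit_distance(a, b) for a, b in pairs]
    mean_d = [cosine_distance(assignment[a], assignment[b]) for a, b in pairs]
    return correlation(form_d, mean_d)

observed = systematicity(meanings)

# Permutation baseline: shuffle which meaning goes with which form
random.seed(0)
null = []
for _ in range(1000):
    shuffled = dict(zip(forms, random.sample(list(meanings.values()), len(forms))))
    null.append(systematicity(shuffled))

p = sum(r >= observed for r in null) / len(null)
print(f"observed form-meaning correlation: {observed:.2f}, permutation p ≈ {p:.2f}")
```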
10.5 How should the lexicon be designed to make it learnable by children?
In the preceding section, we discussed one aspect of the lexicon that helps children learn it, namely the presence of some degree of iconicity—or other forms of non-arbitrariness—in the form-meaning mapping. What other aspects of lexical design might be related to considerations of learnability? A basic constraint that infants bring to the language learning problem is the limited repertoire of sounds they can produce. Their lives would therefore be much more convenient if the words they needed to express earliest were easy to say. One way in which this desideratum does in fact appear to be borne out is that, across the world's languages, even entirely unrelated ones, child words for "mother" tend to resemble /mama/ and child words for "father" tend to resemble /papa/ or /baba/. The reason for this widespread phonological confluence, as formulated by Jakobson (1962), is thought to be that the sounds in these words—/m/, /p/, /b/, /a/—are the sounds that are easiest for infants to produce, and so are likely to make up the earliest word-like sequences that emerge from infant babbling. Parents then interpret these sequences as referring to themselves and encourage their repetition. Under this account, infants and parents effectively co-create the earliest slice of the lexicon to best facilitate infants' earliest communicative needs, thus shaping it to be near-universal cross-linguistically. Pimentel et al. (2021) find that such phonological/conceptual universals across languages, when controlling for non-independence, have significant but small effects.
Of course, beyond these earliest few words, the lexicons of the world's languages diverge dramatically. What properties might lexicons share that ensure their learnability? One way of gaining traction on the question is through a related question—what influence do lexical properties have on the course of early word learning? A body of research has explored this question by using corpus and survey data to estimate the age at which each of a large set of words is learned, and then finding properties of words' meaning or linguistic environment that predict those ages. Within a lexical category (e.g., nouns, verbs), English words that are more frequent in speech to children are likely to be learned earlier (Goodman et al., 2008). Further studies (also in English) have found evidence that age of acquisition is likely to be earlier for words that have more phonological neighbors (e.g., Storkel, 2001, 2004; Stokes, 2010; Jones and Brandt, 2019; but see Swingley and Aslin, 2007; Stager and Werker, 1997); words that share more associations with other words in the learning environment (Fourtassi, Bian, and Frank, 2018; Hills et al., 2009); words that occur more often in isolation (Brent and Siskind, 2001; Swingley and Humphrey, 2018); words whose meanings are more concrete (Gentner, 1978, 1981, 1982; Gleitman and Gleitman, 1997; Swingley and Humphrey, 2018); words that are rated more iconic and/or more associated with babies (e.g., "choo-choo" or "doggy," Massaro and Perlman, 2017; Perry et al., 2015; Thompson, Vinson, Woll, and Vigliocco, 2012); words which have sound symbolism (even in a different language, as in Kantartzis, Imai, and Kita, 2011); and words that occur in more distinctive spatial, temporal, and linguistic contexts (Clark, 1987; Ferrer-i-Cancho, 2013; Roy et al., 2015; Trueswell, Lin, Armstrong et al., 2016). For more on this domain, see Swingley (this volume). Building on this body of work, Braginsky et al. (2019) conducted a large-scale analysis of the predictors of early word learning across languages. They found a much greater degree of consistency in these predictors than would be expected by chance. Across ten languages (Croatian, Danish, English, French, Italian, Norwegian, Russian, Spanish, Swedish, and Turkish), words are more likely to be learned earlier if they are (in order of largest to smallest effect size) more frequent, shorter, more concrete, more frequently the only word in an utterance, more likely to appear in shorter utterances, more associated with babies, and more frequently the final word in an utterance. Additionally, these effects differ by lexical category: for content words (nouns, adjectives, and verbs) being more concrete and more frequent affect their learning the most, while for function words, being shorter and appearing in shorter sentences affect their learning the most. These patterns are supportive of the hypothesis that different word classes are learned in different ways, or at least that the bottleneck on learning tends to be different, leading to different information sources being more or less important across categories (Gleitman et al., 2005). For work on vocabulary growth differences across individuals, see Potter and Lew-Williams, this volume. Taken together, these results paint a picture of what sort of lexicon may be most conducive to early word learning.
In many ways, that ideal lexicon has properties that resemble the real-world lexical features discussed throughout this chapter. For instance, an ideal lexicon might have relatively high levels of phonological and associative density
Efficient Communication and The Organization of The Lexicon 219 and, indeed, much evidence suggests that lexicons tend to be highly clustered in phonetic space (Dautriche, Mahowald, Gibson, Christophe, and Piantadosi, 2017). A lexicon that is good for learning might also have a non-negligible set of words exhibiting iconicity: this too is a property we have seen across a variety of lexicons (Dautriche, Mahowald, Gibson, and Piantadosi, 2017; Monaghan et al., 2014). Finally, lexicons that are well suited for learning should have a tendency for words that are of communicative import for infants and toddlers to be relatively short, frequent, and amenable to appearing in supportive environments like shorter sentences and distinctive linguistic contexts (see Dautriche, Fibla, Fievet, and Christophe, 2018; Dautriche et al., 2015, for the role of contextual usage). In sum, the learnability advantages described here translate into the overall structure of the lexicon. In some sense, this is deeply unsurprising: the words that survive in languages are necessarily those that can be learned by children.
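The logic of the age-of-acquisition analyses described above can be illustrated with a deliberately simplified sketch. The numbers below are invented, and simple correlations stand in for the regressions used in the actual work (e.g., Braginsky et al., 2019); the point is only to show how word-level predictors are related to an estimated age of acquisition.

```python
# Simplified sketch (ours) of the logic behind age-of-acquisition analyses:
# given per-word predictors and an estimated AoA, ask which properties
# predict earlier learning. All values below are invented.
import math

#            word        log_freq  length  concreteness  AoA (months)
data = [
    ("ball",     6.1, 4, 4.9, 14),
    ("dog",      6.5, 3, 4.8, 15),
    ("milk",     5.9, 4, 4.7, 16),
    ("up",       7.0, 2, 2.5, 17),
    ("want",     6.8, 4, 2.1, 20),
    ("think",    6.6, 5, 1.8, 27),
    ("tomorrow", 5.2, 8, 1.9, 30),
]

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

aoa = [row[4] for row in data]
for name, idx in [("log frequency", 1), ("length", 2), ("concreteness", 3)]:
    print(f"corr({name}, AoA) = {corr([row[idx] for row in data], aoa):+.2f}")
# Expected signs on real data: higher frequency and concreteness -> earlier
# acquisition (negative correlation with AoA); greater length -> later (positive).
```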
10.6 Conclusion
By combining insights from communication theory, observations in functional linguistics, and results from cognitive science, we can see that the lexicon shows a number of efficient design properties that make it tractable and efficient for human language use. Specifically, the body of evidence presented here suggests that, to design an optimal lexicon for human language use and acquisition, one should do the following:
1. Structure the lexicon such that the words follow a near-Zipfian distribution.
2. Structure semantic spaces along a Pareto-optimal frontier, trading off complexity against the information in the system according to the communicative needs of the speakers.
3. Make the words that are easiest to predict in context short. Make words that are hard to predict in context long.
4. Make frequent words phonotactically probable, even if it means that frequent words will be more easily confused phonetically.
5. While the mapping between form and meaning is arbitrary, build a lexicon with some correlation between form and meaning.
Throughout this chapter, which depends on the notion of a lexicon that is pressured toward efficiency, we keep in mind criticism along the lines of Marcus and Davis (2013), who have argued that care should be taken when making claims about optimality in human cognition. A danger in this work is making optimality—or efficiency—a moving goalpost such that any behavior observed is defined as optimal. One advantage of working in language and specifically on the lexicon, however, is that communication theory provides a mathematically precise and rigorous definition of efficient behavior under different assumptions. And experiments and modeling work in developmental
220 K. MAHOWALD, I. DAUTRICHE, M. BRAGINSKY, AND E. GIBSON psychology give clear guidance as to what makes linguistic systems easier or harder to learn. Given that the lexicon is largely arbitrary and has a high number of degrees of freedom, it is an ideal place for exploring these pressures because the lexicon is flexible enough, and free enough, to adapt and change in response to different communicative pressures. Consider, for instance, the explosion in use of word shortenings that emerged as a result of the rise of texting (which, due to the tininess of phones and the largeness of thumbs, penalizes long words more strongly than other forms of communication): “ttyl,” “lol,” “imho,” and so on. Of course, no language will ever have a perfectly efficient lexicon because there are many competing demands in language, for example, efficiency in syntactic space and the sometimes contradictory pressures for learnability on the one hand and efficiency on the other. To that end, work purporting to show efficiency in the lexicon needs to always ask the question “efficient relative to what baseline?” Indeed, the use of baseline can affect the strength of the claims made. Moreover, the communicative needs of any individual speaker in any individual context are likely unique, and it is impractical to have a unique lexicon for every conversation and context—although note that any specialized context quickly develops its own vocabulary with its own shortenings and jargon. Rather, the lexicon of a language needs to be general enough to function in a wide variety of contexts. Nonetheless, it is noteworthy how many principles of efficient communication can be observed by measuring large-scale statistical properties both within and across languages. Some of the statistical properties reported in work here are straightforward: it is certainly not surprising that short words tend to be more frequent than long words. But much of the statistical work discussed here has contributed to measuring the statistical structure of various aspects of the lexicon. In doing so, doors have opened toward establishing newer and more detailed principles that guide how words and languages evolve to satisfy language users’ needs.
Chapter 11
Compositionality of concepts
Gala Stojnic and Ernie Lepore
11.1 Introduction
The nature of concept representation has been a hot topic among cognitive scientists for almost half a century. Nonetheless, early work in experimental cognitive science (psychology in particular) focused more on the nature of single (simple) concept representations, whereas the problem of concept composition and complex meaning was left largely neglected in psychological theories. Compositionality has been a matter of inquiry almost exclusively among philosophers and linguists, and the gap between their work and empirical research in cognitive psychology went largely unrecognized (see Hampton and Winter, 2017; see also Katz and Fodor, 1963). It was relatively late in the development of research on concept representation that cognitive psychology acknowledged the importance of the compositionality constraint and aimed to incorporate it as a necessary condition on cognitively plausible theories of concepts (see Gleitman, Connolly, and Armstrong, 2012; Margolis and Laurence, 1999; Osherson and Smith, 1981; Smith, Osherson, Rips, and Keane, 1988; Smith and Medin, 1981; Spalding and Gagné, 2015). As we will attempt to show in this chapter, this might be the very reason why our current understanding of what concepts are is still quite limited. A compositionality constraint, we believe, restricts the form of mental representation that concepts might have, and hence it should be central to any account of concepts. Compositionality of concepts is, thus, the central matter of this chapter. We discuss why any cognitively plausible theory of concepts must incorporate a compositionality constraint, and we show how some of the most influential theories of concept representation still fail to do so adequately. In the first section, we explain what is meant by the notion of concepts as well as why a plausible theory of concepts has to incorporate the compositionality constraint to satisfactorily account for the nature of mind. We, then,
turn to specific views on concept representation. In particular, we discuss Inferential Role Semantics and Prototype theories of concepts, as these views (especially the latter) have been particularly influential among cognitive scientists. We discuss whether these views are compatible with the compositionality constraint, and we conclude that they still fall short of giving us a satisfactory account of how simple concepts combine to form complex ones. Finally, we discuss the broader implications that this might have for theorizing about concept representation. We emphasize that the issue of concept composition must not be marginalized in the course of research on concept representation.
11.2 Concepts and why they ought to be compositional
The concept of concepts is central to cognitive science, as it refers to the building blocks of thought and is involved in virtually any higher mental activity, such as categorization, inference, reasoning, communication, and so forth. Conceptual representations allow us to attend to the relevant structures of the external environment and to learn about them, yielding a richer and more sophisticated understanding of the surrounding world. Indeed, there is an abundance of research suggesting that even infants possess rich conceptual knowledge of different core domains, such as the physical world and objects (Baillargeon, 1987; Kellman and Spelke, 1983), number (McCrink and Wynn, 2004; Wynn, 1992), agency (Gergely, Nádasdy, Csibra, and Biró, 1995; Newman, Keil, Kuhlmeier, and Wynn, 2010; Woodward, 1998), and even social agents' epistemic states (Scott and Baillargeon, 2009; Onishi and Baillargeon, 2005; Onishi, Baillargeon, and Leslie, 2007). Hence, it appears uncontroversial that to understand the nature of human cognition, one ought to have a satisfactory theory of concepts, specifying how they are structured, how they relate to the entities in the world, and—what will be the focus of this chapter—how they relate to one another to form complex meanings. To efficiently navigate the surrounding world, the human cognitive system has to be able not only to entertain single concepts (e.g., AGENT, CHAIR, SIT) but to flexibly manipulate them and combine them together to form more complex, meaningful expressions ((The) agent is sitting (on the) chair). That the content of complex concepts critically depends on the combinatorial structure of their constituents (single concepts) can be illustrated by the following example: assume that you want to express the thought that a big gray cat is chasing a small brown mouse. You understand the contents of the simple concepts involved in this event, that is, CAT, MOUSE, GRAY, CHASING, BROWN, BIG, SMALL. But each of these individual concepts needs to combine with the others in the right way to yield the content of the thought that a big gray cat is chasing a small brown mouse. Should you combine them differently, you end up expressing a different thought, for example, that a big gray mouse is chasing a small brown cat. The two thoughts
Compositionality of concepts 223 clearly are distinct, yet they are comprised of the same constituents.1 This leads us to the compositionality constraint: concepts ought to be compositional to allow for the formation of more complex ones. What this means is that the content of a complex concept/thought is determined by the content and structural properties of its simpler constituents (primitive concepts).2 Or, more precisely, we say that a set of concepts is compositional just in case the contents of its complex concepts is a function of their structure together with the contents of their constituent concepts. We work under the assumption that a theory of concepts must incorporate the compositionality constraint to provide a satisfactory, cognitively plausible account on human cognition.3 The non-negotiability of the compositionality constraint is justified by the fact that it is at the heart of some of the most striking properties that minds exhibit: most notably, productivity and systematicity. Productivity is the property of the mind that allows it to entertain an open-ended set of propositions, so that (in principle) indefinitely many concepts can be formed from a finite set of units. For instance, it is quite unlikely that you have ever entertained the thought that a blue cow was flying above the Moon’s crust, while drinking pink coffee, although you probably have a decent experience with each of the separate concepts that form it. Moreover, you can entertain the thought quite easily, although you might have never entertained the concepts BLUE COW or PINK COFFEE. Your ability to entertain this thought despite its novelty (and to entertain as many novel thoughts as you wish) rests on the compositional nature of concepts. The second key property of the mind—systematicity—reflects the fact that any human mind that can entertain a certain proposition P can entertain propositions semantically close to P. With respect to systematicity, for example, if a mind can entertain the proposition that aRb, then it can also entertain the proposition that bRa. So, if you can entertain the thought that Oedipus loves Jocasta, you can also entertain the thought that Jocasta loves Oedipus. For another example, if a mind can entertain the proposition that Jim is Tim’s brother, then it can also entertain the proposition that Tim is Jim’s brother. (Note that truth of these propositions is not an issue here. What matters is that having P in your cognitive repertoire entails that you can also entertain systematic variations of P, see Cummins, 1996.) Like productivity, systematicity also rests on the compositionality constraint—systematic permutations of single concepts in a complex one give rise to the formation of multiple semantically close ones. Following Fodor and Lepore (2002), we note that a cognitively plausible theory of conceptual content for a productive and systematic mind must specify two functions: a Composition Function (FC) that maps a finite basis of simple concepts onto infinity
1 On the relation between structural aspects (syntax) of an expression and its constituents (single words) in the process of word acquisition, see Lidz, this volume (also Gleitman and Trueswell, this volume).
2 Here, we assume that simple (or primitive) concepts roughly correspond to capitalized single words (e.g., DOG, YELLOW), whereas phrases formed by putting capitalized single words together (e.g., DOG + YELLOW → YELLOW DOG) correspond to complex concepts.
3 But see also Robbins (2002) for an opposing view.
224 Gala Stojnic and Ernie Lepore of complex ones together with structural descriptions; and an Interpretation Function (FI) that maps arbitrary concepts, simple or complex, onto their contents. An FC is needed because if minds are productive, then they entertain indefinitely many concepts. Practically without exception, people who accept this inference conclude infinitely many concepts must have internal structure; specifically, they must have simpler concepts as constituents. So, we explain why indefinitely many concepts can be entertained by positing that relatively complex ones can be constructed using more primitive ones as constituents. That concepts have constituent structure is thus essential to explaining compositionality. Accordingly, FC serves to specify the constituency relations concepts can enter into. Given the usual idealizations, there are indefinitely many concepts we can entertain. We can explain this productivity by postulating FI and FC, but we also require that thinkers be epistemically related to these functions in appropriate ways; they can grasp these functions. What explains productivity is that we grasp FI and FC. Of course, we can only grasp a function that is finitely specifiable. We generate a productivity problem by asking how a finite creature could have an infinite epistemic capacity. Clearly, either the existence of such infinite capacities is unproblematic, in which case the productivity problem doesn’t arise; or, if there really is a problem, the solution must not presuppose epistemic relations to infinite sets. That FI and FC must be finite objects doesn’t mean they must have finite extensions. A finite creature can get into an epistemic relation to an infinite set by being in an epistemic relation to a finite object that specifies the set. Similar considerations suggest that each set of concepts must be finitely specifiable, that the primitive basis from which complex concepts are constructed must be finite. In short, the interpretation that FI assigns to a certain concept must be computed from the structural description that FC assigns to it. Call this Principle P. So, suppose FI assigns to the complex concept BROWN COW an intersection. Principle P further requires that FI does so because FC assigns BROWN COW a structure that includes the constituent concepts BROWN and COW (in the appropriate configuration), that is, the operations FI performs must be sensitive to the structural descriptions FC enumerates, so that the structure of interpretation that FI assigns derives from the structure FC assigns. But imagine a theory that succeeds in getting the “right” extensions assigned to each of infinitely many concepts but without satisfying Principle P. Technically, such a theory would succeed in representing the productivity of a mind. It would represent the set of concepts as an infinite set of interpreted objects. But it would leave certain explanatory gaps in the resulting explanation of why minds are productive. Namely, it would be at a loss to explain what constituent structure of concepts is for. It would be idle in the general case. It would also fail to explain why a given concept has the interpretation it has. By contrast, when P is enforced, we can see why a concept with COW as a constituent (as in BROWN COW) has as part of its interpretation an interpretation of COW, likewise, for the complex concept NOT COW. 
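As an illustration only, the following sketch spells out FC and FI for a toy language of concepts, under the simplifying (and obviously contentious) assumption that contents can be modeled as sets of individuals in a small invented domain; intersective modification and negation stand in for the constituency relations discussed above, and all names and contents are placeholders.

# Toy extensional stand-in for the two functions: FC builds structural
# descriptions of complex concepts; FI computes contents from those
# descriptions together with the contents of the primitive basis.
DOMAIN = {"bessie", "clyde", "rex", "tweety"}

PRIMITIVE_CONTENTS = {           # finite basis of simple concepts
    "COW":   {"bessie", "clyde"},
    "BROWN": {"bessie", "rex"},
    "BIRD":  {"tweety"},
}

def FC(*parts):
    """Composition Function: map constituents onto a structural
    description of a complex concept (here, just a nested tuple)."""
    return tuple(parts)

def FI(concept):
    """Interpretation Function: compute the content of any concept,
    simple or complex, from its FC-assigned structure and the contents
    of its constituents (this is what Principle P demands)."""
    if isinstance(concept, str):
        return PRIMITIVE_CONTENTS[concept]
    op, *args = concept
    if op == "AND":              # e.g. BROWN COW = intersection
        return FI(args[0]) & FI(args[1])
    if op == "NOT":              # e.g. NOT COW = complement in the domain
        return DOMAIN - FI(args[0])
    raise ValueError(f"unknown structure: {op}")

brown_cow = FC("AND", "BROWN", "COW")
not_cow = FC("NOT", "COW")
print(FI(brown_cow))   # {'bessie'}
print(FI(not_cow))     # {'rex', 'tweety'}

Because every interpretation FI returns is computed from the structural description FC assigns, the toy respects Principle P, and a finite basis of primitives plus recursion already yields indefinitely many interpretable complex concepts.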
So, to account for the fundamental properties of flexible and productive human thinking, a satisfactory theory of concepts has to ensure that proper care is taken of the
compositionality constraint. However, many existing accounts of concept representation and concept meaning still fall short of satisfying this goal. In the next section, we discuss one such influential account—Inferential Role Semantics.
11.3 Inferential role semantics and compositionality
Compositionality poses a problem for accounts of conceptual content according to which inferential role (even partially) determines the content of complex concepts. Such accounts are known as Inferential Role Semantics (or Conceptual Role Semantics) and they take a functionalist approach to the mind, where what determines the content of a complex concept is the role it plays in the cognitive traffic of an agent (Brandom, 2000; Block, 1987; Harman, 1987; Miller and Johnson-Laird, 1976). In other words, this view posits that what determines the content of a mental representation, say a belief, is the cognitive (or inferential) role this belief plays as it interacts with other mental representations of an agent. We claim that inferential role theorists cannot explain the striking properties of productivity and systematicity of the mind, and the reason lies in their violation of the compositionality constraint. The problem is that the inferential roles of concepts are not themselves compositional, and so conceptual contents cannot be (even partially) determined by inferential roles. This argument obviously presupposes that inferential roles are not compositional, but why should we believe that? Well, consider the content of the complex concept BROWN COW; according to an inferential role theorist, the content of this concept depends on the contents of its simpler constituent concepts together with its structure, as compositionality requires. To a first approximation, the constituent contents are the concepts BROWN and COW, and the interpretation of the structure (ADJECTIVE + NOUN) is property conjunction. But suppose you happen to think brown cows are dangerous; then it is a part of the inferential role of the complex concept BROWN COW in your mind, at least, that it can figure in inferences like BROWN COW → DANGEROUS. But this fact doesn't seem to derive from the inferential roles of its constituents. Some, but not all, of the inferential potential of BROWN COW is determined by the respective inferential potentials of BROWN and COW, but the rest is apparently determined by "real world" beliefs about brown cows. In response to this worry, even if compositionality is assumed, and so conceptual contents cannot be identified with inferential roles in general, why couldn't we say that they can be identified with their inferential roles in analytic inferences?4 Thus, on the one hand, the inference from the complex concept BROWN COW to the complex concept BROWN
4 Analytic inferences are those whose validity depends only on their internal structure, without the need for any external knowledge. Synthetic inferences, by contrast, would require external empirical validation. As we note later in the text, the analytic/synthetic distinction is now largely deemed unprincipled among philosophers (e.g., Fodor and Lepore, 2006).
ANIMAL is compositional (inherited from the inference from the simpler constituent concept COW to the simpler constituent concept ANIMAL) and, on the other hand, precisely because it is compositional, we can say the inference from the complex concept BROWN COW to the complex concept BROWN ANIMAL is analytic. That is, compositional inferences are analytic, and analytic inferences are compositional. It is clear that this reply suffers from circularity. The proposal is intended to reconcile an inferential role account of concept individuation with the compositionality constraint by identifying the content of a concept with its inferential role in analytic inferences. But the difference between analytic inferences and inferences tout court is that the validity of the former is guaranteed by the contents of their constituent concepts. In short, on this reply, though we can't identify conceptual contents with inferential roles, since they aren't compositional, we can (partially) identify contents with inferential roles in analytic inferences, because analytic inferences are compositional. The cost of so doing, however, is a commitment to an analytic/synthetic distinction. And practically everybody thinks the analytic/synthetic distinction is unprincipled (e.g., Fodor and Lepore, 2006). So, if compositionality is non-negotiable, it follows that one of the foundational principles of cognitive science must go. Which one and what do we replace it with? One possible solution is to try to resuscitate a viable analytic/synthetic distinction. Plausibly, compositionality entails analyticity. How can the meaning of BROWN COW be compositional without the inference from BROWN COW to BROWN being analytically valid? If it is undeniable that the content of BROWN COW is constructed from the contents of BROWN and COW, it is equally undeniable that the inference from BROWN COW to BROWN is guaranteed by conceptual principles, but any inference guaranteed by such principles is analytic. So, on this counter-reply, the analytic/synthetic distinction that compositionality underwrites holds only between complex concepts and their constituents: it distinguishes, say, BROWN COW → BROWN from BROWN COW → DANGEROUS, but it does not underwrite a distinction between, say, BROWN COW → ANIMAL, and BROWN COW → DANGEROUS (that is, inferences that turn on the conceptual inventory of premises as opposed to conceptual structure alone). In short, if the content of the complex concept BROWN COW derives from the contents of its simpler conceptual constituents BROWN and COW (as compositionality requires), and if the content of COW is (partially) determined by its inferential role, then somehow BROWN COW inherits the inference to ANIMAL, but not to DANGEROUS, from the contents of its constituents. This requires that we exclude from the content of COW an inferential role such as COW → KIND OF X SUCH THAT THE BROWN X-S ARE DANGEROUS. But to exclude such inferences from the contents of COW and BROWN means these inferences are not constitutive of the contents of COW and BROWN. But this presupposes an analytic/synthetic distinction for conceptually governed inferences, and it is the analytic/synthetic distinction for these inferences that Quine (1951) famously jeopardized. We conclude that Inferential Role Semantics struggles to provide a principled way to account for the compositionality constraint.
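A deliberately crude way to see the shape of the problem is to write the inferential role of a concept as the set of inferences it licenses and to compose roles purely from constituent roles; the roles below are invented for illustration and make no claim about any particular inferential-role theory.

# Toy formalization of the worry above: the compositional part of the role
# of BROWN COW follows from the constituent roles, but belief-driven
# inferences such as BROWN COW -> DANGEROUS do not.
CONSTITUENT_ROLES = {
    "COW":   {("COW", "ANIMAL")},        # COW -> ANIMAL
    "BROWN": {("BROWN", "COLORED")},     # BROWN -> COLORED
}

def composed_role(adj, noun):
    # Inferences the complex ADJ NOUN concept inherits from its constituents:
    # the noun's inferences carry the adjective along (BROWN COW -> BROWN ANIMAL),
    # and the adjective's inferences apply to the whole (BROWN COW -> COLORED).
    role = set()
    for antecedent, consequent in CONSTITUENT_ROLES[noun]:
        role.add((f"{adj} {antecedent}", f"{adj} {consequent}"))
    for antecedent, consequent in CONSTITUENT_ROLES[adj]:
        role.add((f"{antecedent} {noun}", consequent))
    return role

inherited = composed_role("BROWN", "COW")
# A thinker who believes brown cows are dangerous also has this inference in
# the role of BROWN COW, but it is not derivable from the constituent roles:
belief_driven = {("BROWN COW", "DANGEROUS")}
print(inherited)                   # the compositional part of the role
print(belief_driven <= inherited)  # False: the rest comes from world knowledge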
We turn, then, to another influential proposal, namely, that conceptual content is (partially) determined by associated
stereotypes/prototypes. We will see that prototype theories, although they have dominated research in cognitive psychology for decades, also suffer from not being able to satisfy the compositionality constraint (in addition to the problems they face at the level of simple concept representation) and seem not to be (sufficiently) supported by empirical evidence.
11.4 Prototype theories and compositionality
In this section, we focus on prototype theories of concepts, and discuss their cognitive plausibility, both with respect to how well they account for the nature of single concept representations and with respect to whether they satisfactorily implement the compositionality constraint. We assume that these two issues (i.e., how concepts are represented and how they combine) are, in the end, inseparable. Hence, certain problems that these theories face on the level of single concept representations could be rooted in their (in)ability to account for the compositionality constraint. Prototype accounts of concept representation posit that concepts are represented as graded categories. So, for instance, the content of the concept BIRD for a certain individual is the most prototypical instance of birds that she associates with this concept (notice that this implies certain variations in concepts across different individuals, cultures, backgrounds, etc.). Historically, after the gradual demise of definitional theories,5 a lot of enthusiasm among cognitive scientists interested in how single concepts are represented centered on prototype theories, as they appeared quite promising from the experimental standpoint. In particular, Rosch and colleagues (e.g., Rosch, 1973, 1975) conducted a number of experiments that many interpreted as strong empirical support for prototypicality in concept representation. These experiments showed that adults would provide graded responses when asked to judge how well a certain instance represents its category (e.g., Rosch, 1973), that they would list more stereotypical members of a certain category first (Cree and McRae, 2003), and that their reaction times on semantic judgment tasks (presenting sentences of the form "X is a Y") would vary as a function of how typical X is considered to be for the category Y (Rips, Shoben, and Smith, 1973; Rosch, 1975a; Armstrong, Gleitman, and Gleitman, 1983). These findings encouraged many to undertake a similar approach to demonstrate prototypicality in numerous domains of human conceptualization, including culture (e.g., Sinha, 2002), love
5 Definitional theories attempted to explain concepts by appeal to necessary and sufficient conditions (features) that would allow for a clear distinction between members and non-members of a given category (see e.g., Russell, 1956). For instance, DOG would refer to all the features that all (and only) dogs have. It soon became clear that this view wouldn't give rise to a cognitively plausible theory of concept representation (let alone address the issue of compositionality), since finding the necessary and sufficient conditions turned out to be virtually impossible for a vast majority of everyday concepts.
(Aron and Westbay, 1996), and even prototypes of smokers and drinkers (Spijkerman and van den Eijnden, 2004). Prototype theory quickly established itself as the first-rate prototype among cognitive theories of concepts. Nonetheless, a number of authors have questioned whether the growing body of findings indeed reflected a prototypical conceptual representation. Notably, Armstrong et al. (1983) found that adults would provide graded responses for well-defined categories (such as ODD NUMBER) much like they would do for fuzzy categories; similarly, they would react faster to statements such as "7 is an odd number" compared to "13 is an odd number". Armstrong et al. interpreted these findings as undermining the prototypicality interpretation of Rosch and others, the reasoning being that if well-defined categories yield graded responses much like fuzzy ones do, one has no solid ground to unequivocally conclude that graded responses necessarily reflect graded concept representations (Armstrong et al., 1983, but see also LaRochelle, Richard, and Soulières, 2000). As the authors conclude, empirical findings that had been taken to demonstrate the prototype structure of concepts fall short of this aim; at best, they affirm that many concepts do have better or worse representatives, but not necessarily that they are prototypical in nature (Armstrong et al., 1983; Gleitman et al., 2012). Of course, there have been attempts to rescue the prototype approach by pointing out that Armstrong et al.'s findings might as well be interpreted as suggesting a certain prototypicality of well-defined concepts as well.6 Others pointed out that Rosch and colleagues only intended to show that prototypicality effects have to be taken into account when constructing a theory of concepts, but without making an assertion that the structure of conceptual representations must be prototypical in nature (e.g., Lakoff, 1987; see also Keil, 1989). Now, putting aside the matter of whether prototype interpretations are methodologically grounded or not, the Prototype theory would still have to incorporate the compositionality constraint in order to be cognitively plausible. As we established in the previous section, a satisfactory account of concepts ought to be able to explain the productivity and systematicity of the mind, properties that rest on the compositionality constraint. How do the prototypes do in this respect? In response to this question, some authors even indulge in a guarded optimism. Kamp and Partee (1995, p. 56), for instance, state that, "[...] when a suitably rich compositional theory [...] is developed, prototypes will be seen [...] as one property among many which only when taken altogether can support a compositional theory of combination." We are, however, inclined to be more skeptical when it comes to the prospects of prototype theories satisfying the compositionality constraint. Just as was the case with Inferential Role Semantics, the central question here is whether prototypes can be combined together to form complex concepts. Namely, for the proposal that contents might be (even partially) individuated by appeal to prototypes to work out, one has to show that prototypes are compositional, because contents must
6 As noted by Keil (1989), this would put us in the odd position of assuming that there are parts of a concept's content that do not affect which instances the concept picks out.
Compositionality of concepts 229 be compositional (this, again, stems from the properties of systematicity and productivity of the mind). However, prototypes do not seem to satisfy the compositionality constraint, so contents can’t be (even partially) identified with them. It is challenging for the Prototype theory to explain, for instance, how we get to a goldfish (prototypical pet fish) by combining, say, a trout (prototypical fish) and a dog (prototypical pet). Things get even more complicated if we consider the complex concept NOT CAT, for instance. This concept seems to not have a prototype at all. We will come back to these examples shortly. Given the demands of the productivity, systematicity, and compositionality of mind, we can say that there are two main objections against any account that (partially) individuates concepts by an appeal to prototypes. First, they can’t account for certain relations of logical equivalence among concepts.7 Second, they can’t predict the semantic relations between complex concepts and their constituents. To be more concrete, for indefinitely many Boolean concepts (i.e., complex concepts that are built from Boolean operators, such as AND, IF, NOT, OR) there isn’t any prototype, even though (i) the primitive constituent concepts all have prototypes, and (ii) the complex concept itself has a definite content. So, the complex concept NOT CAT has a definite interpretation, yet it clearly has no prototype. Likewise, there isn’t a prototypical non-prime number; there isn’t anything prototypically pink if it’s square and so on. The main theme behind these comments is that there are indefinitely many cases without a prototype corresponding to a complex Boolean concept. Faced with this problem, a theorist might admit that in indefinitely many cases, what a primitive concept transmits to its complex hosts is not its prototype. In the face of this criticism, there are not many attractive alternatives for the advocates of prototype-based conceptual content accounts. One could grant that the interpretation of a complex concept of the form NOT(F) isn’t computed from a prototype, but still defend prototypes. For example, one might deny that the interpretation of a complex Boolean concept of the form NOT(F) is computed from the conceptual content of F. As Kamp and Partee (e.g., 1995) write, . . . consider [the concept] red. The concept (is) not red does not appear to have a prototype; for how might one resolve the choice among white, green, black, yellow and all the other colors that red excludes? Nevertheless, the degree of membership in the concept not red is [sic] a matter of prototypicality. Only, the relevant prototype is not some prototype for not red but the prototype for red, and the degree to which something is not red is a matter of how little rather than how much, it resembles that prototype. (Kamp and Partee, 1995, p. 48)
7 We say that two statements are logically equivalent if, under all interpretations, they have the same truth value. In other words, two statements are logically equivalent if they have the same meaning.
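Before turning to the authors' criticism of this move, the following toy sketch makes the quoted proposal concrete: membership in NOT RED is read off (lack of) resemblance to the prototype for RED. The feature space, the particular similarity function, and the example items are all invented placeholders, not part of Kamp and Partee's account.

# Toy sketch of the suggestion quoted above: there is no prototype for
# NOT RED; its degree of membership is the inverse of resemblance to RED.
import math

RED_PROTOTYPE = {"hue": 0.0, "saturation": 0.9, "brightness": 0.5}

def similarity(item, prototype):
    """Similarity as an exponentially decaying function of distance
    in a made-up feature space (values assumed to lie in [0, 1])."""
    dist = math.sqrt(sum((item[f] - prototype[f]) ** 2 for f in prototype))
    return math.exp(-dist)

def membership_red(item):
    return similarity(item, RED_PROTOTYPE)

def membership_not_red(item):
    # No prototype of its own: membership is just lack of resemblance to RED.
    return 1.0 - membership_red(item)

fire_engine = {"hue": 0.02, "saturation": 0.95, "brightness": 0.55}
lime = {"hue": 0.33, "saturation": 0.8, "brightness": 0.6}
print(membership_not_red(fire_engine))  # low: a fire engine is a poor NOT RED instance
print(membership_not_red(lime))         # higher: a lime resembles the RED prototype less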
230 Gala Stojnic and Ernie Lepore However, recall that Principle P requires that the Interpretation Function (FI) assign to a complex concept an interpretation that is computed from the structural descriptions assigned to it by the FC. This means that if prototype for the concept RED is a fire engine in the thought THAT IS RED, then it is a fire engine in the complex Boolean thought THAT IS NOT RED too. However, RED does not contribute its CONTENT to NOT RED by contributing the prototype of RED to the prototype of NOT RED, and this is the sense in which Principle P is violated. Of course, one might argue that we could go without Principle P; but, we might very well wonder, is giving up Principle P coherent? Kamp and Partee tell us that, in computing an interpretation for NOT RED, “the relevant prototype is not some prototype for not red but the prototype for red, and the degree to which something is not red is a matter of how little rather than how much, it resembles that prototype.” So, in this case, prototype RED could still be seen as a fire engine in the complex thought THIS IS NOT RED. However, this solution faces a problem: how does the computation that assigns a conceptual content to NOT RED know that it is the prototype for the concept RED (and not, say, the concepts GREEN or SOUP or TRANSCENDENTAL) that it should consult when it does so? The answer that Kamp and Partee offer is that what FI computes over is not the prototype for RED but rather a representation of the logical form of NOT RED. However, notice that this means that (a) Principle P is still in force; that is, the meaning of NOT RED is computed from the content of NOT RED, not from the one of RED, and (b) the content of NOT RED relies on a logical form, not a prototype. In effect, what Kamp and Partee have opted for is the alternative that makes all complex Boolean concepts counterexamples to Prototype theory. There is another kind of case in which it is apparently not possible to provide the correct content of a complex concept even given its structure and the prototypes of its primitive constituents. The problem is not that the complex concept fails to have a prototype, as in many of the Boolean cases, but rather that an object’s similarity to the prototype for a complex concept seems not to vary systematically as a function of its similarity to the prototypes of the constituent concepts. For instance, consider the classical example of the complex concept PET FISH from Fodor and Lepore (e.g., 1995, 2002). Although one is likely to report that the prototypical pet fish is a goldfish, a goldfish is neither a prototypical fish nor a prototypical pet (in other words, a goldfish is a poor example of a fish, and a poor example of a pet, but it is quite a good example of a pet fish). The problem is that, prima facie, the distance of an arbitrary object from the prototypical pet fish is not a function of its distance from the prototypical pet and its distance from the prototypical fish. So, knowing that ‘pet’ and ‘fish’ have the prototypes they do does not permit one to predict that the prototypical pet fish is more like a goldfish than like a trout or a herring, on the one hand, or a dog or a cat on the other. A strategy to account for this would be to treat the failure of pet fish to be good examples of pets as a kind of context effect, like the failure of big ants to be good examples of big things. 
Kamp and Partee think that the content of BIG ANT is something like big for an ant, so that a really good example of a big ant would be something that’s as good an example of something big as an ant can be. Similarly, a really good example of a pet fish would be something that is as good an example of a pet as a fish can be. Unfortunately,
this assimilation of PET FISH to BIG ANT is ill-advised: PET FISH entails pet, but BIG ANT does not entail big. Kamp and Partee's proposal leaves this asymmetry entirely unexplained. Together, the examples of complex concepts described here point in the direction of prototypes not being compositional, but (as we elaborated above) if prototypes are not compositional, then the identification of contents with prototypes cannot explain the fundamental properties of mind, that is, its productivity and systematicity. So, can we save the prototype account without giving up on compositionality? We discuss the promise of such attempts below.
11.5 Reconciling prototypes with compositionality?
The examples presented in the previous section certainly pose a challenge for the prototype accounts of concepts. But can these accounts still be saved without giving up the compositionality constraint? There have been various attempts to do so (e.g., Smith, Osherson, Rips, and Keane, 1988; Wisniewski, 1997; Hampton, 2007, 2017, also Prinz, 2002). One answer, for instance, is that those cases where prototypes fail to exhibit compositionality are relatively exotic and involve phenomena which any account of compositionality is likely to find hard to deal with (e.g., Kamp and Partee, 1995; Osherson and Smith, 1988). However, as we aimed to show above, by no means do we entertain complex concepts such as, say, NOT RED only on rare occasions, nor would we consider them exotic. Given that indefinitely many Boolean complex concepts are hard (or even impossible) to explain by appeal to prototypes, and given the non-negotiability of the compositionality constraint, it seems that modification of such theories is necessary for them to preserve cognitive plausibility. A notable strategy to save the prototype account was proposed by Smith et al. (1988; Osherson and Smith, 1988), namely the Selective Modification Model (SMM). This model assumes that exemplars are replaced with matrices of weighted features. So, a simple concept would be a prototype represented by a matrix containing a number of dimensions (features) that are differently weighted (note, however, that the question of how to determine which features enter the matrix in the first place still remains vague in the SMM). For instance, a part of the concept COW is the dimension of color, where 'brown' is weighted heavily, while 'green' is assigned a low weight. If we added a modifier to form the concept GREEN COW, it would change the weights on the dimension of color: 'green' would be assigned a high value, whereas the value of 'brown' would drop down. However, the combination—GREEN COW—gets its own matrix, which (importantly) preserves all the simple concept's features with unaltered values (except for the explicitly modified dimensions). The appeal of this model lies in the fact that it allows simple prototype representations to be combined without losing their prototypical
features, as they are inherited in the complex concept's matrix as defaults. Since the new concept preserves the prototype's features, this should allow for inferring that a green cow is just as likely to have four legs and give milk as a typical (brown) cow. The upshot of this solution is that the less representative of a category the modified concept is, the more prototypical it becomes, since the (unusual for the category) modifier spares the majority of the prototype features' original values. So, the model yields a surprising prediction: a flying cow should be a better representative of cows than a green cow, since the modifier 'flying' doesn't alter the majority of features' values from the original COW matrix. This is known as the Default to the Stereotype (DS) strategy, which was proposed as an "escape route" for the Prototype theory when it comes to the compositionality constraint, but there is a problem that this strategy faces. Namely, how much does the feature 'purple' weigh in the matrix for the complex concept PURPLE APPLE? It must weigh more than the feature 'red' does in the matrix for the concept APPLE, since, though there can be non-red apples, all purple apples are purple. Purple has to weigh infinitely much in the feature matrix for the complex concept PURPLE APPLE because PURPLE APPLE → PURPLE is a conceptual (logical?) truth. So, we face a dilemma: we either treat conceptual (logical?) truths as extreme cases of statistically reliable truth, or admit that weights assigned to features in derived matrices are not compositionally determined even if the features themselves are. What sets the weight of PURPLE in PURPLE APPLE is not its prototype.8 As we see, the SMM still runs into problems when it comes to accounting for compositionality in a principled way. Surely this is not to minimize its importance in attempting to provide a solution for prototype accounts of concept composition. Our worry is that there is no promising way to maintain the Prototype theory as plausible if prototypes do not obey the principles of composition that the productivity and systematicity of the mind dictate. Before we conclude this section, we want to make a couple of additional remarks, going back to the notorious case of PET FISH. Conceptual inquiry will tell you pet fish are fish, but no merely conceptual inquiry will tell you which pet fish are prototypical. Why, then, don't we just admit that the story about compositionality does not work for the concept PET FISH, but argue the reason it does not is that PET FISH is a simple concept? On this proposal, the prototypical ANs are the intersection of the prototypical A
8 Connolly, Fodor, Gleitman, and Gleitman (2007) provide some interesting empirical findings that challenge DS. They asked subjects to judge how likely sentences built around a central noun were to be true, where the way in which the noun was modified was manipulated—for example, "Ducks have webbed feet," "Quacking ducks have webbed feet," "Baby ducks have webbed feet," "Baby Peruvian ducks have webbed feet" (Connolly et al., 2007). Contrary to the prediction of the SMM, subjects' judgments differed systematically across conditions, although the modifiers were set up so as not to alter the original features of the concept 'DUCK'; this indicates that subjects were not defaulting to the prototype (Connolly et al., 2007; but see also Jönsson and Hampton, 2007 for an alternative interpretation). This evidence indicates that, indeed, the computations involved in concept composition cannot be explained away by appeal to the blind resetting of the prototype's matrix of features.
Compositionality of concepts 233 things with the prototypical N things in the unmarked case, but you default to the unmarked case only if you do not have specific information to the contrary. When you’ve learned which fish people keep as pets, you learn to override the inference that conceptual content supports. The problem with this defaulting strategy is that it is irrational. For one, all being equal, the more complex a concept is, the less you are likely to have special knowledge about things in its extension. We know little about cows qua cows, but nothing about brown cows owned by people whose last names start with “W” qua brown cows owned by people whose last names start with “W.” The upshot is that if the strategy is to default to compositional prototype when we have no special information to contrary, then the more heavily modified a concept is, the more likely we are to default to its compositional prototype. Secondly, the more complex a concept is, the less likely its prototype is predicted by prototypes of its constituents. For example, pet fish aren’t good bets for satisfying the pet prototype, but still less so are pet fish that live in Armenia, and still less so are pet fish who live in Armenia and have swallowed their owners, and so forth. Given these two points, it’s clear why defaulting to the compositional prototype is an irrational strategy: The more complexly modified a concept is, the more likely it is that you are required to default to the compositional prototype. But the more complexly modified a concept is, the more likely it is that defaulting to the compositional prototype will give you the wrong content. It’s obviously irrational to employ a strategy if the more likely you are to use it, the more likely it is to fail. The bottom line is that the complex concept PET FISH is a counterexample to the compositionality of prototypes. The reason you can’t derive the PET FISH prototype given the PET prototype and the FISH prototype, is that what kinds of fish people keep as pets is about the world, not about concepts. It is therefore possible to be clear what the conceptual content of PET FISH is, and yet have no idea which pet fish are prototypical. Which fish are prototypical is something you have to go out and learn. The structure of the complex concept PET FISH assures you the prototypical pet fish is a pet and a fish, just as it assures you that the prototypical big ant is big for an ant. After that, you’re on your own.
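For readers who want the selective-modification idea discussed in this section in a more explicit form, the sketch below models a prototype as a matrix of weighted feature values and lets a modifier boost one value on its dimension while leaving the other dimensions at their defaults; all dimensions, values, and weights are invented for illustration and do not reproduce the model's actual parameters.

# Toy sketch of selective modification: a modifier resets one dimension of
# the prototype matrix; every other dimension keeps its default weights.
import copy

COW = {
    "color": {"brown": 0.7, "green": 0.05},
    "legs":  {"four": 0.95},
    "gives_milk": {"yes": 0.9},
}

def modify(prototype, dimension, value, boost=0.95):
    """Return a new matrix in which `value` dominates `dimension`;
    all other dimensions keep their default (prototype) weights."""
    combined = copy.deepcopy(prototype)
    combined[dimension] = {v: 0.0 for v in combined.get(dimension, {})}
    combined[dimension][value] = boost
    return combined

def typicality(instance, matrix):
    """Crude typicality score: sum of the weights of the instance's
    values on each dimension of the matrix."""
    return sum(matrix[d].get(instance.get(d), 0.0) for d in matrix)

green_cow = modify(COW, "color", "green")
print(typicality({"color": "green", "legs": "four", "gives_milk": "yes"}, green_cow))
# Defaults survive modification, so a green cow is still expected to have four
# legs and give milk; and, as the text notes, an unusual modifier leaves most
# default weights untouched, which is exactly where the trouble starts.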
11.6 Concluding remarks In this chapter, we focused on one of the critical constraints that a theory of concepts must incorporate to be cognitively plausible, namely, the compositionality constraint. As we argued, the compositionality constraint is necessary to account for the productivity and systematicity of the human mind. Without compositionality, it would be impossible to explain how humans are (in principle) capable of entertaining (and comprehending) indefinitely many new thoughts out of a finite number of primitive ideas. Hence, any account of what concepts are and how they are represented in the mind, has to be able to satisfactorily incorporate this constraint, for giving up on it would imply giving up on fundamental properties of the mind.
234 Gala Stojnic and Ernie Lepore We critically discussed some of the most influential theories of concepts, analyzing whether they manage to satisfy the compositionality constraint. Inferential Role Semantics, particularly popular among philosophers and linguists, struggle to do so because of how they account for the nature of single concepts: namely, a concept’s content is identified with its cognitive role; but unfortunately, cognitive roles are not compositional. Therefore, this account ultimately fails to tell us how the mind combines simple concepts to form indefinitely many complex thoughts. Similarly, Prototype theories seem to struggle with the combinatorics of thought units, again, because of how they conceptualize the nature of single concepts. It appears that, although many advocates of prototypes do acknowledge the significance of this constraint (e.g., Kamp and Partee, 1995; Osherson and Smith, 1988), there is still no efficient strategy to explain how one would combine simple prototypes to form complex concepts. Just as inferential roles are not compositional, so prototypes aren’t either. Hence, they cannot explain the productivity and systematicity of mind. Of course, Inferential Roles Semantics and Prototype theories (although probably the most influential ones), are not the only accounts on concepts among thinkers interested in the nature of thought. Some authors, for instance, hold that concepts could be image- like structures (e.g., Hume 1739; Kosslyn, 1994), or theory-like structures (e.g., Carey, 2009; Gopnik and Meltzoff, 1997). Others completely give up the symbolic approach to cognition and offer a perspective of concepts as emerging from a distributed neural network, that is, from a brain’s adaptations to external experience (Barsalou, 2017, 2008a, 2008b, 2016b; McClelland and Rumelhart, 1985; Lebois, Wilson-Mendenhall, and Barsalou, 2015; Casasanto and Lupyan, 2015).9 Putting aside the question of whether concepts are pictures, prototypes, or scientific theories, we want to draw attention to an issue that seems to be shared across different theorists. It is the issue of treating the compositionality constraint as a research topic that is almost orthogonal to the nature of concept representations. That is, many accounts have focused heavily on how single concepts are represented, while compositionality of concepts has been surprisingly marginalized. This is not to claim that these accounts have completely ignored compositionality, but rather that they have treated it is a secondary concern. So, satisfying the compositionality constraint might be seen as an additional bonus for a theory, while failing to do so does not necessarily endanger it; one might play around with the original theory to satisfy this constraint, yet with surprising reluctance to explore it as a central issue. 9 For instance, Barsalou (2017, 2008a, 2008b, 2016b) offers a “grounded approach”—instead of being a precisely fixed symbol, a concept is based on a dynamic, distributed neural network, which flexibly adjusts to present situational factors and which is a result of accumulated experiences of this network’s interactions with the relevant category’s instances. Concept composition is achieved in virtue of multimodal simulations of sensory experiences that are coupled with the relevant categories that one entertains at present. 
Although this approach does indeed attempt to account for compositionality, it assumes the abandonment of the symbolic approach to cognition; whether this is justified is itself a controversial matter, which goes beyond the scope of this chapter (for discussions of this issue, see Fodor and Pylyshyn, 1988; McLaughlin, 1993).
Compositionality of concepts 235 We believe this won’t work, as we demonstrated in this chapter, using two prominent theoretical accounts of concepts as an illustration. The reason lies in the fact that compositionality constrains what concepts can be, that is, how single concepts might be represented in a cognitive system. Hence, we note that it should be the starting point (or the central point) for any cognitively plausible theory of concepts, not a theoretical nuisance that might be swept under the carpet if the theory cannot account for it. So, if prototypes, for instance, do not combine to give rise to productivity and systematicity of mind we have reasons to say that concepts might not be prototypes. Note that by no means does this mean that the rich body of empirical findings stemming from Prototype (or other) accounts is to be disavowed. Instead, we want to make two points: (1) interpreting existing (and future) empirical findings ought to be done with regards to the compositionality constraint; (2) the constraint should serve as a guide for future experimental work on concepts. This means subjecting specifically the composition of concepts to empirical tests, with the aim of discerning its cognitive mechanisms. There are, certainly, researchers who move in this direction and we encourage the future continuation of this progress.10
10 See, for example, Piantadosi and Aslin (2016), who explored compositional reasoning in early childhood and found that 3.5- to 4.5-year-old children were able to compose two novel functions, which the authors interpret as suggesting an early capacity for compositional reasoning.
Chapter 12
Language and thought
The lexicon and beyond
Barbara Landau
12.1 Introduction 12.1.1 Language, thought and mind At the heart of understanding the human mind lies the question of how language affects thought. Humans alone possess language, a powerful, richly specified, multi-leveled, generative computational system of the mind. But does having language change the nature of human thought, and if so, how? There have typically been two different kinds of answers. One focuses on changes in thought that follow from learning a specific language, say English or Spanish or Russian or Tzeltal. The idea is that, because of cross-linguistic variation in the particular distinctions that are made in a given language, learning any particular language (compared to another) will cause corresponding differences in the non-linguistic representation of our thoughts. So, for example, speakers of languages (such as English) that typically encode the manner of motion of an event in the main verb of a sentence may have different non-linguistic representations of motion events from speakers of languages (such as Spanish or Greek), that encode manner of motion outside of the main clause. The second kind of answer focuses on changes in thought that follow from learning any human language, compared to having no language at all. This is the idea that learning a language allows us to represent our thought in wholly new and more powerful ways—perhaps even qualitatively different ways—than would be possible without language. These two kinds of answers are not new; indeed, they are at the heart of classic readings on language and thought. For the first view, Whorf (1956/1941) famously argued that “habitual use” of one’s language results in a parsing of the world that conforms to the
Language and thought 237 distinctions made in that language, even when language is not being used, resulting in different ways of thinking by speakers of different languages. According to Whorf (1940, pp. 229–231), “The categories and types that we isolate from the world of phenomena we do not find there because they stare every observer in the face. On the contrary, the world is presented in a kaleidoscopic flux of impressions which have to be organized in our minds. This means, largely, by the linguistic system in our minds.” That is, the language we learn, and speak as adults, forces particular organizations on our thoughts. For the second view, Quine (1960) argued that learning a language has a powerful effect on the way that children represent conceptual entities. For example, he argued it is only by learning specific words in context that children can come to represent (and hence express) certain concepts, for example, the distinction between individuated vs. non-individuated entities, encoded by ‘mass’ vs. ‘count’ words in English and other languages.1 Alongside a century of serious scientific inquiry into these issues, questions about the relationship between language and thought have also been of deep interest to the broader public. We can see this clearly through recent articles and debates in The Economist (2010), book reviews in the New York Times (Deutscher, 2010) and even Ted Talks. This coverage has been strongly biased toward representing the views of researchers who take a strong stance on the causal role of language in changing human cognition, following Whorf ’s theory that learning a specific language causes significant changes in human thought. Examples of these views currently in the public domain include claims that one’s native color lexicon affects our perception of color (Boroditsky, 2018; Evans, 2017; Gibson, 2017), that the set of spatial distinctions present in one’s native language affects our ability to orient ourselves in space (Deutscher, 2010), and that the means of marking tense in one’s native language affects our tendency to save money (Chen, 2013). In the latter Ted Talk, the economist Keith Chen argued that speakers of Mandarin, who do not mark present vs. future tense in the same way as speakers of English, do not distinguish between past-present-future parts of the timeline in the same way as speakers of English, leading to differences in how one views the future and therefore the extent to which one saves or not. This claim has been powerfully debunked by the linguist Mark Liberman (2012), who shows that Chen’s analytic method fails to account for correlated effects of culture. But Chen’s Ted Talk has had close to two million hits at the time of this writing, suggesting that this view is attractive to many. Indeed, the leap from cross-linguistic differences in tense marking to differences in one’s sense of time echoes 1 By contrast, with these strong advocates of the idea that language changes thought, Fodor (1975, 1981) argued that language itself cannot lead to new concepts, simply because it is logically impossible for people to learn a new concept through language: Learning the meaning of a new word requires that we represent in advance the conceptual primitives underlying the meaning. New lexical items may be created by combining conceptual primitives, but the primitives themselves cannot be created; they must exist prior to ‘learning’ the corresponding word. 
In fact, learning the word involves simply mapping a given form to a pre-existing meaning. See Jackendoff (this volume) and Stojnic and Lepore (this volume) for related discussion.
238 Barbara Landau Whorf ’s (1956/1939) widely cited conclusions linking lack of “tenses like ours” (p. 144) in the Hopi language, to a corresponding lack of sense of time as represented by members of Western cultures. He said: Concepts of “time” and “matter” are not given in substantially the same form by experience to all men but depend upon the nature of the language or languages through the use of which they have been developed. (Whorf, 1956/1939, p. 158)
It may seem surprising that such strong conclusions about the effects of language on non-linguistic understanding still exist in the literature, but exist they do. Fortunately, there is now a rich literature that sets out more nuanced hypotheses and more varied empirical literature that allows us to move us beyond the quite strong view that possessing a particular set of words in one language will inevitably change one’s non-linguistic representations and do so to such a large degree. In this chapter, I aim to move beyond such views and lay out several more specific and nuanced hypotheses about whether and how language changes thought, reviewing rich empirical data that help us understand the role of language in human thought. I begin (Section 12.1.2) by laying some groundwork for the remaining discussion, by ruling out a false (but commonly suggested) hypothesis about the relationship between language and thought—that they are ‘the same,’ and that, in fact, we think ‘in’ language. I then give a selective review of the field, focusing on three different hypotheses about whether and how language affects thought. In Section 12.2.1, I focus on what I call Version 1 (V1), associated with the classical Whorfian hypothesis. This version addresses the question of whether learning a specific language (e.g., English vs. Spanish or Greek) causes us to form different non-linguistic representations, leading to true differences in the way that speakers of different languages represent the world non-linguistically. This hypothesis has been offered with respect to several domains, including color, space, motion events, time, and gender. My discussion will focus on color and spatial relationships, as these have generated quite a large body of data that can be used to evaluate the hypothesis that non-linguistic representations change as a consequence of learning and/or life-long usage of a specific language. There is, of course, no question that there is significant variation across languages in the specific encodings that are chosen; lexicons vary across languages, even for what one might consider ‘rock-bottom’ categories, such as color, space, and time (Berlin and Kay, 1969; Levinson and Wilkins, 2006; Haspelmath, 1997). Lexicons also vary across languages in the domains of body parts, object categories, kinds of action, and kinship terms (see Malt and Majid, 2013 for review). But the fact of linguistic variation does not inevitably lead to a conclusion that it leads to corresponding changes in non-linguistic thought—either in the strong sense that Whorf intended, or in the many weaker versions that have been proposed over the past few decades (see discussion of color terms in Section 12.2.1). I will argue that the bulk of empirical evidence testing V1 does not support the strong claim that learning a particular language (compared to another)
Language and thought 239 changes one’s non-linguistic representations in ways that reflect the differences between the languages. In Section 12.2.2, I focus on a second version of the hypothesis that language affects our thought. Version 2 (V2) proposes that having a language—any language—results in changes to the content and/or format of human non-linguistic knowledge, allowing humans to represent their prior, non-linguistic knowledge in wholly new and more powerful ways than would be possible without language. This hypothesis has been offered most prominently in the domains of space, number, and Theory of Mind (TOM); accordingly, I consider this hypothesis in the context of these domains. Finally, in Section 12.2.3, I review a third version of the general hypothesis that having a language affects our thought. Version 3 (V3), the ‘momentary recoding hypothesis’ (Landau, Goldberg, and Dessalegn, 2010; Dessalegn and Landau, 2013) proposes that many empirical effects shown in both V1 and V2 can be explained by the idea that language has a powerful effect on human cognition by recoding non-linguistic representations ‘in the moment,’ as one carries out the task at hand. Unlike V1, this view does not entail that there are any permanent changes in our prior non-linguistic representations. Rather, it embraces the idea that our non-linguistic representations remain the same before, during, and after the recoding, but that use of language results in a temporary hybrid representation that combines the non-linguistic and linguistic representations. Unlike V2, V3 does not entail that language enables the construction of wholly new and permanent representations. V3 hypothesizes temporary binding of linguistic and non-linguistic information, followed by no real change in the content or format of the non-linguistic representations that underwent this temporary binding.
12.1.2 Is language the same as thought? (Hint: The answer is ‘no,’ for logical and empirical reasons)
The first question I ask students in my language and thought seminar is “What is the relationship between language and thought?” Overwhelmingly, their answer is “We think in language.” Or, similarly, “That’s how we think.” These conclusions, as appealing as they may seem to some, can be shown to be false on numerous grounds. First, as Fodor (2001) pointed out, the idea that we think ‘in’ language would fail to explain the fact that words and sentences can be highly ambiguous, yet our thoughts are not; even though individual words (e.g., ‘bat,’ ‘bank,’ ‘bug’) and sentences (e.g., ‘Visiting relatives can be a nuisance.’) have multiple meanings, the thoughts underlying these different meanings are distinct, that is, not themselves ambiguous (see Gleitman and Papafragou, 2012 for more discussion). Second, the idea that the contents of our thought exist ‘in’ language would fail to capture the fact that language encodes only certain aspects of our
thoughts; we know this because we readily fill in the ‘extra’ information missing when someone says, “Can you pass the salt?” or “Enough already.” Finally, language marks distinctions that are not obviously available through other, non-linguistic modalities, and vice versa, so they cannot be exactly equivalent. For example, although even babies possess non-linguistic representations that encode single objects as well as sets (Spelke, 1990; Rosenberg and Feigenson, 2013), there is nowhere in the visual system an obvious encoding of the distinction between a ‘type’ of object (e.g., a tiger) and a ‘token’ of that object (e.g., that tiger) (Jackendoff, 1987, and this volume, for additional examples). Conversely, language is surprisingly poor at encoding certain properties that are naturally encoded in other non-linguistic systems. Whereas the visual system specializes in encoding the precise structure of an individual human face, the pattern of spots on a Dalmatian, or the precise contours of a path through space, language is a remarkably poor (at best coarse) tool for encoding any of these (Landau and Jackendoff, 1993). Not everything that is represented by our non-linguistic systems of knowledge is also encoded by language. So no, we do not think ‘in’ language—that would underestimate the many-to-many mappings between our thoughts and our linguistic expression of those thoughts, the rich inferential capacity we have to fill in all of the understandings that go far beyond the spoken word or sentence, and the complementarity of encoding by language relative to encoding by other cognitive systems. Two other kinds of evidence argue conclusively against the hypothesis that we think ‘in’ language. First, this hypothesis leads to the false inference that we cannot think without language. Yet, as decades of empirical evidence have shown, pre-linguistic babies possess rich conceptual knowledge (Spelke and Kinzler, 2007; Baillargeon, 2002) and non-linguistic species possess remarkably complex cognitive systems that are used to understand and function in the world (Gallistel, Brown, Carey, Gelman, and Keil, 1991). The evidence on rich, pre-linguistic cognitive representations in humans is at present overwhelming. For example, infants represent objects as complete bodies that move coherently through space in accord with principles of physics (Spelke, Breinlinger, Macomber, and Jacobson, 1992), as entities that afford the functional properties of support and containment (Baillargeon and deJong, 2017), and as entities that are individuated (Xu and Carey, 1996). Infants possess representations of sets of objects up to three individuals, as well as approximate representations of large numerosities (Feigenson, Dehaene, and Spelke, 2004). They represent spatial arrays in terms of their geometric properties, and use this information to guide their reorientation in space (Hermer and Spelke, 1996). Infants also represent higher-level conceptual/functional roles for objects in motion events, including ‘sources’ and ‘goals’ (Lakusta, Wagner, O’Hearn, and Landau, 2007). New research programs are probing infants’ understanding of logical operators such as disjunction, negation, and even the distinction between collective vs. distributive concepts (‘all,’ ‘each’), which have in the past been linked solely to language. Although we do not yet know conclusively whether these logical operators are fully mature prior to language learning (Mody and Carey, 2016; Feiman, Mody, Sanborn, and Carey, 2017),
evidence suggests that infants possess at least the rudiments of these operators (see, e.g., Cesana-Arlotti, Martín, Téglás et al., 2018). Second, the idea that we think ‘in’ language entails that cross-linguistic variation must cause differences in non-linguistic representations. That is, we are ‘thought prisoners’ of the language we speak. This hypothesis has been empirically tested in many studies over the past decades, with quite mixed results. Generally, experiments that have purportedly shown effects of language on non-linguistic representations have tended to include both linguistic tasks and matched ‘non-linguistic’ tasks; but upon close inspection, many of the latter naturally invite linguistic encoding. These methodological confounds are reviewed in several papers (Gleitman and Papafragou, 2012; Munnich and Landau, 2003), and the next section will review some cases that are particularly telling. Suffice it to say, at this point, when scientists have compared carefully controlled linguistic vs. non-linguistic conditions, there is much evidence that universal non-linguistic representations peacefully co-exist with cross-linguistic variation. For example, Munnich, Landau, and Dosher (2001) compared English, Japanese, and Korean speakers’ naming and memory for locations of a dot relative to a square, focusing on the obligatory English lexical distinction between locations ‘on’ and ‘above,’ not present in Japanese or Korean. They found large differences in people’s descriptions of such locations, consistent with known patterns of each language, but no differences among language groups in their memory accuracy for the different locations, even those that crossed or collapsed linguistic boundaries. Papafragou, Hulbert, and Trueswell (2006) found that speakers of English vs. Greek differed in their attentional patterns (revealed by eye movements) during a linguistic task, but not a non-linguistic task. Malt, Sloman, Gennari, Shi, and Wang (1999) examined both naming of common household containers and non-linguistic grouping of the same, and found that there were significant differences in the naming patterns of English, Chinese, and Spanish speakers, but little to no differences in their non-linguistic grouping patterns. In sum, the idea that we think ‘in’ language can be ruled out by both logical arguments and empirical data. Language and thought are not exactly equivalent; pre-linguistic babies have rich conceptual thought, and cross-linguistic differences have been shown to co-exist peacefully with similarities in speakers’ underlying non-linguistic representations. So does language affect thought? It might depend on the details of one’s hypothesis.
12.2 Classic and newer versions of how language might affect thought
In this section, I discuss three different versions of the hypothesis that language affects thought. Version 1 (V1) is rooted in Whorf’s general framework, focusing on whether cross-linguistic differences in the lexicon (i.e., the vocabulary) and/or morphology and
syntax lead to corresponding changes in our non-linguistic representations. The essential methodological tool here is studying groups of individuals whose native language varies in some critical way, and asking whether their non-linguistic representations are driven toward representing that variation. Version 2 (V2) is in some ways more radical, proposing that learning any language confers new representational power, including wholly new kinds of content and structure, beyond that which is possible for non-linguistic species, or individuals who have never fully learned a first language. The essential methodologies here vary widely, including studies of the cognitive capacities of young children who are in the earliest stages of language learning, the capacities of deaf individuals who have never been exposed to a sign language but have invented their own and then built on these through interaction with other such signers, and the capacities of species that have no human language at all (e.g., chicks, birds, rats, etc.). Both V1 and V2 of the language-affects-thought hypothesis encounter challenges, which I also review. Version 3 (V3) seeks to resolve some of these challenges by proposing that language is a ‘momentary’ enabler of many tasks, recoding information from non-linguistic representations into a linguistic representation, which carries relevant information for the task at hand, but does not change the underlying non-linguistic representations. This proposal draws on studies relevant to both V1 and V2, and includes some detailed looks at the time-bound nature of changes in cognition with and without the use of language.
12.2.1 V1: The classic version: learning language x vs. y results in corresponding changes to our non-linguistic representations
This version follows Whorf’s original formulation in positing that cross-linguistic differences will inevitably organize what is otherwise a “kaleidoscopic flux of impressions which have to be organized in our minds” (Whorf, 1940, pp. 229–231). This hypothesis has drawn interest over the decades with studies in a variety of different domains. These include color (with a focus on cross-linguistic differences in color categories), objects and substances (which are distinguished by count vs. mass nouns in some languages, Soja, Carey, and Spelke, 1991), motion events (e.g., the representation of manner of motion and path, which are “packaged” differently across languages; Papafragou, Trueswell, and Hulbert, 2006; Gennari, Sloman, Malt, and Fitch, 2002), spatial location (specifically, variation in the use of different reference systems to linguistically encode location; Pederson, Danziger, Wilkins, Levinson, Kita, and Senft, 1998; Li and Gleitman, 2002, inter alia), and object categories (specifically, the ability to group the same objects in different ways; Malt, Sloman, Gennari, Shi, and Wang, 1999). I focus here on two domains: color and spatial location, because of their steady prominence in the literature and their clear predictions of change in non-linguistic representations consequent on learning different languages.
12.2.1.1 Color terms and color perception
The domain of color has occupied a special place in this arena, providing one of the most enduring tests of the Whorfian hypothesis, and for good reason: It is well known, since the seminal studies of Berlin and Kay (1969), that there is significant cross-linguistic variation in the color lexicon, with some languages having as many as 11 or 12 ‘basic’ color terms (e.g., English and Russian, respectively), while others have five or fewer (e.g., the Himba: Franklin, Clifford, Williamson, and Davies, 2005; languages of the Dani peoples: Heider and Olivier, 1972). If differences in the color lexicon cause people to have different color perception, this would be a strong piece of evidence that linguistic experience can alter what is arguably a universal system that is rooted in the biology of the visual system, unlikely to be strongly affected by differences in culture or experience. Brown and Lenneberg (1954; for review, see Brown, 1976) were arguably the first psychologists to carry out empirical studies addressing this possibility. To do so, they studied the relationship, within speakers of English, between ‘codability’ (roughly, consensus and brevity in naming a color) and ‘memorability’ (roughly, the ease of recognizing a color after a delay). As Brown (1976) says, at the time they assumed that this relationship would hold across different linguistic communities; that is, consistent with Whorf’s idea, variation in naming would lead to variation in memorability. As it turned out, in the early 1970s, this assumption was shown to be wrong. It was Eleanor Heider (Brown’s graduate student) who carried out the critical, groundbreaking studies. Heider and Olivier (1972) examined the differences in naming and memory between native speakers of English, which has 11 basic color terms, and the Dani peoples, whose languages include only two color terms, mili (cool/dark, roughly including English blue, green, black) and mola (warm/light, roughly including English red, yellow, white). Heider and Olivier found the expected differences between these groups in how they named color chips, but little to no difference between the groups’ memory for the same colors. These findings would appear to put to rest the question of whether differences in color terminology lead to differences in color memorability. But more recent studies have re-opened the case with different methods and comparisons across different cultures (see, e.g., Roberson, Davies, and Davidoff, 2000), and with very different conclusions from those made by Heider and Olivier. Most notably, Gilbert, Regier, Kay, and Ivry (2006) rooted their approach in findings from Kay and Kempton (1984), who studied the perceptual judgments of speakers of Tarahumara, a Uto-Aztecan language of Mexico that uses a single color term for the range of hues that English speakers call ‘blue’ or ‘green.’ Gilbert et al. argued that Kay and Kempton’s findings suggest that language affects perceptual discrimination “through the spontaneous but unspoken use of lexical codes” (p. 489) and so might preferentially involve the left hemisphere of the brain, which houses language in the large majority of speakers. In their experiments, Gilbert et al.
used a visual search task to test the hypothesis that discriminating ‘between-category’ hues (i.e., those from different named categories, e.g., ‘green’ or ‘blue’) would be faster when displayed in the right visual field (RVF, which projects to the left hemisphere of the brain) compared to the left visual field
(LVF, which projects to the right hemisphere). They also conjectured that the between-category effects should be disrupted when participants’ language resources were subject to interference, by forcing people to simultaneously carry out a secondary language task. Participants fixated a center location, and were then presented with a ring of colored squares in their right or left visual field. Each ring included a single ‘oddball’ square, which was a different hue from all of the other squares. On ‘between-category’ trials, the oddball was a hue that fell into a different named category from the other squares (e.g., a ‘blue’ oddball in an array with all other squares the same ‘green’). On ‘within-category’ trials, the oddball was a hue that fell within the same named category as the other squares (e.g., a ‘green’ oddball in an array with all other squares a different shade of ‘green’). Results showed faster response times for the between-category pairs (than within-category pairs) when presented in the right visual field, but no differences for the left visual field. When a verbal interference task was added, the between-category advantage disappeared in the right visual field, and in fact, reversed, with the within-category pair showing faster response times. Gilbert et al. drew strong conclusions from these findings. For example, they conclude that “the results of Experiment 1 are consistent with the hypothesis that linguistic categories selectively influence color discrimination in the RVF” (p. 490, author’s emphasis); “For the visual search task . . . the Whorf hypothesis is supported most strongly, even exclusively in the RVF” (p. 492); and “ . . . whether language affects perception, post-perceptual processes, or both, any influence of language on perceptual discrimination clearly falls within the broad formulation of Whorf’s hypothesis” (pp. 492–493). These conclusions can be challenged on several counts. First, attempts to replicate even the basic lateralization of the category advantage to the RVF have not always been successful (Regier and Xu, 2017) and other failures to replicate have not been published (Brainard, p.c.), challenging the robustness of the findings. Second, it seems obvious that, if the principal effect is true—that between-category colors (as defined by their names) presented to the RVF are judged to be different more quickly than within-category colors—this could easily be a straightforward effect of (covert) naming of the stimuli during the task itself, rather than a permanent change in perceptual discrimination that is caused by a lifetime of naming different categories of colors with different names. Note that Gilbert et al.’s interpretation of their own findings (cited above) suggests that they believe the effects are on the perceptual system (“any influence of language on perceptual discrimination”), yet naming itself—the one variable that is implicitly manipulated by the choice of comparative stimuli—is never raised as a possible causal mechanism. Moreover, Gilbert et al.’s abstract states, “It appears that people view the right (but not the left) half of their visual world through the lens of their native language” (p. 489). Gilbert et al.’s quite broad interpretation of what constitutes a Whorfian effect can be contrasted with that of Kay and Kempton (1984), who also sought to test the Whorfian hypothesis, but concluded that cross-linguistic variation did not result in non-linguistic change in color perception.
As mentioned earlier, Kay and Kempton compared English speakers’ judgments of color similarity in the blue-green range with those of speakers of
Tarahumara, whose language uses a single term for colors in this range. In a first experiment, Kay and Kempton explained their results within the Whorfian framework, specifically suggesting that English speakers ‘stretched’ the distance lying at the green-blue boundary, and ‘shrank’ the distance within each category. That is, they tentatively concluded that language differences resulted in perceptual differences. But importantly, in a second experiment, they ruled out true differences in perceptual discrimination, finding that the critical effects disappeared when their method was changed to prevent a simple strategy of naming the hues in order to make discrimination judgments.2 These findings serve as a compelling and early example of using a task that is thought to be language-free, but in fact, depends on the very language whose effect is being tested. The findings are particularly compelling because Kay and Kempton’s goal was to test the Whorfian hypothesis—that cross-linguistic differences in color naming would lead to differences in perception—but they found along the way that their task had simply invited, perhaps even required, that people use language to solve it. Their final interpretation was to consider their findings plainly inconsistent with the strongest version of Whorf’s hypothesis. Recent research on toddlers’ color naming also argues against the conclusion that naming induces changes in perceptual discrimination of color. Categorical color perception is known to be present among pre-linguistic infants, who of course have no color terms. Behaviorally, infants discriminate across exemplars from adult-defined between-category boundaries (e.g., blue-green) but not exemplars from within a category (e.g., two blues, two greens), even when the between-pair and within-pair differences are of equal physical distance (Bornstein, Kessen, and Weiskopf, 1985). Neurally, the
2 In the first experiment by Kay and Kempton (1984), speakers of English and Tarahumara were shown triads of color patches that span the green-blue boundary in English. For each triad, speakers were asked to judge “which of the 3 is most different from the other 2” (p. 70). English has two distinct labels, one for the hues on each side of the boundary, whereas Tarahumara has just a single term for hues in this range; Kay and Kempton sought to test the hypothesis that English speakers would show a larger subjective difference between these than speakers of Tarahumara, essentially ‘stretching’ the space across the linguistic boundary, and ‘shrinking’ the space within each linguistic grouping. Kay and Kempton did find such an effect, and interpreted this as a Whorfian effect (p. 72). But critically, they also proposed the following ‘speculative’ explanation: Faced with the difficult task of judging which of three quite similar hues is ‘most different,’ participants would reason that one way of solving the task is to use ‘the name strategy’ (p. 72); that is, the one that is called ‘blue’ must be the most different from the two that are called ‘green.’ They argue sensibly that this strategy would not be available to speakers of Tarahumara, who would then not show the effect. In a second experiment, Kay and Kempton altered the method by showing participants each triad with a slider that could cover one chip at each end of the array (the ‘green’ end, the ‘blue’ end) at a time, and asking them to compare the two other chips, judging each pair in terms of degree of greenness between the middle and green-end chip, and degree of blueness between the middle and blue-end chip. They were then asked to make the same judgments as in the first experiment. Kay and Kempton reasoned that the change in method could disable the ‘name strategy’ because participants had been forced to represent the middle chip as both green and blue as they slid the cover back and forth. The results of the judgment task showed that the “Whorfian effect . . . in Experiment 1 disappears,” and that subjective similarity judgments just followed discrimination distance without showing any effects of the linguistic boundaries.
infant brain responds to between-category changes with increases in hemodynamic responses in the bilateral occipitotemporal regions (Yang, Kanazawa, Yamaguchi, and Kuriki, 2016). The maturity of such color categories in infancy raises the question of whether learning color terms changes categorical perception in accord with language-specific categories. Anne Franklin and colleagues (Franklin, Clifford, Williamson, and Davies, 2005) tested categorical perception and color term naming among 2- to 5-year-olds from two different linguistic groups (speakers of English, and the Himba of Namibia, who speak Otjihimba, a Bantu language). They found that the two groups had similar categorical perception, even though one of the categories tested (blue-purple) was not marked in one of the parental languages (Himba), and the strength of the categorical perception effect was not related to differences in color term knowledge among the children (i.e., accuracy in naming and comprehending the terms). Franklin et al. point out that, in other studies on adults, authors have concluded that cross-linguistic differences change perception, categorization, and memory (e.g., Roberson et al., 2000). Although they leave room for the possibility that effects of language on perception may evolve over development, rising to significant effects only in adulthood, they also clearly endorse an interpretation that is consistent with a weaker view on the role of language in color perception and categorization tasks. Specifically, they endorse Munnich and Landau’s (2003) interpretation of such tasks (along with tasks in fundamental domains of knowledge such as time and space, Majid, Bowerman, Kita, Haun, and Levinson, 2004; McDonough, Choi, and Mandler, 2003): that such tasks may invite linguistic encoding, thereby showing effects of cross-linguistic differences on performance—simply because linguistic encoding is the basis for performance. Although the strong view that language creates differences in color perception is ruled out both by data and by careful consideration of methodologies, the question of whether language has effects on color perception and/or categorization still lingers, with continuing pursuit of the hypothesis that differences in color names lead to differences in non-linguistic color perception and categorization (e.g., Winawer, Witthoft, Frank, Wu, Wade, and Boroditsky, 2007; Zhou, Mo, Kay, Kwok, Ip, and Tan, 2010; Athanasopoulos, Damjanovic, Krajciova, and Sasaki, 2011; Zhong, Li, Huang, Li, and Mo, 2018). In considering these studies, it is most critical to ask exactly what kinds of effects are being proposed. Are authors proposing that linguistic differences alter the most basic non-linguistic perception and categorization? Or is the point that using language to solve a perception or categorization task can indeed change the outcome? To decide, we need to ask how scientists choose to define and measure ‘non-linguistic’ color categorization; whether this categorization simply reflects linguistic categorization required by or inherent in the task; and whether non-linguistic categorization has been changed permanently by language, or only occurs with considerable flexibility, perhaps ‘in the moment’ of the task (see further discussion in Section 12.3). Beyond these unsolved questions, the literature is now seeing wholly new perspectives on the broader issues at play.
As one example, there are growing attempts to place discussion of the relationship between color naming and non-linguistic color categorization or perception in a broader framework which draws on general principles
of probabilistic inference to explain the relationships between the two (Cibelli, Xu, Austerweil, Griffiths, and Regier, 2016; Regier and Xu, 2017). This is a welcome change in emphasis—from strong and permanent effects of language on perception and cognition to flexible, task-dependent effects in the domain of color—and will likely encourage a useful (if only temporary) retrenchment from strong Whorfian claims in this arena, which have not convincingly explained existing empirical findings.
12.2.1.2 Spatial terms and spatial cognition
This section would not be complete without mentioning the literature on spatial representation, which includes several widely cited findings claiming to show that cross-linguistic variation in spatial terms leads to changes in people’s non-linguistic representation of space. One widely cited claim is based on variation in the linguistic encoding of spatial relationships between objects. Seminal work by Bowerman and Choi (Bowerman, 1996; Choi and Bowerman, 1991) revealed that simple spatial events of ‘joining’ and ‘separating’ are encoded somewhat differently across languages, highlighting different aspects of the same events. Thus, for example, adult and child Korean speakers encode events of joining that end in a ‘loose-fit’ relationship (e.g., ball ‘in’ box, loose ring ‘on’ pole) with the verb ‘nehta,’ and encode joining events that end in a ‘tight-fit’ relationship (e.g., earplug ‘in’ ear, top ‘on’ pen) with the verb ‘kkita.’ By contrast, English adult and child speakers encode these same events using the preposition ‘in’ vs. ‘on,’ with these terms essentially “indifferent to whether the figure fits the container tightly or loosely” (Bowerman, 1996, pp. 406–407). Although Bowerman’s goal was to highlight children’s early sensitivity to the ways that their language encoded the very same configurations, some researchers have inferred that these cross-linguistic differences may also lead to differences in the perceptual (i.e., non-linguistic) encoding of spatial relationships. For example, Mandler and colleagues (McDonough, Choi, and Mandler, 2003) argued that the ‘tight/loose-fit’ distinction in the encoding of simple spatial events, if used over a lifetime, could lead to significant differences in adults’ non-linguistic representations. They posited that adult speakers of Korean would attend more to the ‘tight’ vs. ‘loose’ fit distinction than adult speakers of English. They used a categorization task and examined adults’ linguistic descriptions to test this possibility, concluding that Korean speakers were more sensitive to the tight/loose-fit distinction than English speakers were. Several pieces of evidence argue against their conclusion, however. First, as with Kay and Kempton’s (1984) task, it seems likely that the task used by McDonough et al. (2003) may have invited a linguistic solution. That is, when participants were asked to choose the outlier in a set of relationships that varied on tight/loose fit, they may have used their most accessible native language encodings to solve the problem, for example, deciding on the outlier by comparing their own linguistic descriptions of the different configurations. If English speakers are likely to encode the configurations using ‘in’ vs. ‘on,’ then this could easily invite them to select the odd one out on this basis (i.e., whether it is ‘in’ or ‘on’); similarly, Korean speakers who encoded the configurations using verbs that specified tight vs. loose end-states could select the odd one out on that basis. Second, Norbury, Waxman, and Song (2008) gave adult native English and Korean speakers a task in which they were first familiarized over six trials to either a tight-fit or a loose-fit event,
and then asked to rate the similarity of a new tight-fit and loose-fit configuration to the familiarized event. Both language groups showed the same sensitivity to tight vs. loose fit. When familiarized to a tight-fit relationship, both groups gave higher similarity ratings to other tight-fit configurations; when familiarized to a loose-fit relationship, the ratings were no different for loose vs. tight-fit configurations. These findings suggest that ‘degree of fit’ does not represent a single dimension within which languages can distinguish ‘tight’ vs. ‘loose.’3 More importantly, the lack of difference between English and Korean speakers in Norbury et al.’s study shows that, at least for adults, both groups are equally sensitive to non-linguistic relationships of tight vs. loose fit. Another aspect of spatial representation relevant to the language-thought debate has been the use of different reference systems to encode the locations of objects in arrays, both large and small. Languages vary in the specific reference system that is most often used to encode object location. English, for example, uses terms such as ‘left/right’ to encode object locations in small arrays (using an ‘egocentric’ reference system) but ‘north/south/east/west’ in larger arrays (using a geocentric, or absolute or cardinal system); Dutch uses a similar system. Other linguistic communities, such as speakers of Tseltal (spoken by the Tenejapan Mayans), do have words that encode something similar to English ‘left’ and ‘right,’ but these terms have rather limited uses and are not used to encode locations around the body (Brown and Levinson, 1993). By contrast, speakers of Tseltal have a robust system of geocentric terms, and these are used to encode object locations in both small and large-scale arrays. Brown and Levinson (1993; see also Levinson, 2003) asked whether these cross-linguistic differences would affect (presumably non-linguistic) spatial reasoning. They presented groups of Tseltal and Dutch speakers with a tabletop array of objects arranged from end to end on the table, effectively to the person’s left, right, and in front of them. They then asked people to move to a new table located so that locations to one’s left were now to the right (and vice versa), but locations to the north/south remained the same as for the original table. Speakers were handed the same toys, and asked to make the array the “same” as it was before they moved. Results showed that Dutch speakers arranged the objects in terms of their own (egocentric) body reference system, that is, preserving the objects’ locations relative to their body (left is left, right is right). By contrast, the Tenejapans arranged the objects using a geocentric system, maintaining the objects’ locations as ‘north/south.’
3 The idea that differential non-linguistic sensitivity can follow from a lifetime of ‘habitual use’ of a language that encodes the tight vs. loose-fit distinction assumes that Korean, but not English (for example), systematically encodes the distinction. We have recently suggested that the linguistic facts are not so simple (Landau, Gürcanli, and Wilson, 2019). We asked native adult and child English speakers to describe events that were designed to vary whether they culminate in tight vs. loose-fit end states, and found that adults distinguished linguistically among these events, especially when the event of ‘joining’ or ‘separating’ involved two like elements coming into a tight-fit configuration through symmetrical motion.
Adults distinguished these events by using both unique verbs and prepositions (e.g., ‘fit together’ vs. ‘put on’ for tight vs. loose fit); and both adults and children distinguished these events through variation in the syntactic frame used for the verb (e.g., ‘the parts were joined’ for tight-fit vs. ‘one part went on another’ for loose fit). Clearly, English speakers are sensitive to the tight vs. loose-fit distinction, and use a variety of linguistic devices—different from those used in Korean—to mark the distinction.
As with many other cases examining the possible effects of cross-linguistic differences on non-linguistic representation, one can question whether the effects shown in this study were truly non-linguistic. Li and Gleitman (2002) suggested that these results do not prove that linguistic marking forces changes in people’s non-linguistic spatial representations; rather, the linguistic markings themselves could evolve due to geographical differences in the environment, which could themselves lead to differences in the prominence of use of terms that are organized around egocentric and/or geocentric reference systems. Because English speakers have sets of terms that mark location in both systems (i.e., ‘left/right’ and ‘north/south/east/west’), it should be possible to induce uses of different codings by altering the task to emphasize available environmental cues. Using the same basic search method as Brown and Levinson (1993), they also varied the presence or absence of large, external landmarks in the testing space. They found that when large landmarks were visible, people tended to re-locate the objects using an absolute (‘north/south’) system, but when they were unavailable, they used an egocentric (‘left/right’) system. This study and others carried out by Li, Gleitman, and colleagues show that people’s use of one vs. another reference system in spatial tasks like these is unlikely to reflect permanent changes in the kind of reference system that is used for object location (see, e.g., Li, Abarbanell, Gleitman, and Papafragou, 2011). Indeed, effects such as the ones shown by Li and Gleitman (2002), in which people use different reference systems for search depending on the experimental context, have long been recognized to be a hallmark property of spatial search in humans and other species. For example, infants use both egocentric and non-egocentric reference systems to find hidden toys (Acredolo and Evans, 1980). Given this, it seems likely that our universal non-linguistic spatial representations include multiple reference systems selected on different occasions for different tasks. Given that different languages opt to mark location using more than one reference system, it also seems likely that tasks in which people are asked to copy an array might invite linguistic coding, for example, placing objects north to south or left to right, depending on which linguistic terms (reflecting the chosen reference system) are used. The upshot is that if language is used to solve the problem, then we have an effect of linguistic coding on solving the problem, but that does not entail that life-long experience with a given language will permanently change the non-linguistic nature of one’s spatial representations.
12.2.2 V2: Learning any language causes a radical transformation of thought, as the formal computational properties of language enable and support wholly new kinds of representations
One obvious function of language is communication, that is, the conversion of our thoughts into an externalized signal by which we can transmit information to others, who decode this signal. Without such a function, there would arguably be no explicit transmission
of information across generations, limiting the growth of cultural artifacts ranging from formal theories of number to music, cooking, team sports, and the like. Another function, equally significant, is internal communication—the mind’s capacity to carry out computations that draw on the information represented within and across different domains, and perhaps to build qualitatively different kinds of representation than would exist without language. Version 2 of the hypothesis that language affects thought proposes that language enables new kinds of cognition because it is a powerful computational system whose formal properties serve as the means by which humans come to represent wholly new kinds of thought. The specific properties in question vary in different proposals, as will be discussed, but the general idea is that, although infants possess rich conceptual knowledge in many domains, this knowledge is in some ways limited; learning a language enables infants to represent their knowledge in new and more powerful ways that are specifically related to properties characterizing language. Proposals of this kind have been made in three different domains: spatial representation, representation of number, and Theory of Mind.
12.2.2.1 Spatial representation (with a focus on navigation)
One particularly interesting proposal on the role of language in conferring new representational power draws on the now well-known case of spatial reorientation. Although the study of spatial navigation in animals and humans historically examined the mechanisms underlying oriented navigation, a new phenomenon was discovered in the early 1980s that led to powerful insights about how all animals navigate when they become disoriented in space. Cheng and Gallistel (Cheng, 1986; Gallistel, 1990) first reported that when rats were shown the location of food in a simple enclosed rectangular chamber, then were disoriented and returned to a second, identical chamber, and allowed to search for the food, they frequently searched at the correct corner of the chamber, but also at the geometrically equivalent corner, diagonally opposite to the correct one.4 Although there were clear markings at the individual corners of the chamber—which could in principle serve as beacons or landmarks—disoriented rats ignored these in favor of the geometric (rectangular) structure of the chamber, the configuration of walls, and their intersections. This pattern of responding led to the hypothesis that, for the purposes of reorienting in space, the rat represents the environment in terms of its geometric ‘shape’ in modular fashion, that is, impervious to other properties of the layout. This now well-known phenomenon has been replicated many times, and in many species, including chicks, fish, rhesus monkeys, and birds (for review, see Cheng and Newcombe, 2005; Landau and Lakusta, 2009).
4 Disorientation occurred because the rats had no access to external room cues during their initial exposure and the chamber was rotated over trials, removing any external cues that could have allowed them to remain oriented in space.
Humans are also sensitive to the geometric shape of the environment, but there is a key developmental difference. When 18–24-month-old toddlers are tested in a small rectangular chamber, with or without clear markings on one wall, they show the same pattern of responding as rats, using the geometrical structure of the chamber, but not its surface markings, to reorient themselves, erring by searching at the correct corner and its geometric equivalent (Hermer and Spelke, 1996). When 5–7-year-olds and adults are tested, they also show the pattern of geometric responding when there are no surface markings in the chamber; but when one wall is marked (by being a different color than the other three), they search at the correct corner, and not its geometric equivalent (Hermer-Vazquez, Moffet, and Munkholm, 2001). Spelke (2003) proposed that the difference between rats and toddlers vs. older children and adults is that the latter two groups have language: Given its compositional nature, and its neutrality with respect to domains of knowledge (i.e., the rules of linguistic composition apply regardless of domain), language provides a representational format that enables combination of the outputs of two different encapsulated modules. In the case of reorientation, combining the information from the geometric module and from the module that represents surface properties such as colors or other landmarks can lead to a unified representation in which ‘left of the blue wall’ is representable. This kind of representation would distinguish between the two geometrically equivalent corners that are otherwise indistinguishable, allowing a person to encode the unique correct corner, and use it to find the hidden object. Language, by virtue of its combinatorial format, is especially and uniquely poised to enable that kind of combination. This hypothesis makes a bold and important claim, one that has been endorsed by both linguists and philosophers. For example, Berwick and Chomsky (2017) say: “Why do humans have language at all?... language is the lingua franca that binds together the different representations from geometric and non-geometric modules, just as an ‘inner mental tool’ should” (pp. 164–165). And Carruthers (2002) says: “Hermer-Vazquez et al. provides strong evidence that the integration of geometric properties with other sorts of information (color, smell, patterning, etc.) is dependent on natural language” (p. 667). The logic of the proposal is clear, although the underlying mechanism is hard to test empirically. The empirical data that do exist to support this view are only modest at this time. For example, Hermer-Vazquez et al. (2001) tested 5–7-year-olds on the reorientation task along with tests of digit span, spatial memory span, and comprehension and production tests of ‘left/right’ and other spatial terms. Presumably, comprehension and production of ‘left/right’ could serve as behavioral indicators that the child’s linguistic system represents the relevant information for solving the reorientation problem; if so, reorientation performance should be positively correlated with the child’s knowledge of those terms. The results showed that only production of the terms ‘left’ and ‘right’ was correlated with correct search during the reorientation task. Hermer-Vazquez et al.
concluded that this was consistent with a causal role for language in constructing novel representations rapidly, moving older children and adults beyond the confines of other, non-linguistic species. Pyers, Shusterman, Senghas, Spelke, and
Emmorey (2010) make a similar argument. They studied two cohorts of deaf signers growing up in Nicaragua, who had acquired a newly emerging sign language during different time periods in the evolution of the sign language. The first cohort to develop the language had no models to learn from, and developed a fairly limited sign language; the second cohort then learned their sign language as they interacted with members of the first cohort, using them as a model, and acquiring a more complex language. Second-cohort signers used more complex spatial language, and performed better than members of the first cohort on spatial tasks, including reorientation. And second-cohort signers who used more consistent marking of ‘left’ and ‘right’ showed better reorientation performance, suggesting to the authors that language learning can modulate aspects of spatial representation. These findings are suggestive, but remain limited. First, the assumption that other species cannot combine the outputs of a geometric and non-geometric module has been disproved; many species can and do combine information from geometry and landmarks (see Cheng and Newcombe, 2005, for review). Second, there are several empirical limitations. Why should the production of ‘left/right’, but not comprehension, be correlated with successful reorientation? Relatedly, although Hermer-Vazquez, Spelke, and Katsnelson (1999) found that successful reorientation was disrupted under verbal shadowing in adults, attempts to replicate this finding have not succeeded, with some suggesting biases stemming from the specific instructions given to adults in the experiment (Ratliff and Newcombe, 2008). Finally, the finding that success in reorientation is related to stable development of spatial terms (e.g., ‘left/right’; Hermer-Vazquez et al., 2001; Pyers et al., 2010) has at least one alternative explanation. This will be discussed in Section 12.2.3, as part of the momentary recoding hypothesis.
12.2.2.2 Number
Carey (2009) has proposed that language plays a causal role in the creation of the integer system. In her proposal, she assumes that three different number-relevant systems are innate: the approximate number system, the small number system of individuating objects, and the system of natural language quantifiers (e.g., a, some, etc.). Evidence in favor of an innate approximate number system is substantial: Infants discriminate large numerosities, and this is ratio-dependent, a signature property of this system shown across species (Feigenson, Dehaene, and Spelke, 2004). Six-month-olds discriminate object sets with ratios of 1:2, 9-month-olds 2:3, and older children and adults show increased sensitivity, ending with adult sensitivity to ratios of 9:10 (Halberda and Feigenson, 2008). Infants also possess a second system that represents numerosity implicitly; it allows them to individuate small sets of objects, up to three individuals, but not four or more (Feigenson and Carey, 2005). The system of natural language quantifiers allows toddlers to represent one vs. more than one. Carey argues that none of these three systems can support the full representation of integers; hence infants and young children do not have available to them the capacity to represent exact numbers beyond the small set individuation system, which is limited to three.
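The ratio-dependence just described can be summarized schematically. The formulation below is a standard Weber-ratio rendering of the findings cited above rather than a formula proposed by these authors, and the particular numerosities in the illustration are hypothetical:

\[
n_1 < n_2 \ \text{are discriminated only if}\ \ \frac{n_2}{n_1} \ge r, \qquad r \approx 2 \ (6\ \text{months}), \quad r \approx \tfrac{3}{2} \ (9\ \text{months}), \quad r \approx \tfrac{10}{9} \ (\text{adults}).
\]

On this summary, a 6-month-old should, for example, distinguish 8 from 16 items (a 1:2 ratio) but not 8 from 12 (a 2:3 ratio), whereas an adult should distinguish even 9 from 10.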
How then do we get to represent exact quantities beyond the limits of these systems? Carey argues that the causal mechanism is language, specifically, the surprisingly slow but sure acquisition by children of the count list. Although children can recite the count words at an early age, they only come to map these terms onto numerosities over a period of years, first learning what ‘one’ means, then ‘two,’ then ‘three’ and ‘four’ (Wynn, 1992). At this point, the number words whose meanings are truly known can be mapped together with small sets in the individuation system. This leads to an epiphany, whereby children recognize that the count system embodies the successor function (one more can always be added to the set, to yield a new numerosity that is exactly one more)—which in turn is the key to the integer system. As an example, the child who can set up multiple sets of objects of small values (one car, two cars, three cars) and who knows the meanings of number words for these sets can map the two to each other. The count list, which continues beyond these small values, then drives the insight that there is a successor function, and this insight creates the new system of integers. A related hypothesis that bears a clear relationship to Carey’s proposal has been offered by Charles Yang (2018), who also argues that language—and the count system in particular—provides the representational format that enables children’s insight into the successor function. Yang adds an interesting perspective, relying on the cross-linguistic variation in the morpho-phonological forms that constitute the count list far beyond the number five. Specifically, he proposes that children are likely to gain the successor function insight earlier if they are learning languages that express numbers beyond ten with systematic morphology; that is, clarity of the mapping between form and meaning helps the learner. He argues that it is not surprising that children learning English have difficulty learning the first ten number names, as each is simply an idiosyncratic form. (Parenthetically, this is much like color terms, which also prove problematic for young children as they try to map the terms onto hues.) Further, Yang argues that children will not arrive at a ‘rule’ that embodies the successor function in the linguistic encoding of number until they have sufficient evidence that there should be one. Using his ‘Tolerance Principle,’ Yang argues that children will need to have ‘enough’ evidence of regularity before they will form a rule that reflects the underlying structure of the number naming system (see Yang, 2021, this volume, for further discussion of the Tolerance Principle). Because different languages embody different levels of transparency for their number naming conventions, children learning different languages will infer the rule at different ages; this is empirically supported using data from children learning Cantonese vs. English, with the former requiring mastery of fewer count names before generalizing (Yang, Lei, and Lee, 2019; see also Miller, Kelly, and Zhou, 2005). Note that, unlike the classic Whorfian hypothesis, Yang does not argue that learning the number system changes the underlying non-linguistic system; rather, his hypothesis simply suggests that children learning language x will come to infer the successor function earlier than those learning language y. But everyone gets there in the end.
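To make the role of the Tolerance Principle in this argument concrete, it can be stated schematically as follows. This is a summary of the formulation Yang has given elsewhere, not a statement quoted from the works cited above, and the numerical illustration is hypothetical:

\[
\text{A rule over } N \text{ items is productive only if its exceptions } e \text{ satisfy } e \le \theta_N = \frac{N}{\ln N}.
\]

For illustration, a counting rule evaluated over the first N = 20 number words of a language could tolerate at most 20/ln 20 ≈ 6.7, that is, six or seven, idiosyncratic forms. A language whose number words in that range are mostly transparent compounds stays under the threshold, so the rule (and with it the successor insight) can be posited early; a language with many opaque forms in the same range exceeds it, and the child must master more of the count list before the rule becomes productive.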
Carey’s hypothesis that language plays a necessary and causal role in the creation of the integer system is clearly a very strong one. One of its strongest
predictions is that individuals who do not have a linguistic system that enumerates exact numerosities beyond the values of the small set system should be incapable of representing integers. Here, the evidence is at least equivocal with respect to the hypothesis and may, in fact, lean counter to it. One relevant line of research concerns the numerical abilities of the Mundurukú, an Amazonian culture whose language lacks explicit linguistic terms for numbers greater than five. Pica, Lemer, Izard, and Dehaene (2004) showed that these individuals could compare and add large approximate numbers that lie outside of their naming range, but failed in tasks requiring computation of exact arithmetic that involved quantities larger than four or five. This suggests, at most, that it may be difficult to carry out large exact computations without a linguistic number system; the question remains whether large exact numbers can be represented at all in the absence of language. A second line of research has been carried out with home-signers from Nicaragua, who had never been exposed to linguistic input, but live in a numerate society and have developed their own home-sign language (Spaepen, Coppola, Spelke, Carey, and Goldin-Meadow, 2011). These home-signers can represent exact cardinal numbers, signing accurately to identify the numerosities in relatively small arrays, specifically, arrays with one, two, or three items. However, as the numerosities in an array increase, the home-signers show decreasing precision in accurately representing these numerosities through gesture; they also appear to show increasing variability with increasing numerosity, echoing the signature of the approximate number system. This evidence suggests that input from a conventional language system may provide a crucial ‘ready-made’ model for explicitly representing large exact numbers. As with the evidence from the Mundurukú, this suggests some role for linguistic input in representing and computing large exact numbers, but at this point, the stronger conclusion—that representing these numerosities is impossible without language—remains unproven. One possibility is that language serves as a kind of “cognitive technology” that facilitates precise communication of large exact numbers, but does not fundamentally alter the underlying numerical representations (e.g., Frank, Everett, Fedorenko, and Gibson, 2008). This is consistent with Version 3 of the language-thought hypothesis (Section 12.2.3).
12.2.2.3 Theory of Mind
An individual has a ‘Theory of Mind’ (TOM) if he or she can impute mental states—such as believing, thinking, guessing, doubting—to others, as well as to themselves (Premack and Woodruff, 1978). Abundant literature over the past 40 years has focused on whether and when young children develop a TOM, with early findings suggesting that it is not until about age 4 that children show convincing evidence of being able to reason about the contents of another person’s mind. Many of these studies have focused on the classic false belief task, in which a young child observes a scenario like the following. A person (Sally) places an object (e.g., a piece of candy) in one location (e.g., in a drawer), leaves the room, and another person (Ann) comes in and moves the object to another hidden location (e.g., in a basket; see, e.g., Wimmer and Perner, 1983); then the first
person (Sally) returns, and the child is asked, “Where will Sally look for the candy?” The child has seen that the candy is now hidden in the basket, but Sally has only seen it placed in the drawer. Three-year-olds tend to say that Sally will look in the basket (where it now is), whereas 4-year-olds will say that she will look in the drawer (where she has seen it placed). The 4-year-olds thus understand that their own state of knowledge is different from Sally’s, and that Sally will respond solely according to what she herself knows, which is a false belief. Numerous replications of the original false belief task have been carried out with much the same findings (see deVilliers and deVilliers, 2014, for review). The question is, what changes allow the 4-year-old (but not the 3-year-old) to answer the question correctly? One answer is that even much younger children can understand the mental states of others, but the complexity of the original task masks their ability to express their knowledge, severely limiting their ability to correctly answer the classic “Where will she look?” question. Indeed, recent evidence from Baillargeon and colleagues has convincingly shown that 15-month-old infants understand that others can have false beliefs. For example, Onishi and Baillargeon (2005) showed infants scenarios modeled after the classic experiment. In one condition, infants viewed a protagonist who did not see the object moved (and hence had a false belief when she returned to the room); in the control condition, they viewed a protagonist who did see the object being moved (and so had a true belief). Infants looked longer when the protagonist searched for the object in the wrong location (relative to what the protagonist had seen), suggesting that they knew where the protagonist should be thinking the object was. In a later study by Baillargeon, Scott, and He (2010), the classic method was altered to reduce other task demands, and 2-year-olds succeeded. These studies leave little doubt that infants and toddlers possess at least an implicit understanding of the foundations of TOM—representing another person’s mental state, even if it contains a false belief. However, there remains the question of why 3-year-olds persistently fail the classic TOM task, which requires explicit responses to questions such as “Where will Sally look?” Jill deVilliers (2007; deVilliers and deVilliers, 2014) has proposed that solving the classic TOM problem requires mastery of specific linguistic structures—syntactic complementation—which then serves as a tool by which children can form and manipulate representations of the mental states of others, both true and false. This hypothesis ascribes a causal role to language, predicting that without mastery of syntactic complements, children will be unable to skillfully manipulate representations of others’ mental states. deVilliers conjectures, “ . . . at some critical point, fixing the language structures involved in complementation seems to be inextricably tied to ways of reasoning about other minds. False belief understanding seems to be inextricably tied to certain language prerequisites” (2007, p. 1873). More specifically, deVilliers’ hypothesis rests on the idea that the form of certain complement structures transparently relates to the relationship between an individual’s mental state and the propositional contents that are represented. As an example, she
256 Barbara Landau points out that one can truthfully state, “John thought that he was stung by a wasp” even if the proposition in the lower clause (‘stung by a wasp’) is false. This format, she suggests is an ideal representation of the separate components, one’s own (true or false) representation of some state of affairs, and the truth or falsity of the actual state of affairs. Sentential complements such as these are common for mental state verbs (‘John thought/believed/knew/guessed that he was stung by a wasp’) and although they are also combined with other classes of verbs (e.g., perception, communication verbs), they are particularly common for mental state verbs in the input to young children (Davis and Landau, 2019). deVilliers proposes that mastery of these forms allows the child to clearly represent the content of an individual’s mental state, which itself can be evaluated for truth or falsehood. deVilliers further argues that mastery of these structures plays a causal role in the development of TOM, by providing a format that enables the child to think about the mental states of others. This view predicts that mastery of the structures should be strongly correlated with success on classic false belief tasks, and deVilliers offers several pieces of evidence that are consistent with this hypothesis. For example, in a longitudinal study, deVilliers and Pyers (2002) found that at least one measure of complementation mastery predicted preschoolers’ later performance on the false belief task. In addition, this relationship between mastery of complementation and success on the classic TOM task held for deaf children with delayed language (Schick, deVilliers, deVilliers, and Hoffmeister, 2007). Finally, Nicaraguan signers who developed a new sign language (and had no explicit models for a fully formed language) showed delay in their understanding of false belief, but signers who further developed the sign language showed more success (Pyers and Senghas, 2009). deVilliers posits that mastery of complementation is essential for the early development of false belief understanding, and that language even plays a role later, among adults, whose performance on the false belief task suffers under verbal shadowing, but not rhythmic shadowing (Newton and deVilliers, 2007). One outstanding question is whether complement structures are necessary for the formation of TOM, or whether they are an ideal form that learners can use to encode and manipulate the expression of one’s mental state relative to a given proposition. deVilliers seems to lean toward the idea that language is a necessary condition for TOM. That is, to encode and transmit a thought such as “John thought that Ed believed that Sarah knew she was stung by a wasp” may actually require language as a vehicle of representation in order to compute its truth or falsehood. This strong view would be consistent with Version 2 (V2) of the language-thought hypothesis. Certainly, the empirical evidence that deVilliers offers does hint that linguistic structure may be an important vehicle for carrying nested propositions, thereby allowing for the efficient encoding of lengthy chains of embedding and inference. But given the growing evidence that infants know when an agent has a true vs. false belief, the idea that TOM requires language seems too strong. An alternative hypothesis— Version 3, discussed next—may be able to accommodate both deVilliers’ findings and the infancy work.
12.2.3 V3: Having a language supports momentary recoding, but does not change underlying non- linguistic representations. Proponents of Versions 1 and 2 of the language-thought hypothesis argue that learning a language causes radical changes to our thought. In Version 1, the changes occur to our non-linguistic representations; whatever these were, prior to learning a language, they change into what the language chooses to encode, and these changes in non- linguistic representation are permanent. In Version 2, the changes that occur result in the creation of new kinds of representation, ones which could not have formed had it not been for learning a language; these new kinds of representation may exist alongside of earlier non-linguistic representations. Version 3 proposes that language has profound effects on human cognition, not because it permanently changes our non-linguistic representations, nor because it is necessary for the creation of wholly new kinds of representation, but because it enables us to continually recode our experience in the moment, using a rich linguistic formalism. This recoding bears the imprint of linguistic structure, but leaves untouched the pre-existing non-linguistic representations that may have been used in any given task. As evidence for this kind of effect, I turn to a case study of the interaction between language and the visual system. The case concerns a well-known phenomenon in vision: When people carry out visual search tasks, they find it fast and easy to search for a single feature of the elements in the array (say, color x or shape y), but difficult to search for a conjunction of features (that is, an element that has color x and shape y). So, if I ask you to search for a ‘red L’ in a sea of ‘green Ls,’ the red L will ‘pop out’; but if I ask you to search for the same red L in a sea of red O’s and green L’s, it will be difficult, and you will find yourself searching item by item, with reaction time increasing over set size. Treisman and Gelade (1980) proposed that the latter search process requires attention, moving one’s visual search systematically over the array of elements one by one. This type of search—primarily for feature conjunctions—can be impaired in patients with visual attention disorder, who show deficits in conjunction, but not feature search (Arguin, Cavanagh, and Joanette, 1994). Moreover, young children between the ages of 4 and 6 also have difficulty when asked to match, say, a square that is divided vertically, with red on the left and green on the right (Hoffman, Landau, and Pagani, 2003), systematically erring in choosing the target’s reflection, that is, a square that has the red on the right and the green on the left (Dessalegn and Landau, 2008, 2013). Such errors, attested among young children as well as adults, suggest that the visual system shows a certain fragility in binding color to its proper location on an object, and retaining it over time. Is it possible for language to strengthen this fragile aspect of visual representation? Dessalegn and Landau (2008, 2013) carried out a line of experiments to answer this question, focusing on young children, who might be in the process of coming to use language to buttress otherwise fragile visual representations. In our experiments, we showed children a square, divided in half vertically, horizontally, or diagonally, with one
258 Barbara Landau half red and the other green. In a first experiment, we said to the children, “Look at this!”, let them do so, then removed it for one second, and brought up three alternatives, asking them “Which one is exactly the same as the one you just saw?” The alternatives included the identical square, its mirror image, or one with a different geometric division (e.g., a diagonally split square instead of a vertically or horizontally split one). Children chose the identical square more often than chance (around 65%), but their errors were almost exclusively the mirror image of the target, showing that they had difficulty representing and/or storing the correct color/location combination, and thereby establishing a basic fragility in their visual representation or memory. We then carried out several manipulations, each testing a different group of children, varying the non-linguistic mode of presentation while using the original instructions (“Look at this!”). In one manipulation, we made the target square more salient, by growing or shrinking it as we presented it; in another, we flashed the entire square or just the red half on and off as we presented it; in a third, we asked children to point to the square. None of these manipulations changed performance. We then moved to changing the form of the linguistic instructions as we presented the target. First, we gave the object a name: “Look! It’s a dax,” assuming that naming might draw attention to the visual details of the square. This had no effect on performance. Then, we told children where the red part was, relative to the green part: “Look! The red is to the left of the green.” This improved performance by approximately 20%, resulting in around 80% correct. We also instructed them using spatial terms that were more general than ‘left/ right,’ for example, “Look! The red is touching the green!” Performance returned to around 60% correct. After children were tested on the red/green manipulation, we tested them on their comprehension and production of the spatial relationships encoded by terms ‘top/ bottom/left/right,’ which had been used to describe the stimuli. Children were at ceiling for ‘top/bottom’ but considerably poorer for ‘left/right,’ with some children recognizing that these terms mapped onto the horizontal axis of the squares, but showing directional confusion, that is, which was ‘left’ vs. ‘right.’ We asked whether individual performance on these ‘left/right’ tests was correlated with performance on the matching task, but there were no reliable correlations, indicating that children were not encoding the spatial relationship between the colors and their locations on the target stimulus using complete and accurate long-term representations of which was ‘left’ and which was ‘right.’ This suggests that the improvement in performance when children were told, “the red is to the left of green” likely had a quite momentary effect. In essence, this linguistic instruction pointed the children to a new, enriched representation of the target square on its first presentation, establishing the location of the ‘figure’ (i.e., the red part) relative to the ‘ground’ (i.e., the green part). Using this enrichment tool helped children store the stimulus with more precision over the one-second delay, resulting in improved matching performance. 
We further asked exactly what kind of linguistic information would help children distinguish the two parts of the target square as 'figure' (the red part) and 'ground' (the green part), thereby encoding the full square as having two parts, asymmetrically related to each other. We
Language and thought 259 identified two kinds of linguistic information. The first is the hierarchical location of the spatial term in its syntactic context. Saying ‘red is left of green’ places ‘red’ in the subject position, higher in the syntactic tree than ‘green,’ which is the object of the preposition. Specification of this hierarchical relationship in the sentence alters the prominence of the two entities involved, as discussed by Talmy (1983), Rosch (1975), and experimentally shown by Gleitman, Gleitman, Miller, and Ostrin (1996; for review, see Landau and Gleitman, 2015). For example, Gleitman et al. (1996) found that people judged nonce nouns occupying these two positions in the syntactic tree to have different properties. People who were told, “The zum met the dax” were likely to rate the ‘dax’ as older, bigger, more famous, and more important than the ‘zum’—properties that resonate with the properties commonly ascribed to ground objects (Talmy, 1983). Similarly, hearing “The red is (spatial term) the green” might immediately invite children to establish an asymmetric spatial relationship between figure (red) and ground (green). The second kind of information is lexical—that is, the directional content of the terms ‘left/right,’ compared to, say, the term ‘touching,’ which may or may not represent an asymmetrical spatial relationship, but which did not lead to better performance.5 The upshot of this analysis pointed us to a powerful but highly counterintuitive hypothesis, tested in Dessalegn and Landau (2013): If children were temporarily storing linguistic information that represented an abstract, asymmetric relationship with ‘red’ higher in the hierarchical structure than ‘green,’ then in principle they might also be able to use other asymmetric terms in the same syntactic contexts—even terms that have no spatial content. We tested this hypothesis by introducing 4-year-olds to the same task, but this time, using non-spatial terms that also represent asymmetric relationships. For example, we told them “Look! The red is prettier/nicer/brighter than the green!” Children’s performance again improved compared to the other conditions, to approximately the same degree as when hearing ‘red is left of green.’ This effect held for 4-year-olds, but not for 3-or 6-year-olds. Three year-olds’ performance did not vary across different linguistic instructions, indicating that they could not use the variation in language to support memory. By contrast, 6 year-olds’ performance was close to ceiling across all conditions, suggesting to us that they were automatically recoding the stimulus as an asymmetrical entity, whether given special linguistic instructions or not. Our conclusions from this line of research were that language plays a vital role in modulating fragile non-linguistic representations (in this case, outputs of the visual system), and that these effects are momentary, that they engage lexical and syntactic information, and that they are highly abstract. We also concluded that these effects of linguistically recoding the visual stimuli are quite powerful, conferring an enriched representation of the stimuli in the child’s mental representation. 
This enrichment can be characterized as a hybrid representation, containing both the visual representation of the square and its linguistic 'tags,' which specify figure vs. ground, and the directional relationship between the two parts of the square. This in turn supports a biased representation of the target square in which the asymmetric figure-ground relationship between its (red/green) parts is strongly established, leading to improved performance in matching the target over the one-second delay in our task.
[Footnote 5: Spatial terms are typically (though not always) asymmetrical; for example, if red is left of green, then it cannot be that green is left of red (and similarly for all of the spatial terms that led to improved performance, that is, 'left/right/top/bottom'). By comparison, 'touching' is logically symmetrical in its stative form; if red touches green then it follows that green touches red.]
The conclusion that such effects are powerful is in some ways not new. Over half a century ago, George Miller (1956) argued that linguistic recoding was a powerful tool, used by us constantly:

. . . if you think of this [verbal recoding] merely as a mnemonic trick for extending the memory span, you will miss the more important point that is implicit in nearly all such mnemonic devices. The point is that recoding is an extremely powerful weapon for increasing the amount of information that we can deal with. In one form or another we use recoding constantly in our daily behavior. In my opinion the most customary kind of recoding that we do all the time is to translate into a verbal code. (Miller, 1956, pp. 94–95)
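One of Miller's own illustrations of recoding involved regrouping strings of binary digits into chunks drawn from a larger alphabet. The short sketch below is a toy illustration of that point, written for this discussion rather than drawn from Miller's paper or from the studies reviewed here; the digit string is invented, and the grouping size of three is simply the standard binary-to-octal mapping.

```python
# Toy illustration of verbal recoding in the spirit of Miller (1956):
# relabeling groups of three binary digits as single octal "chunks" cuts the
# number of items to be held in memory by a factor of three, with no loss
# of information. The digit string below is invented for the example.

def recode_binary_to_octal(bits: str, group: int = 3) -> str:
    """Relabel successive groups of `group` binary digits as octal digits."""
    assert len(bits) % group == 0, "string must divide evenly into groups"
    chunks = [bits[i:i + group] for i in range(0, len(bits), group)]
    return "".join(str(int(chunk, 2)) for chunk in chunks)

bits = "101000100111001110"           # 18 items to remember
octal = recode_binary_to_octal(bits)  # 6 items carrying the same information

print(f"{len(bits)} binary digits: {bits}")
print(f"{len(octal)} octal chunks:  {octal}")
```

The recoded string is shorter only in number of items, not in information; that is exactly the sense in which verbal recoding extends what can be held and manipulated.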
The momentary recoding hypothesis accounts for more than our own findings; in fact, it has widespread explanatory value and can account for several phenomena discussed earlier in this chapter, under either V1 or V2. First, V1 was discussed largely in the context of research on the relationship between color names and color discrimination. Under V1, the strong Whorfian claim would posit that linguistic differences lead to differences in non-linguistic perception or categorization. However, many findings in this literature make clear that variation in color naming across cultures leads to differences in judgments of color similarity only ‘in the moment’ of the task, consistent with V3; as Kay and Kempton (1984) put it, any differences in the latter were likely due to people’s use of the ‘name strategy,’ that is, the re-encoding of the observed hue in terms of lexical labels. The findings of Gilbert et al. (2006) can be interpreted in the same way— as a momentary tagging of a hue with a linguistic term, which then influences performance in the moment of the task (consistent with V3), but does not fundamentally change the underlying non-linguistic representations of color (as would follow from V1). V2 was discussed in the context of spatial reorientation, number, and theory of mind. In the case of spatial reorientation, conjectures about the role of language have focused on the causal role of language in providing a ready-made system allowing combinations of content across otherwise modular systems (Spelke, 2003; Berwick and Chomsky, 2017); supporting evidence has been offered to argue that active production of spatial terms does, in fact, support increased integration of geometry plus landmarks (Hermer- Vasquez et al., 1991). The general theoretical conjecture about the importance of combinatorics is deep but difficult to empirically prove. Moreover, the evidence offered is completely consistent with the hypothesis proposed in V3—that on-line recoding by language during the reorientation task (especially, phrases that express both surface and geometric properties of a layout) leads to better performance in virtue of on- line, momentary linguistic encoding. In the case of number, Frank et al. (2008) have
Language and thought 261 argued compellingly that linguistic representation of number provides a powerful ‘tool’ that enables representation and manipulation of large exact numbers using words for numbers, but that this tool is not necessary for the computation of the cardinality of large sets, except in cases where the values must be carried through memory, because of variation in time, space, or modality. This hypothesis lines up clearly with V3, the ‘momentary recoding’ hypothesis, especially in emphasizing the crucial role of language in creating and retaining a new, linguistic representation (spatial or numerical) over time. Finally, although V2 arguments about the role of language in the development of theory of mind suggest that language is necessary for representations of another person’s false beliefs, the empirical findings from infancy make it unlikely that such a strong hypothesis is warranted. This is not to deny that language may play an extremely important role in computing another person’s false belief. Rather, the findings are consistent with the idea that mastery of complement structures is neither necessary nor sufficient for representing the contents of another mind, but that the format of these structures provides a highly convenient ‘tool’ which supports the on-line encoding and manipulation of propositions that are embedded within different mental states (believing, thinking, knowing, etc.). Consistent with V3, this on-line and temporary linguistic encoding may result in much greater efficiency in ‘reading off ’ the meanings of sentences which have this structure. This powerful use of on-line, temporary encoding supports the solution of complex TOM problems, but need not create the cognitive system that originates in infancy, prior to learning a language.
12.3 Conclusions This review has made clear that the answer to whether language changes thought depends heavily on how one defines ‘language,’ ‘thought,’ and even ‘change.’ Version 1, originating with Whorf, considers whether learning a specific language leads to changes in thought; in this version, the classical position assumes that ‘thought’ means ‘non- linguistic thought,’ that is, representations that exist prior to learning a language and/ or in species who do not have the capacity for language. Admittedly, Whorf ’s own position has been subject to stronger and weaker interpretations, with some arguing that any kind of effect of language in a perceptual or cognitive task supports the Whorfian position (see Section 12.2.1.1, e.g., Gilbert et al., 2006). However, in the extreme, this becomes circular; if a task requires language, and performance follows what is encoded by that language, then this is simply an effect of language on language (see Gleitman and Papafragou, 2012, for additional discussion). Version 2 proposes that learning any human language confers new computational powers, creating wholly new kinds of ‘thought’—that is, representations that could not have existed without language and its particular formal properties. Here, too, one must assume that ‘thought’ without language must be pre/non-linguistic representations and that language causes the emergence of new, more powerful modes of thought. V2
262 Barbara Landau suggests that altogether new computations can be carried out using the formalism afforded by language, whether it be combining outputs of independent modules, using number names to provide insights into a kind of number system impossible to represent without language, or representing and manipulating with ease the kinds of embeddings required for computing the relationship between the truth of a proposition and a person’s mental stance toward it. Each of these depends on according language a powerful role in the way we ‘think,’ but it remains unclear whether language is a necessary condition for these accomplishments. Version 3 represents a different view, proposing that linguistic formalisms do indeed confer a powerful advantage, but that the mechanism for this advantage is largely that of momentary, on-line recoding of our non-linguistic representations. This representation will reflect some crucial aspects of our non-linguistic representations, but discard or ignore others; moreover, different aspects of language will focus on different aspects of our non-linguistic representation, with cross-linguistic equivalences being established through a variety of linguistic mechanisms. The resulting linguistic representation of our thoughts will be selective, compact, and perhaps, also more easily manipulable. Each of these versions embraces a particular view of which properties of language make the biggest difference, what domains of thought afford the clearest example of the effects of language, and what kind of change follows from learning language. Although the classical Whorfian view still appears to have sway for lay interpretations of the relationship between language and thought, the actual empirical evidence has shaped up to rule out the strongest version of this view. More promising are newer views that suggest that having a language—any language—can have powerful effects on human cognition, either by supporting the creation of wholly new forms of representation, or by using the formalism of language to compactly and efficiently express and manipulate selected aspects of our representations. Such constant recoding of our experience may well lead to the intuitive sense that we think ‘in’ language. But as pointed out in the opening pages of this chapter, nothing could be farther from the truth. Language and (non-linguistic) thought are separate, complementary systems of cognition, peacefully co-existing while powerfully interacting.
Acknowledgments Thanks to Lila R. Gleitman for substantive comments on this chapter, and for many illuminating discussions on this topic over decades. Thanks also to the following attendees of the Language and Cognition Lab meeting, who read and gave substantive comments on a previous draft of this chapter: Nico Cesana Arlotti, Emory Davis, Alon Hafri, Jane Lutken, Robert Medina, Rennie Pasquinelli, Jasmin Perez, and Zekun Sun.
PART II
ACQUIRING THE MENTAL LEXICON
Part IIA
FORM
Chapter 13
Infants’ Learning of Speech Sounds and Word-Forms
Daniel Swingley
13.1 Introduction We speak to babies from the moment they are born. To the infant, the sound of the human voice is already familiar from prenatal experience (e.g., DeCasper, Lecanuet, Busnel, Granier-Deferre, and Maugeais, 1994; Kisilevsky, Hains, Lee et al., 2003), but after birth, the voice plays a new role. Babies see us leaning in toward them, looking in their eyes, and singing, cooing, or speaking. They notice how our own vocalizations can be timed to the infant’s actions, and how our faces move when we talk (Guellaï, Streri, Chopin, Rider, and Kitamura, 2016). What do they think we are doing when we speak? They probably think the melody is the message, as Fernald (1989) put it: infants seem to resonate to the emotional meaning of our intonation contours, at least for some intonation patterns, probably without having to learn to do so. Parents are adept at using the infant-directed speech register to keep infants’ attention and modulate their emotional state (e.g., Fernald, 1992; Lewis, 1936; Meumann, 1902; Spinelli and Mesman, 2018). To the extent that young babies reflect on language at all, they might begin with the hypothesis that speech is similar to cuddling—auditory rather than tactile, but nonetheless primarily a personal and intimate source of emotional support. Of course, this hypothesis is decidedly incomplete as a description of what speech does, covering almost none of what we think of when considering language. In learning a particular language, additional speech features come to the fore: sets of consonants and vowels, their nature when appearing in context, the longer and more complex sequences that form words, and most generally the partitioning of phonetic variation into its myriad causal sources. Infants make progress in all of these areas during their first year, and also make substantial headway in learning aspects of the meanings of
many early words. Here, we outline what is known, and what is still unknown, about how this process typically unfolds in early development.
13.2 Infants’ categorization of speech sounds By common consensus, this story begins in 1971 with the publication of a paper by Eimas, Siqueland, Jusczyk, and Vigorito examining 1-and 4-month-old infants’ discrimination of the syllables /ba/and /da/. They chose to test infants on these syllables for an interesting and theoretically significant reason. Researchers at Haskins Labs had developed a procedure for synthesizing consonant-vowel syllables, allowing precise control over the syllables’ acoustic features. They found that they could simulate the phonetics of the long voice onset time appropriate for stop consonants like /p/or /t/by manipulating two features of the resonances (formants) leading into the following vowel: silencing the first formant, and replacing the second and third formants with noise (Liberman, Delattre, and Cooper, 1958). The short voice onset time appropriate for a /b/could be synthesized similarly, by effecting these alterations for a shorter period (thereby allowing voicing to begin more closely following the release of the consonant). Remarkably, gradually increasing the voice onset time between the two did not gradually increase the likelihood that a given syllable would be heard as voiceless; instead, listeners consistently perceived the syllable as ‘ba’ until voice onset time reached about 25 ms, at which point most responses went to ‘pa.’ This phenomenon, in which perceptual discrimination is governed by the stimulus sitting either left or right of a boundary point on a continuum, rather than by the acoustic distance along that continuum, came to be known as categorical perception (Liberman, Harris, Hoffman, and Griffith, 1957; Repp, 1984). The early Haskins papers acknowledged uncertainty regarding where the boundaries might come from: perhaps they were innate, or perhaps they were learned; if learned, mature listeners may once have been highly sensitive to many decision boundaries in phonetic space and learned to disregard some of them; or rather quite poor at the start and learned to sharpen discrimination at these boundaries (Liberman, Harris, Kinney, and Lane, 1961). Eimas et al. (1971) aimed to address this question of origins, testing infants of one and four months of age. They used a recently developed habituation technique (Siqueland and DeLucia, 1969), in which infants’ sucking on a pacifier was rewarded, in this case by presentation of syllables; once the initial reward-prompted increase in sucking rate tailed off (habituation), the habituation syllable was replaced by another, with a voice onset time 20 ms different from the initial one; or it was kept the same, for infants in the control group. When the habituation and test stimuli straddled the ~25 ms boundary (at 20 and 40 ms), sucking rates rebounded; when the stimuli sat on the same side of the boundary (at -20 and 0 ms or at 60 and 80 ms), sucking rates continued to sink after the
Infants’ Learning of Speech Sounds and Word-Forms 269 change, just as much as in the no-change control condition. Eimas et al. concluded, “ . . . the means by which categorical perception . . . is accomplished may well be part of the biological makeup of the organism and, moreover . . . must be operative at an unexpectedly early age” (p. 306). The usual interpretation of the Eimas et al. (1971) result is that languages tend to settle into a position where at least some of the phonological distinctions required for differentiating words come naturally to the auditory system, presumably reducing the burden on learning and helping make speech interpretation more accurate in the mature state. These nonlinearities in discrimination may not really be adaptations for spoken language per se, because similar nonlinearities are found in other mammals who do not produce consonants (e.g., Kuhl and Miller, 1978). This is why these perceptual boundary phenomena are viewed as influencing how languages evolve over historical time, rather than as human evolutionary adaptations to some fixed set of phonetic standards. Additional patterns of selective sensitivity were found in subsequent studies of infants, showing that the alignment between naive speech perception and language structure is not unique to the voicing distinction in stop consonants. For example, the difference between /d/and /ɡ/at the start of a syllable is signaled primarily by changes in vocal tract harmonics (the second and third formants) during the transition from the consonant into the vowel. Eimas (1975) used synthesized syllables that adults identified as either [dæ] or [ɡæ] to test their discriminability to 2-and 3-month-olds, and found again that infants distinguished the sounds that were considered different by adults, but did not distinguish sounds that adults rated as falling within the [dæ] category. Similar conclusions were drawn from other early studies showing that infants treat speech variation in ways that would apparently facilitate linguistic categorization. For example, a major part of the difference between /b/and /w/is the speed of the formant transitions into the vowel: fast for /b/, slower for /w/. What counts as “fast” or “slower” depends on speaking rate. If the speaker is talking quickly, a /b/’s transitions must be especially speedy, or the consonant may be interpreted as /w/; if the speaker is talking slowly, the transitions of a /w/must be even slower. It follows, then, that a /w/near the boundary can be turned into a /b/just by making the syllable shorter to signal a faster speaking rate (Liberman, Delattre, Gerstman, and Cooper, 1956). Is this relationship learned through extensive exposure to language? Probably not. Using habituation methods, Eimas and Miller (1980) found that 2–4-month-olds also were sensitive to syllable length in categorizing a consonant as [b]or [w] in the same manner as adults. Thus, infants detected a change from a syllable with a transition of 16 ms to a syllable with a transition of 40 ms only when the syllable was short (when those transition times are consistent with a change from [b] to [w]) and not when the syllable was long (two [b]s). Likewise, infants detected a change from a syllable with a 40 ms transition to one with a 64 ms transition only when the syllable was long, and not when it was short. 
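The logic of categorical perception described above can be made concrete with a small simulation. The sketch below is a simplified, hypothetical labeling model written for this summary, not a reconstruction of the Haskins or Eimas et al. procedures: identification along the voice onset time continuum is treated as a logistic function with a boundary near 25 ms (the slope value is an assumption), and discrimination of a 20 ms step is approximated by the probability that the two stimuli receive different labels. Pairs straddling the boundary come out highly discriminable; equally spaced pairs on the same side do not.

```python
# Toy "labeling" account of categorical perception along a VOT continuum.
# Identification is a logistic function of voice onset time (VOT); the chance
# of discriminating two stimuli is approximated by the chance that they are
# assigned different labels. The boundary (~25 ms) follows the description in
# the text; the slope is an illustrative assumption, not a fitted value.

import math

BOUNDARY_MS = 25.0
SLOPE = 0.5

def p_voiceless(vot_ms: float) -> float:
    """Probability of labeling a token 'pa' rather than 'ba' at a given VOT."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))

def p_discriminate(vot_a: float, vot_b: float) -> float:
    """Probability that two tokens receive different labels."""
    pa, pb = p_voiceless(vot_a), p_voiceless(vot_b)
    return pa * (1 - pb) + pb * (1 - pa)

# 20 ms steps, as in Eimas et al. (1971): within vs. across the boundary.
for pair in [(-20, 0), (20, 40), (60, 80)]:
    print(pair, round(p_discriminate(*pair), 2))
```

Only the (20, 40) pair, which straddles the assumed boundary, yields a high predicted discrimination probability, mirroring the rebound in sucking rates that Eimas et al. observed for boundary-straddling stimuli.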
Once again these effects are probably not specific human adaptations for language (for example, birds have shown the same effect with human syllables; Welch, Sawusch, and Dent, 2009), but they point to the fact that the perceptual similarity space
270 Daniel Swingley in human infants and adults is peculiar in some of the same ways, an alignment that surely facilitates the intergenerational transfer of linguistic conventions. Through the 1970s and early 1980s a wave of studies using precisely controlled speech materials followed up on the Eimas et al. (1971) study (and also early experiments by Morse, 1972; and Moffitt, 1971), revealing infants’ ability to detect subtle linguistically- relevant phonetic distinctions (e.g., Werker and Lalonde, 1988). Many of the initial studies were essentially infant versions of several of the speech perception experiments performed at Haskins Labs (Liberman, Cooper, Shankweiler, and Studdart-Kennedy, 1967). These had characteristic similarities: very simple (often synthesized) stimuli, precise parametric manipulation of auditory cues, and discrimination or simple categorization as the response measure. Later studies with infants retained these features, but as time passed, the field’s emphasis transitioned from measurement of specific acoustic determinants of phone identification, into evaluation of whether infants could distinguish a broad range of consonants or vowels of various languages (Aslin, Jusczyk, and Pisoni, 1998). The main conclusion was simple: within the earliest months of life, infants can distinguish the speech sounds of any language, if those speech sounds are clearly realized. Researchers have continued in this line of work, though present-day researchers tend to use more varied speech samples. The phenomenon of early speech-sound discrimination is robust enough that exceptions are newsworthy (e.g., Narayan, Werker, and Beddor, 2010) and spark empirical efforts to rescue the broader generalization (e.g., Sundara, Ngon, Skoruppa et al., 2018). In addition to discriminating many sounds, young infants readily group some speech sounds together despite substantial acoustic variation. For example, Marean, Werner, and Kuhl (1992) found that 2-and 3-month-olds trained to respond to a change from [a] to [i] from a synthesized male voice generalized this response to a synthesized female voice, suggesting an early-emerging ability to track criterial features of vowels over variation in other acoustic properties. The generality of this result is not clear; subsequent studies examining generalization of a familiarized syllable or a word from one talker to instances from another talker, one emotional state or another, and so on, has revealed a mixed picture in somewhat older infants (e.g., Bergelson and Swingley, 2018; Singh, 2008). At present we are not in a position to quantitatively characterize young infants’ naive similarity space well enough to say whether infants have a general predisposition to weigh especially heavily formant values or other features that serve to differentiate speech sounds crosslinguistically. Of course, over time infants move beyond this initial state. After all, they are learning the language or languages of their environment, and languages vary in the categories they use and where the category boundaries fall. One might imagine that infants would simply get better and better at resolving their native-language categories, while retaining their initial capacity for sound discrimination, but this is not what happens: as infants develop, improvements in native-language speech-sound categorization are accompanied by decrements in categorization of nonnative speech sounds.
Infants’ Learning of Speech Sounds and Word-Forms 271 Two foundational experiments demonstrating these early changes are Werker and Tees (1984) and Kuhl, Williams, Lacerda, Stevens, and Lindblom (1992). Both used a conditioned headturn method in which infants were rewarded with an audiovisual display that was available only when a soundtrack of repeated syllables changed to another, different syllable. Because infants had to turn to see the reward, headturns indexed infants’ detection of the change. Werker and Tees (1984) found that English-learning 6– 8-month-olds nearly all succeeded in learning this contingency for a pair of unfamiliar stop consonants (drawn either from the Salish language Nthlakampx, or from Hindi); by 10–12 months, infants only rarely succeeded, despite performing well with English consonants. Thus, unless sounds are present in the infant’s language environment, they may become difficult to distinguish. Kuhl et al. (1992) examined what is, in a way, a mirror image of this effect, showing that by 6 months of age, variant instances of a given native-language vowel category can become more difficult to differentiate. Compared to Swedish infants, English- learning infants were not as good at discriminating altered versions of English [i]from one another; but English infants were better than Swedish infants at telling variants of Swedish [y] apart. This might seem paradoxical: why would experience with language make infants worse at noticing variation within a familiar vowel category? (It would be surprising, for example, if dog experts were inferior to cat experts in noting an uncharacteristic bark in a terrier.) Nonetheless, Kuhl et al.’s result makes sense functionally: if a word has an /i/in it, it does not matter exactly what shade of /i/it is, and so English learners should eventually learn to collapse irrelevant differences within the category. Computational models that make intuitively sensible assumptions about the learning process have suggested ways this result could come about in infants even though they are not being explicitly trained to differentiate vowels (e.g., Guenther and Gjaja, 1996). Similar acquired-equivalence effects have been found in other domains, often paired with enhanced discriminability effects at category boundaries (Goldstone, 1998; for a review and a model, see Feldman, Griffiths, and Morgan, 2009). In the domain of speech, the phenomenon of reduced discriminability near the center of a category relative to the periphery became known as the “perceptual magnet effect,” based on the metaphor of the category prototype being a magnet attracting other things toward it (Kuhl, 1991). Further experiments testing infants’ discrimination of native and nonnative speech- sound contrasts supported the patterns revealed in these initial studies: in general, infants whose native language uses a phonetically “close” pair of sounds will be able to distinguish clear instances of those sounds throughout development; infants whose native language does not use this pair of sounds will begin to show reduced discrimination starting between 6 and 12 months of age (reviews, e.g., Jusczyk, 1997; Kuhl, Conboy, Coffey-Corina et al., 2008; Werker, 2018). 
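One way to picture how category learning could reduce discriminability near a category center, in the spirit of the inference-based account of the perceptual magnet effect cited above (Feldman, Griffiths, and Morgan, 2009), is sketched below. The sketch is my own simplification to one acoustic dimension with invented numbers, not their model as published: listeners are assumed to treat a heard token as a noisy version of an intended target drawn from one of two categories, and to perceive the posterior mean of that target. Tokens near a category mean are pulled together; tokens near the boundary are pulled toward different means, so the same acoustic step is perceived as smaller within a category than across it.

```python
# Perception as inference about an intended target, marginalizing over two
# hypothetical vowel categories on a single (second-formant) dimension.
# Category means, variances, and noise level are illustrative assumptions.

import math

MU = [2000.0, 2600.0]   # assumed category means (Hz)
VAR_C = 150.0 ** 2      # within-category target variance
VAR_S = 150.0 ** 2      # acoustic/perceptual noise variance

def gauss(x: float, mu: float, var: float) -> float:
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def perceived(s: float) -> float:
    """Posterior mean of the intended target given the heard value s."""
    like = [gauss(s, mu, VAR_C + VAR_S) for mu in MU]   # P(s | category)
    post = [l / sum(like) for l in like]                # P(category | s)
    w = VAR_C / (VAR_C + VAR_S)                         # weight on the signal
    return sum(p * (w * s + (1 - w) * mu) for p, mu in zip(post, MU))

# The same 100 Hz step near a category center vs. at the category boundary.
for a, b in [(1950.0, 2050.0), (2250.0, 2350.0)]:
    print(f"{a:.0f} vs {b:.0f} Hz: perceived difference "
          f"{abs(perceived(a) - perceived(b)):.0f} Hz")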
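```

With these assumed values, the within-category pair comes out perceptually compressed and the boundary-straddling pair perceptually stretched, which is the qualitative pattern the magnet-effect studies report.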
This decline in performance with nonnative sounds appears to take place somewhat sooner for vowels than for consonants, with evidence of nonnative discrimination failures at 6–8 months old in vowels (Bosch and Sebastián-Gallés, 2003; Polka and Werker, 1994) and analogous declines in consonants about three months later (e.g., Rivera-Gaxiola, Silva-Pereyra, and Kuhl, 2005; Segal, Hijli-Assi, and Kishon-Rabin, 2016; Tsao, Liu, and Kuhl, 2006). Note, though, that this
272 Daniel Swingley observation, while frequently cited, is not yet conclusively supported (Tsuji and Cristia, 2014), and this pattern of results would not imply that vowels are easier to learn; it might instead imply that consonants take longer to unlearn. Infants’ adaptation to their native language is also characterized by quantitative improvements in categorization performance for familiar speech sounds. For example, Kuhl, Stevens, Hayashi et al. (2006) tested English learners’ responses to [ɹa] and [la] using a headturn method and found that 6–8-month-olds averaged 64% correct, whereas 10–12-month-olds averaged 74% correct. While this kind of improvement is perhaps not surprising, it does exclude any theory holding that perceptual development is just a matter of weeding out pre-existing category boundaries that are not relevant in the local language. The priority and eminence of the Eimas et al. (1971) study and the common characterization of young infants as “universal learners” or “universal perceivers” might give the impression that infants are born with a highly reticulated phonetic perceptual space, and that development could consist of collapsing most of these distinctions to eventually arrive at the mature native speaker’s state. Such a view is not really credible: mature speakers of different languages or dialects implement the “same” sounds in measurably different ways, which would imply a greater number of innate boundaries than there are sounds in all languages, and would still require a procedure for selecting among them. Indeed, innate facilitation of the voicing boundary in English stop consonants is probably more the exception than the rule. In the case of vowels, for example, categorical perception effects are weak. The vowel space is continuous, and languages impose vowel categories upon it (e.g., Kronrod, Coppess, and Feldman, 2016; Swoboda, Morse, and Leavitt, 1976; but see Kuhl, 1994 for a different view). Uncontroversially, these categories must be detected via a data-driven learning process. One way to characterize the developmental transition we have described so far is the illustration in Figure 13.1. The newborn infant hears continuous speech (a), breaks it down into a sequence of consonants and vowels (b), and projects each token into a multidimensional, speech-specific similarity space (c). By 12 months of age, this phonetic space is divided into discrete categories, or reference points, to which experienced speech sounds are matched as they are heard. There is some debate about how closely these phonetic categories line up with the phonemes of the language and how readily they should be expected to serve the phoneme’s role as defining lexical contrast (Swingley, 2016; Werker and Curtin, 2005). If the categories can work as phonemes, the infant would have achieved a significant linguistic goal: conversion of the continuous acoustic signal into a set of discrete representations suitable for concatenation into distinct, identifiable words (d). This illustration is both incomplete and misleading in some important respects. It is incomplete in ignoring suprasegmental features of speech, for example, although infants certainly encode and learn about accent, tone, prosodic phrasing, and so forth, and these features interact with phonetic categorization in complex ways. Worse, though, the conception of the initial state shown in Figure 13.1 is that infants can easily identify which portions of the speech signal correspond to phones to be analyzed and learned. Figure
Infants’ Learning of Speech Sounds and Word-Forms 273 13.1 also implies that by 12 months of age, infants correctly and exhaustively account for the speech signal in terms of phonological units ready for service in representing and differentiating words. These assumptions are debatable, as we will see. Several studies have attempted to determine whether infants represent speech in terms of sequences of consonants and vowels. One approach has been to use number: if infants interpret speech in terms of phones, they might detect when a sequence of four- segment words (like rifu, iblo . . . ), changes to a list of six-segment words (suldri, kafest . . . ). But 4-day-olds do not respond to this change, though they do respond to changes from two-syllable words to three-syllable words (Bijeljac-Babic, Bertoncini, and Mehler, 1993). Another approach has been to present infants with sequences of syllables that exemplify some regularity, like rhyming, and see if they prefer such a list over one without any such regularity. Experiments of this sort have yielded mixed results. Jusczyk, Goodman, and Bauman (1999) found that 9-month-olds preferred lists of consonant- vowel-consonant (CVC) syllables matching in onset consonant and in onset consonant and vowel, but not in the rhyme (VC). Hayes and Slater (2008) found that 3-month- olds preferred onset-matching syllables over more miscellaneous syllables. Infants can be trained to detect changes to a series of rhyming syllables even if the change is only to alter the vowel or the final consonant (Hayes, Slater, and Brown, 2000; Hayes, Slater, and Longmore, 2009), but this does not necessarily require that infants interpret these CVC syllables as comprising three parts. Because these experiments each used quite various sets of syllables in setting up the tested regularities, infants’ recognition of the patterns may implicate a similarity comparison that did not require generalization specifically over segmental units.
Figure 13.1 Complete analog to digital conversion in infants: tempting, but unsupported. [Figure not reproduced: panels (a)–(d) depict continuous speech, its segmentation into consonants and vowels, projection of each token into a phonetic similarity space, and conversion into discrete, identifiable words.]
274 Daniel Swingley Some more precise attempts at this question have also yielded mixed results. Jusczyk and Derrah (1987), using a sucking method like Eimas et al.’s (1971) found that infants who were habituated to the sequence [bi, bo, bɝ, ba] did not show a greater response to the addition of [du] than to the addition of [bu]. Eimas (1999) obtained concordant results using a looking-preference habituation method. These authors argued that infants represent the information that distinguishes all of these syllables, but, on the basis of parsimony, hypothesized that infants do not also segment syllables into consonants and vowels. The opposite result was found by Hochmann and Papeo (2014) who indexed surprise with a pupillary dilation response, and found evidence that infants recognized the distinctiveness of the added consonant despite the variation in the following vowels, a variation that can be quite subtle relative to the changes induced by context. At this point, this line of research taken together does not present a clear picture. A number of studies have suggested that infants can learn phonetic generalizations (“phonotactics”) based on specific features (like being a fricative, having a labial place of articulation, and so forth), but in many cases these do not necessarily implicate a segment-level representation. For example, an English-learning infant could be put off by the Dutch consonant sequence [kn] (as in knie, knee), without representing the two consonants as distinct sounds (Jusczyk, Friederici, Wessels, Svenkerud, and Jusczyk, 1993). Other studies that might be harder to explain without implicating segments involve infants of 9 or 10 months old (e.g., Chambers, Onishi, and Fisher, 2011; for a skeptical overview and meta-analysis, Cristia, 2018). In addition, putting aside any questions of interpretation, it is important to recognize that all of these experimental studies make exclusive use of very short utterances delivered in a hyperarticulated register that is only sometimes characteristic of words in infant-directed speech. Still, based on these studies it seems that at least some of the time, infants capture subsyllabic units from speech, and that by 9 months old or so, they have enough of a sense of their language’s phonological patterns to treat rare phonetic chunks as unfamiliar. Finally, the fact that infants learn the speech-sound categories of their language might imply a segmentation of the signal into units the size of those categories. But it might not: we can be familiar with a thing and not realize that its parts are parts, or give them any significance. A holistic understanding is not necessarily vaguer than an analytic one. Hypotheses in this domain are testable in specific cases. If, for example, infants build categories of vowel-nasal sequences that conflate the two sounds (and that are distinct from the category they are building for the vowel on its own in a non-nasal context), they would show stronger habituation or surprise effects across tokens within that vowel- nasal pairing than across tokens with non-nasal codas (in the manner of Hochmann and Papeo, 2014, for example). Given the evidence summarized above, it does not seem necessary to assume that during the first year, perhaps even in the first half of the first year, infants spontaneously break down the speech signal into consonants and vowels. They might do so, perhaps even most of the time, but this appears to be an open question. 
There are hybrid alternatives to full segmental decomposition that are plausible, in my view, despite having no detectable currency in the literature. For example, infants might have innate or
Infants’ Learning of Speech Sounds and Word-Forms 275 very early-developing parsing skills that lead them to draw boundaries at areas of salient phonetic change—dividing fricatives from everything else, or continuants from stops, and so forth. Perhaps they interpret vowel-glide sequences in much the same way we think of diphthongs: complex sounds with trajectories built in. Even in the case of mature native speakers, borderline cases are not uncommon. English speakers might assert that there is a /w/in you wonder but not in you under, but it seems a stretch to assume that infants would share these intuitions in an adult-like way before having learned to do so. These questions about parsing and representation have not been addressed much but may be amenable to investigation with existing techniques (see Martin, Peperkamp, and Dupoux, 2013, for discussion; see also Magnuson and Crinnion, this volume).
13.3 Hypotheses about phonetic category learning When infants do isolate consonants or vowels, how do they learn the categories to assign them to? We know more about the developmental timing of this learning than we know about the learning process. An early assumption that children learn speech sounds by observing patterns of contrast in the lexicon (Trubetskoy, 1939/1969) became difficult to sustain once the precocious nature of early learning became clear. If knowledge of words like boat and goat were needed to differentiate /b/and /ɡ/, then either stop consonants must be learned much later than the infant perception studies showed, or infants must have a large enough vocabulary to support these distinctions by 6–12 months of age. Neither proposal seemed plausible. Lexically driven theories require additional capacities for phonetic learning anyway, because knowing that boat isn’t goat doesn’t itself reveal what phonetic features are criterial for the distinction. Consequently, the dominant explanation for infant phonetic learning abandons the lexicon and relies on distributional learning over experienced speech tokens. Distributional learning here refers to inducing categories by detecting clusters of similar sounds. The premise of this proposal is that for each sound of a language’s phonology, spoken instances of that sound will be similar to one another, and separate from members of other categories. In principle, statistical modes in a set of perceptual experiences can be detected without labeled training data (Duda and Hart, 1973). For example, if a sample of vowels with a first formant of about 450 Hz includes half with a second formant below 2,000 and half with a second formant above 2,250, there is a basis for inferring that there are two separate categories in that region of phonetic space. Kuhl et al.’s (1992) finding of language-dependent prototype structure in learned vowel categories was hypothesized to come about through this sort of unsupervised distributional clustering (Guenther and Gjaja, 1996; Kuhl, 1992; Lacerda, 1995). The hypothesis makes sense: how can one learn a category prototype structure, if not by attending to surface distributions of phonetic features? Laboratory experiments with adults and with
276 Daniel Swingley infants have shown that brief but concentrated exposure to instances of well-separated acoustic or phonetic categories can modify listeners’ interpretation of similar sounds (e.g., Francis, Kaganovich, and Driscoll-Huber, 2008; Goudbeek, Swingley, and Smits, 2009; Liu and Holt, 2015; Maye, Werker, and Gerken, 2002; Yoshida, Pons, Maye, and Werker, 2010). In one carefully studied case, infants were shown to have learned a native- language distinction better if their parent tended to articulate one of the sounds in a more acoustically distinct way (Cristià, 2011). Thus, this evidence suggests that distributional clustering is a learning mechanism that might explain early phonetic attunement. Of course, distributional clustering can only succeed in yielding language-specific phonetic categories if the categories are present to be found in the child’s linguistic environment. If instances of a given category are not particularly similar to one another, or if categories are close to one another relative to their spread, unsupervised distributional learning cannot work. Some early studies appeared to support the feasibility of distributional learning by describing the difference between infant-directed speech and adult conversation, and showing greater separation among vowel category centers in infant-directed speech. Further study has produced a mixed picture, with some research suggesting that mothers clarify their speech (e.g., Bernstein Ratner, 1984; Kalashnikova and Burnham, 2018; Kuhl et al., 1997), and others suggesting that they do not (e.g., Bard and Anderson, 1983; Cristià and Seidl, 2014). This discrepancy may mean that increased clarity in parental vowel production varies according to aspects of the context, the sampling methods, the child’s age, and various features of the parent. For present purposes, though, our question is a bit different–not so much how infant- directed speech may be special, but to what degree an infant might learn speech sounds from it. Answering this question requires an estimate of what information infants might extract from each instance of a sound (and which instances count), and a guess about how the infant’s mental clustering algorithm works. In general, researchers have started from the assumption that infants extract formant values from all instances of vowels they hear, and cluster them in a way that can be approximated by either statistical clustering models (like k-means or hierarchical cluster analyses) or more complex computational models (Vallabha, McClelland, Pons, Werker, and Amano, 2007). Almost all such studies have been unable to show that distributional information in infant-directed speech is adequate for category learning of the sort apparently demonstrated by infants. The vowels overlap too much (Adriaans and Swingley, 2017; Antetomaso et al., 2017; Jones, Meakins, and Muawiyath, 2012; Swingley and Alarcon, 2018). In general, these models do not just fail; they are appallingly bad. For example, Swingley and Alarcon (2018) found that the basic morphology of a clustering solution was so arbitrary, it was usually significantly altered by sampling random sets of 99.5% of the data rather than the full 100%. The exception to this pattern, presented by Vallabha et al. (2007), did show successful learning of vowel categories. The speech data the model was given were not measurements of vowel tokens, but samples drawn from Gaussian distributions whose parameters were derived from recordings of mothers talking to their infants. 
The speech was not free conversation, but mothers reading nonce words from a storybook, where
many of the words were phonologically similar to one another (peckoo, payku, kibboo, keedo . . . ). Under these conditions it is very likely that mothers produced relatively hyperarticulated speech. This result suggests, then, that unsupervised learning models do not fail because they have in-principle flaws; they fail because ordinary parent-infant conversation is phonetically messy. Why do infants succeed where our own efforts fail? The main contending explanations are these: (a) we are using the wrong phonetic characterization; (b) we are measuring the wrong set of cases; (c) we are neglecting helpful regularities at other levels of description.1 On current evidence, each of these is plausible. Concerning the phonetic characterization, it is widely understood that the usual technique of describing vowels by measuring their first and second formants at the midpoint, and sometimes adding the vowel’s raw duration, only offers an approximation of the full information provided in a vowel. There are many other features, such as change in formant structure over time, spectral tilt, creak, and pitch movement, as well as features outside of the auditory domain entirely, such as visual features in the talker’s face (Teinonen, Aslin, Alku, and Csibra, 2008). It is sometimes suggested that measuring these features and adding them to categorization models would make the models more successful. This may be true (though it is difficult to be sure, as these are hard to measure in large natural corpora). It is also likely that categorization models would be more successful if they incorporated more of infants’ natural biases in the interpretation of speech (e.g., Eimas and Miller, 1980). A barrier to this implementation is that we do not have a complete accounting of these biases—the infant studies that have been done have been more like demonstration projects or existence proofs than like reference manuals. So perhaps speech perception is easier for infants than it seems, because of the information they have native access to. It is also possible that our models inflate the difficulty of learning speech sounds because of the assumption that all speech tokens drive the learning process. Infants may attend to some instances more than others (perhaps because they are unusually salient: louder, sitting on pitch peaks, longer . . . ), and perhaps the ones they attend to exemplify their categories more clearly. If this is the case, the random-looking explosions of points that make up typical speech first-formant x second-formant plots present too pessimistic a picture of the learning problem. For the purpose of recognition, every missed sound or word is an error; for the purpose of learning, relying on just a few very good tokens might be a perfect strategy. Adriaans and Swingley (2017) tested this idea by comparing the separability of vowels that had been independently labeled as being emphasized by the mother, and vowels that had not been so labeled. Certainly, the “focused” vowels were significantly more distinct from one another, and showed indications of hyperarticulation. This being said, the focal vowels still overlapped considerably, suggesting that attending only to prominent vowels does not make the variability problem go away.
1 There is also (d): we have mischaracterized the learning that infants show in our categorization experiments. This is worth taking seriously (Schatz, Feldman, Goldwater, Cao, and Dupoux, 2019).
278 Daniel Swingley A third possibility is that infants make use of contextual information outside the segment for helping to identify those segments. As discussed above, one version of this is a very old idea: the notion that children’s knowledge of the different meanings of minimal-pair words could inform children about phonological distinctions. The problem with this idea for explaining infant development is that infants were supposed to start learning the meanings of words between 9 and 12 months of age or so, after the age at which experiments showed that they had begun to learn some of their language’s speech-sound categories. More recent research has indicated two ways around this sensible developmental objection: first, perhaps infants are already building a meaningful lexicon by midway through the first year, and if so, these words may be numerous and diverse enough to contain minimal or near-minimal pairs that in ensemble could drive phonetic learning. Second, infants might come to represent chunks of speech corresponding approximately to words, and rely on these chunks (the “protolexicon”; Swingley, 2005b) to serve as recognizable islands that could form the basis of phonetic generalizations. I will return to this possibility after characterizing what infants know of words.
13.4 Infants learning words In the 1980s and through the 1990s, researchers interested in what infants know about language diversified in questions they asked. Does language help infants learn categories? (e.g., Balaban and Waxman, 1997). Do infants relate speech sounds to talking faces? (Kuhl and Meltzoff, 1982.) Can newborns tell one language from another? (Mehler, Jusczyk, Lambertz et al., 1988). Can infants learn abstract rules? (Gomez and Gerken, 1999; Marcus, Vijayan, Rao, and Vishton, 1999). Much of this work was possible because of innovations in the methods used to assess infant cognition in the domain of language. By far, the most influential among these has been the Headturn Preference Procedure. In one early study, Fernald (1985) seated infants in a three-sided booth containing a loudspeaker on the left and right, and a light on the left, in front of the infant, and on the right. On each trial, one of the lateral lights was illuminated. When infants turned to look at it, speech was played from the speaker on that side. Fernald (1985) found that four-month-olds turned more to a side that played speech in an infant-directed speech register (with the pitch contours, elongations, and positive affect typical of speech directed to infants) than to the side that played speech in an adult-directed conversational register. Later studies adapted the method so that infants’ time to orient to a particular trial type became the dependent measure, with side of presentation randomized (Hirsh-Pasek et al., 1987; see also Colombo and Bundy, 1981). Infants turned out to be willing to look longer to hear some kinds of speech than to hear others. For the most part, studies revealed a preference for listening to speech samples that had more features of the native language, indicating that infants had
Infants’ Learning of Speech Sounds and Word-Forms 279 learned from their experience. Several studies exploited this to evaluate whether infants would listen longer to lists of words that they might have heard, as opposed to infrequent or invented (nonce) words. Hallé and de Boysson-Bardies (1994) were the first to show just that, in 11-and 12-month-olds, inspiring a series of follow-up studies testing whether this preference would be maintained if the words were altered phonologically (Hallé and de Boysson-Bardies, 1996, Poltrock and Nazzi, 2015; Swingley, 2005a, Vihman, Nakai, De Paolis, and Hallé, 2004, Vihman and Majorano, 2017). The motivation for testing “mispronunciations” like this is to learn whether infants’ knowledge of frequent word-forms should be viewed as vague, retaining only gross acoustic aspects of words; or more detailed, retaining sufficient phonetic information to cause phonological deviations to reduce familiarity. At 11 months, the familiar-word preference often disappears when the stimulus words are mispronounced: infants might gaze at a blinking light longer to hear ‘dog’ but not ‘tog.’ In some studies, infants also reveal a preference for canonically realized words over deviant pronunciations of those words. These changes in response seem most reliable in stressed syllables, and less consistent when less prominent portions of words are altered. This suggests that either infants only accurately represent the more salient parts of words, or less salient changes interfere less with whatever motivates infants to listen longer to familiar words. At five months, the familiar-word preference has been shown using the infant’s own name (Mandel, Jusczyk, and Pisoni, 1995). Experiments altering the phonological form of the name have yielded a more complex pattern of results, with infants detecting some changes and not others, in ways that may depend on the language being learned (Bouchon, Floccia, Fux, Adda-Decker, and Nazzi, 2015; Delle Luche, Floccia, Granjon, and Nazzi, 2017). What these studies show is that by 11 months and perhaps earlier, infants, in their daily experience with language, hear some frequent words and store them in memory with a level of fidelity that would be adequate for differentiating similar-sounding words as the language requires. These studies do not provide much of a basis for estimating how many words this amounts to, however; infants might know a few dozen word-forms, or they might know several hundred. How do infants find these words to begin with? Most utterances that infants hear contain more than one word, so short of assuming each utterance is a word, infants need a way to isolate them. Researchers have tried to characterize infant word-finding primarily using brief training procedures. Infants are first presented with a set of sentences, like a very short story, that contains several instances of a given word, such as ‘bike.’ Then, a series of test trials evaluates infants’ listening times to repetitions of that word (‘bike . . . bike . . . ’) or another word (‘feet . . . ’). In most studies, materials are counterbalanced across infants, so some children’s familiar target serves as other children’s unfamiliarized distracter. If, across the sample, most children listen longer to the familiarized word, it shows that they were able to recognize the match between the word as it appeared in the sentence, and as it appeared in isolation. 
This, in turn, would require that infants pull that word out of the sentence to begin with, and remember it for a brief period. The procedure can also be done in reverse, starting with isolated words and testing on different passages (Jusczyk and Aslin, 1995).
280 Daniel Swingley The introduction of this method sparked an explosion of studies about infants’ detection of words in continuous speech (for a review, see Johnson, 2016). Some of these studies asked about memory and representational format. For example, Jusczyk and Hohne (1997) exposed 8-month-olds to words read from a storybook over a period of two weeks, and then tested them in the lab after a delay of another two weeks. Infants revealed a preference for common words from the storybook (e.g., jungle, python) relative to control words matched for syllable number (e.g., lanterns, camel). Other home- exposure/ delay studies using these methods have yielded some variations: 7.5-month-olds perform better when the familiarization recordings use an infant-directed speaking style, if the familiarization is done from audio recordings played without coordinated social engagement (Schreiner, Altvater-Mackenson, and Mani, 2016). Keren-Portnoy, Vihman, and Fisher (2019) found that 12-month-olds could recognize home-exposed words, but only when those words were uttered as one- word utterances (in exposure and at test). The latter study suggests a less generous picture of infants’ word-segmentation ability. The authors propose that previous studies may have highlighted the trained words by presenting them in a voice other than the mother’s, making them more memorable at test. There are more variables in play here than there are datasets, but these conclusions seem safe: first, infants do show durable memory of some of the words that they hear in fairly typical at-home contexts, even when they have little or no information about what the words mean. Second, utterance position and talker identity probably play a role in determining which words children will recall (at least when recall is measured using this preference measure). The bulk of the headturn preference studies in this domain have been dedicated to characterizing infants’ capacities for speech segmentation, identifying which features of the speech signal they interpret as cohesive. Several studies have shown that infants do not group together portions of speech that fall on either side of a prosodic boundary, like a clause boundary; furthermore, words that are aligned with these boundaries are easier for infants to detect (Seidl and Johnson, 2006; see also Hirsh-Pasek et al., 1987). This is true even in infants as young as 6 months (e.g., Johnson, Seidl, and Tyler, 2014; Shukla, White, and Aslin, 2011). Defining exactly what counts as such a boundary to an infant, and then characterizing where these boundaries actually fall in infant-directed speech, would help provide quantitative estimates of how much these boundaries reduce ambiguity about word boundaries in the speech signal. A second way infants might discover words is by identifying chunks of speech whose component parts appear together in sequence more often than might be expected (which suggests they may exist together as an independent unit), or whose component parts seem to appear together only infrequently (which suggests they may include a boundary). Considering the Friederici et al. (1993) result mentioned above, an English learner hearing /kn/might well consider there to be a boundary between these two sounds (as in ‘walk now’), whereas a Dutch learner might instead consider them an onset (as in ‘knoop,’ button). Experimental tests of this possibility have consistently shown that infants extract words more easily when the contexts favor their segmentation phonotactically. 
For example, in English the sequence /nɡ/ is rare within words. On
Infants’ Learning of Speech Sounds and Word-Forms 281 the other hand, /ŋɡ/and /ft/are quite common within words (‘finger,’ ‘lefty’). Likewise, /fh/is rare within words, appearing only in compounds like ‘wolfhound.’ Consequently, a word like /ɡæf/should fairly leap out of its context in ‘bean gaffe hold,’ and melt comfortably into ‘fang gaffe tine.’ Indeed, 9-month-olds detect ‘gaffe’ more readily in the ‘bean . . . hold’ context, suggesting that they not only have some sense of the phonotactic facts, but also use them to pull words out of their context (e.g., Mattys and Jusczyk, 2001). Infants also appear to learn language-particular prosodic generalizations that can help with speech segmentation. The best known of these is the “trochaic bias,” or the tendency to assume that stressed syllables (or syllables with full vowels: Warner and Cutler, 2017) start words. In Germanic languages like English, bisyllabic words tend to have stress on the first syllable, and mature listeners have a bias against interpreting weak-strong syllable pairs as words (Cutler, 2012). Studies of English-learning infants show that infants in headturn preference experiments more readily recognize strong- weak bisyllables like ‘kingdom’ than weak-strong bisyllables like ‘beret’ (e.g., Jusczyk, Houston, and Newsome, 1999). Given that only some languages exhibit this phonological regularity, it is probably learned through experience with the language rather than being an innate default. Infants in other language environments use regularities present in their language (e.g., Nazzi, Mersad, Sundara, and Iakimova, 2014). How could infants learn consonant-sequencing regularities and typical stress-pattern regularities at the word level, before knowing words? Is there a generic mechanism that could gain infants a foothold regardless of language? Certainly, phonotactic regularities concerning words might be guessed at using utterance edges or other innately available boundaries. If a given consonant cluster comes at the start of an utterance, it’s a good bet as a word-initial sequence. Models that learn phonotactic regularities in this way perform well in locating word boundaries in phonologically transcribed corpora in English (e.g., Daland and Pierrehumbert, 2011). Another potential generic mechanism that could help group together elements of a word is to evaluate whether those elements tend to appear together frequently, where “frequently” could refer to some absolute definition (is this pair of syllables, AB, more common than most pairs of syllables?) or a more complex conditional definition (when A occurs, does B usually follow, and vice versa?). Laboratory studies using “artificial languages” and, less often, carefully tailored natural-speech recordings, have used familiarization–preference designs to test this possibility. These studies have shown that infants are capable of computing both the absolute and the conditional kinds of frequency (Aslin, Saffran, and Newport, 1998; Goodsitt, Morgan, and Kuhl, 1993; Pelucchi, Hay, and Saffran, 2009; Saffran, Aslin, and Newport, 1996). Although most of this research has tested infants of about 8 months, recent evidence suggests that newborns perform similar frequency computations over simple syllabic stimulus materials (Fló et al., 2019). These possibilities give us a plausible narrative for word-form discovery. First, infants detect high-frequency elements at readily identifiable, prosodically defined edges, and high-probability transitions from one element to another. 
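To make the conditional (transitional-probability) computation concrete, the sketch below counts syllable bigrams in a toy corpus and posits a word boundary wherever the forward transitional probability dips. The syllable strings, the fixed trisyllabic "words," and the threshold are all invented for illustration, in the spirit of the artificial-language studies cited above; nothing here is drawn from a real infant corpus.

```python
from collections import Counter

# Toy "utterances" as syllable sequences (invented; loosely modeled on
# artificial-language studies in which words are fixed trisyllables).
utterances = [
    "pa bi ku go la tu pa bi ku da ro pi".split(),
    "go la tu da ro pi pa bi ku go la tu".split(),
    "da ro pi pa bi ku da ro pi go la tu".split(),
]

syllable_counts = Counter()
bigram_counts = Counter()
for utt in utterances:
    syllable_counts.update(utt)
    bigram_counts.update(zip(utt, utt[1:]))

def forward_tp(a, b):
    """P(next syllable is b | current syllable is a)."""
    return bigram_counts[(a, b)] / syllable_counts[a]

# Posit a boundary wherever the transitional probability falls below an
# arbitrary threshold; within-"word" transitions here approach 1.0.
THRESHOLD = 0.75
for utt in utterances[:1]:
    segmented = [utt[0]]
    for a, b in zip(utt, utt[1:]):
        segmented.append("|" if forward_tp(a, b) < THRESHOLD else " ")
        segmented.append(b)
    print("".join(segmented))  # pa bi ku|go la tu|pa bi ku|da ro pi
```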
This yields a stock of familiar
282 Daniel Swingley phonetic forms, or “protolexicon” (Swingley, 2005b). Because the protolexicon is made up of elements that were identified using imperfect heuristics, its fidelity to the language is rather poor at first, containing numerous word clippings and spurious portmanteaus (e.g., Loukatou, Moran, Blasi, Stoll, and Cristia, 2019; see also Saksida, Langus, and Nespor, 2017). But it may be just correct enough to feed discovery and refinement of additional heuristics (such as the trochaic bias in English; Swingley, 2005b; Thiessen and Saffran, 2003), which in turn support additional growth and refinement. As this growth proceeds, words (or hypothesized proto-words) can aid in finding additional words, through “segmentation by default” (Cutler, 1994), that is, if a word is identified for sure, whatever comes after it is the start of another word (Brent and Cartwright, 1996). Infants use this strategy from a young age, according to Bortfeld, Morgan, Golinkoff, and Rathbun (2005), wherein infants hearing their name (or ‘mommy’) in a sentence were able to extract the following word, but were otherwise unsuccessful (see also Altvater-Mackensen and Mani, 2013; Shi and LePage, 2008). A difficulty with segmentation by default, of course, is that words are embedded in other words; the child might hear the familiar syllable ‘can,’ identify it as the word can, and assume that ‘teloupe’ must be its own word. Anecdotes about children protesting that they needn’t behave because they are already ‘being have’ fit this picture. Ultimately, determining how often this heuristic should be useful is a matter for computational models of corpora. Researchers have crafted a range of computational learning models inspired by infants’ performance in laboratory experiments and on a priori theoretical considerations. We cannot review them in any detail here, but Loukatou et al. (2019) and Saksida et al. (2017) provide some quantitative comparisons. Such models are critical for evaluating the real- life utility of the capabilities infants reveal in lab demonstrations. The authors of these models concede that they generally make utopian assumptions about the infant’s ability to interpret every phone in the input correctly, where “correctly” means “spoken like a dictionary” (see Figure 13.1). This is usually justified by appealing to the infant speech- categorization experiments reviewed above, and by the assumption that infant-directed speech, being hyperarticulated relative to adult conversation, does not suffer the same crushing levels of reduction (Warner, 2019). Another concern about most presentations of the computational models is that they are evaluated on their ability to produce correct segmentations, rather than their ability to mimic infants’ performance (that is, a great model that finds all the words is probably a very poor characterization of real human infants). Both of these problems have clear origins: canonical-pronunciation corpora are used because alternative corpora are not yet available, and gold standard evaluation is done because we do not know the actual contents of infants’ protolexicons. These are hard problems. We will point out a few partial solutions in Section 13.7 (“Quantitative analysis . . . ”; see also Creel, this volume, and Magnuson and Crinnion, this volume). What is the developmental role of infants’ speech segmentation? It was widely assumed in the 1990s and 2000s that infant word-finding at around 8 months old was a precursor to word learning. 
Words whose forms were already known might be easier to learn as real words later on, when infants were older and beginning to build a
Infants’ Learning of Speech Sounds and Word-Forms 283 true, meaningful lexicon (Graf Estes, Evans, Alibali, and Saffran, 2007; Hay, Pelucchi, Graf Estes, and Saffran, 2011); or words whose forms were familiar might have more robust phonological representations (Swingley, 2007a). And of course, enlarging the protolexicon would yield a larger database of phonological patterns, which could in turn improve the quality of speech segmentation. The picture that emerges from all we have discussed thus far is that infants are competent learners of phonetic structure at multiple levels. They can learn categories of sounds, they can detect how those sounds tend to co-occur within syllables, and they can learn stretches of sound that in many cases line up with words of the native language. Once they are familiar with these words, they can use them in turn to locate more words in running speech. By 12 months, they are primed and ready for word learning. Of course, implicit in the discussion of these studies is the assumption that infants start learning language by solving the conversion of speech into the correct sequence of consonants and vowels, and then using these segmental units as the elements from which syllables are counted and words are constructed.
13.5 Words and speech sounds This tale of the statistically prodigious but phonology-driven infant turned out to have a little flaw and a bigger flaw. The little flaw is the one mentioned earlier: unsupervised learning of speech-sound categories solely from distributions of infant-directed speech tokens seems difficult, and might be impossible. The bigger flaw is that word meaning seems to come into the developmental picture much earlier than previously supposed. We will take these up in turn. Infant speech-segmentation research has shown that infants learn word-sized chunks of speech at around the same time they are learning speech sounds. As we have seen, maternal speech does not seem to support speech-sound categorization by presenting sounds in phonetic clusters. Could forms from the protolexicon somehow aid in the discovery of phonetic categories? Perhaps variability in the environments of speech sounds could help render those sounds distinct from their neighbors. For example, analysis of a speaker in the Buckeye corpus of adult-adult conversation (Pitt, Dilley, Johnson et al., 2007) showed that vowel pairs like [ɛ-æ] do not form a bimodal distribution in the space formed by [duration, first formant, and second formant]. A clustering algorithm would not isolate two categories. However, the [ɛ] sound is vastly more likely to be followed by [n]than [æ] is; indeed, in the sample measured by Swingley (2007b), the presence of [n] as a coda consonant cleanly separated the [ɛ] tokens from the [æ] tokens. Analysis of phonotactic and phonetic distributions reveals several such cases, driven mainly by accidents of lexical distribution and lexical frequency (and sometimes by phonological rules, like the English ban on syllable-final lax vowels). Wherever these distributional differences are statistically strong, they might help indicate to infants a difference in category identity.
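The point about context can be made concrete with a small simulation, offered here purely as an illustration and not as a reanalysis of any corpus. Two heavily overlapping pseudo-vowel distributions are generated in a simplified F1 by F2 space (duration is omitted), a two-cluster k-means solution recovers the categories poorly, and a hypothetical following-coda cue, assigned with different probabilities to the two categories, tracks them far better. All of the numbers are invented.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
n = 500  # tokens per category (invented)

# Two heavily overlapping pseudo-vowel categories in (F1, F2) space, in Hz.
# Means and spreads are made up to produce roughly the overlap described in the text.
eh = rng.normal(loc=[580.0, 1800.0], scale=[90.0, 160.0], size=(n, 2))  # pseudo-[ɛ]
ae = rng.normal(loc=[650.0, 1700.0], scale=[90.0, 160.0], size=(n, 2))  # pseudo-[æ]
tokens = np.vstack([eh, ae])
truth = np.array([0] * n + [1] * n)

# Unsupervised clustering on the acoustics alone: agreement with the true
# category labels is typically poor given this much overlap.
acoustic_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tokens)
print("acoustics only:", adjusted_rand_score(truth, acoustic_labels))

# Hypothetical phonotactic context: pseudo-[ɛ] tokens are usually followed by
# a coda [n], pseudo-[æ] tokens rarely are (probabilities invented).
coda_is_n = np.concatenate([rng.random(n) < 0.90, rng.random(n) < 0.05]).astype(int)
print("coda context  :", adjusted_rand_score(truth, coda_is_n))
```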
284 Daniel Swingley However, while there is evidence of 9-month-olds learning phonotactic rules, there are more abundant data on somewhat younger infants learning word-forms, which suggests the question: could word-forms themselves point infants to speech-sound categories? Consider the position of an infant learning Spanish and unsure whether /i/ and /ɛ/are the same or not. The instances of these sounds overlap substantially, although as expected the /i/s tend to exhibit a higher second formant and lower first formant than the /ɛ/s. Based on these data, the evidence to the child that there are two categories is slim at best. If the child were familiar with some words, such as quieres and mira (with [i]in the first syllable), and bueno and esta (with [e]), his or her representation of those words, though it derives from multiple instances, might more closely resemble an average or other central tendency of those instances. It appears that clustering over these averages is more successful than clustering over the tokens that gave rise to them (Swingley and Alarcon, 2018). The proposal, then, is that infants learn words and refine speech sounds at the same time, with their first guesses about word-forms providing an additional source of constraint on phonetic category boundaries (Swingley, 2009; for a fuller discussion and a computational model, Feldman, Griffiths, Goldwater, and Morgan, 2013). That infants might use contexts in this way is supported by laboratory studies in which meaningless phonetic contexts help shape infants’ categorization of speech sounds. For example, Thiessen (2011) found that 15-month-olds familiarized with repetitions of words like ‘dabo’ and ‘tagu’ (distinct contexts for [d]and [t]) were more likely to succeed in a difficult minimal-pair word learning task contrasting ‘da’ and ‘ta,’ than children familiarized with ‘dabo’ and ‘tabo.’ Feldman, Myers, White, Griffiths, and Morgan (2013) took this a step further, testing much younger children (8-month-olds) on a vowel discrimination task. In a familiarization phase, some children heard the vowels [a] and [ɔ] in distinct phonological contexts (like [ɡuta] and [litɔ]), and others heard these vowels in a minimal-pair context (like [ɡuta] and [ɡɔta]). All infants were then tested on discrimination of [ta] and [tɔ], and on this task, only the infants who had been familiarized to these vowels in phonologically distinct contexts discriminated them. This idea turns on its head the minimal-pair mechanism of establishing contrast. But in a sense, the ideas are similar. Vowels as instances populate the phonetic space too uniformly to be readily clustered. Words populate the phonetic space in a lumpy, nonrandom way, such that many words of the early lexicon are identifiable (if quite imperfectly) even before the precise phonetic bounds of their constituent units are defined; as a result, words can serve as identifying contexts for their components. On such an account, minimal pairs are predicted to make learning harder at first, because the infant does not have a strong basis for differentiating them before their meanings are known; by hypothesis, if similar contexts bring categories together, minimal pairs would count as identical contexts. But once the members of a pair can be distinguished (by aspects of meaning, or perhaps by cues to syntactic category), minimal pairs should be helpful in guiding children to the right phonological analysis. 
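The word-averaging idea just mentioned can be sketched in the same schematic way. Tokens of an /i/-like and an /e/-like vowel are generated with heavy overlap, each token is assigned to one of a handful of hypothetical word types, and clustering is run once on the raw tokens and once on the per-word averages. Every value and word list below is invented; the sketch shows only the shape of the computation, not the corpus result reported by Swingley and Alarcon (2018).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# Hypothetical word types, each containing one of the two vowels (invented list).
words_with_i = ["quieres", "mira", "aqui", "si"]
words_with_e = ["bueno", "esta", "nene", "leche"]
vowel_mean = {"i": np.array([380.0, 2250.0]), "e": np.array([470.0, 2120.0])}  # F1, F2 (invented)
spread = np.array([110.0, 260.0])  # large within-category spread, so the token clouds overlap

tokens, token_vowel, token_word = [], [], []
for word, vowel in [(w, "i") for w in words_with_i] + [(w, "e") for w in words_with_e]:
    for _ in range(60):  # 60 tokens per word type (invented)
        tokens.append(rng.normal(vowel_mean[vowel], spread))
        token_vowel.append(vowel)
        token_word.append(word)
tokens = np.array(tokens)
truth = np.array([0 if v == "i" else 1 for v in token_vowel])

# Clustering the raw tokens: heavy overlap, typically poor recovery of the two vowels.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tokens)
print("raw tokens:", adjusted_rand_score(truth, raw_labels))

# Averaging the tokens of each word type first, then clustering the eight averages:
# the /i/-words and /e/-words typically fall into separate clusters.
word_names = words_with_i + words_with_e
word_means = np.array([tokens[np.array(token_word) == w].mean(axis=0) for w in word_names])
word_truth = np.array([0] * len(words_with_i) + [1] * len(words_with_e))
mean_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(word_means)
print("word means:", adjusted_rand_score(word_truth, mean_labels))
```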
Indeed, infants can use semantic evidence to guide their attention to phonetic distinctions, including distinctions that are not used lexically in the native language.
This phenomenon has been demonstrated in a series of studies by Yeung (Yeung and Nazzi, 2014; Yeung, Chen, and Werker, 2014; see also ter Schure, Junge, and Boersma, 2016). For example, Yeung and Werker (2014) familiarized 9-month-olds to two unusual objects, labeling each one with either the word [ɖa] or the word [d̪a] (i.e., contrasting in the nonnative Hindi retroflex and dental /d/, which English-learning 9-month-olds do not discriminate). This consistent sound-to-word pairing, as opposed to no such pairing or an inconsistent one, led infants to differentiate these consonants. What this suggests is a set of interdependent relationships between phonetic categorization, the growth of the protolexicon, and the emerging lexicon (Werker and Curtin, 2005).
13.6 Word meanings Is there a developmental transition from initial reliance on a protolexicon of phonetic word-forms, into a “true” lexicon of words with semantic content? A key question is when infants begin to link words and meanings. Certainly, infants seem ready to detect connections between utterances and things in the world. When infants are shown pictures of objects and hear a word repeated along with the pictures, hearing the word seems to bind together these objects into a category in the infant’s mind; likewise, hearing different words applied to distinct objects seems to set the objects apart. These phenomena were first demonstrated in children 9–12 months old (e.g., Plunkett, Hu, and Cohen, 2008; Waxman and Markow, 1995; Xu, 2002) but have been extended to infants as young as 3–4 months (e.g., Ferry, Hespos, and Waxman, 2010; for a review, Perszyk and Waxman, 2018). These studies would seem to rule out any account in which words are simply carriers of emotional prosody by 3–4 months. Laboratory training studies have shown that it is possible to teach 6–7-month-old infants the connection between a novel word and a picture or an object (e.g., Gogate and Bahrick, 2001; Shukla et al., 2011). However, a popular argument holds that the referential uncertainty in children’s language environments prevents children from learning word meanings until they have developed sufficient skills of social cognition to understand the intentions behind communicative acts. If such skills are not in place before about 9 months, that age should also mark the onset of word understanding (e.g., Tomasello, 2001; Bloom, 2001). On this line of thinking it is usually assumed that laboratory word learning is either unrealistically simple, or reflects something more pedestrian, like audiovisual association, which is then argued to not qualify as word learning. Early experimental tests attempting to evaluate infants’ knowledge of the meaning of actual words, learned through daily life and not lab training, showed little evidence of word comprehension before 12 months (e.g., Thomas, Campos, Shucard, Ramsay, and Shucard, 1981). Tincoff and Jusczyk (1999) showed that 6-month-olds would look at a video of their mother when hearing the word ‘mommy,’ and at their father when hearing ‘daddy,’ but it was not clear how broadly to generalize this result given that those words are probably proper names for infants. Still, even infants not understanding reference
and restricted to a meaning-free protolexicon might detect that certain word-forms appear in particular (proto)lexical contexts, distinct from others, in the same way that Latent Semantic Analysis and similar approaches represent word meaning (Landauer and Dumais, 1997). In principle, this might provide infants a leg up in connecting words to broad semantic categories. Elika Bergelson and I aimed to test this “proto-semantics” idea, and set about a prolonged attempt to develop an anticipatory-eye-movement categorization procedure for this purpose. Being unsuccessful, in the meantime we embarked on a control experiment using a better-established language-guided looking method that tests whether children understand words. Pairs of images were presented on a screen, and parents named one of the images aloud. Infants’ gaze was monitored to determine whether they would look more at the named picture. This study was intended to confirm that indeed 6–9-month-olds do not know what common words mean, and to lay out the developmental course of word comprehension over the 6–20-month period (Bergelson and Swingley, 2012). In this, we seem to have failed, because 6–9-month-olds showed evidence of at least partial understanding of several words. Since that study appeared, a few studies have confirmed that by about 6–7 months of age, infants know at least a little about what some words mean (Bergelson and Swingley, 2015; 2018) and other studies have affirmed this in 9- or 10-month-olds (Nomikou, Rohlfing, Cimiano, and Mandler, 2019; Parise and Csibra, 2012; Syrnyk and Meints, 2017). On current evidence, infants’ lexical knowledge is quite sketchy; Bergelson and Aslin (2017) found that 6-month-olds looked at named pictures when the alternative was semantically unrelated (see car and juice, hear ‘car’) but not when it was related (see car and stroller, hear ‘car’). Perhaps 6-month-olds think ‘car’ is a decent word for a stroller, ‘bottle’ for a spoon, and ‘juice’ for milk. If so, this raises interesting questions about what the semantic contents of the 6-month-old lexicon might be. Are words linked to objects within broad situational contexts, rather than to specific object categories? Or are words linked to just a few salient features and are therefore over-inclusive? (Or do infants actually have more precise semantic representations, but smaller semantic mismatches drive their fixations less efficiently?) These questions merit further study. What does this mean for the notion of the protolexicon? It is difficult to say what proportion of spoken language is comprehensible, even minimally, to infants halfway through their first year. Intuitively, it seems that infants probably remember many word-forms as familiar sequences of speech without knowing their referent. Swingley (2007a) used the Brent and Siskind (2001) corpus to estimate how many words a child might hear with high frequency in a period of three weeks and suggested that in this period a child would hear almost 1,000 words 50 times or more. This count is probably an overestimate, because it extrapolates to the whole day data from sessions when parents knew they were being recorded (Bergelson, Amatuni, Dailey, Koorathota, and Tor, 2019). Based on those more recent results, we might estimate that if an infant needs to hear a word 50 times to enter it into their protolexicon, they could reach 1,000 word-forms in two or three months. This would likely exceed the stock of words to which they attach some semantic content.
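The shape of this kind of estimate can be made explicit in a few lines. Every quantity below (tokens heard per day, the size and Zipf-like falloff of the caregiver vocabulary, the 50-exposure criterion) is a placeholder chosen only to show how such a calculation runs, not an empirical claim about any corpus.

```python
import numpy as np

# Placeholder assumptions, not measurements:
tokens_per_day = 7_000      # word tokens addressed to the child per day
vocab_types = 5_000         # distinct word types in the caregiver's child-directed vocabulary
zipf_exponent = 1.0         # token frequency assumed to fall off as 1 / rank
exposures_needed = 50       # criterion for entering a form into the protolexicon

ranks = np.arange(1, vocab_types + 1)
p = 1.0 / ranks ** zipf_exponent
p /= p.sum()                # probability that a random token belongs to word type r

for days in (21, 60, 90):
    expected_tokens = p * tokens_per_day * days
    n_protolexical = int((expected_tokens >= exposures_needed).sum())
    print(f"{days:3d} days: ~{n_protolexical} word types heard {exposures_needed}+ times")
```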
This speculative line of reasoning suggests that there is a period in early development in which the infant lexicon contains several words with detailed referential content (mommy vs. daddy, hand vs. foot; Tincoff and Jusczyk, 2012), dozens or perhaps a hundred words with some broad semantic (and possibly syntactic) features, and several hundred that would be recognized as familiar but that are not yet meaningful. If this is anywhere near the truth, it suggests a lexicon that mixes meaningful words with primarily phonetic entries akin to those of the previously hypothesized protolexicon, with experience filling in increasing semantic detail over time. How that works is, of course, a large topic on its own (see Gleitman and Trueswell, this volume).
13.7 Quantitative analysis and the poverty of the stimulus Learning is turning experience into knowledge. In their first year, infants learn a lot about their native language: they learn the basics of how it sounds, and they begin to build their vocabulary. As reviewed above, our strengths in characterizing this learning lie primarily in evaluating the hidden knowledge infants possess. In infants’ daily lives, their mental categorization of speech sounds is not visible to parents; their recognition of a word as familiar causes no consistent behavioral response. Part of the excitement of research on early language development has come from revealing this hidden knowledge. This work has given us a developmental timeline. To take the two major developments we have focused on here, infants learn to categorize clear instances of their language’s speech sounds (and presumably get better at categorizing more atypical instances too), and infants come to understand something about their first words, at around 6 months of age (and quickly make considerable progress from this shaky start). Along with a developmental timeline, this work has also spoken to individual differences to some degree. When we can measure variability in performance, this variability often turns out to be correlated with later measures of linguistic performance (e.g., Kidd, Junge, Spokes, Morrison, and Cutler, 2018; Kuhl, Conboy, Padden, Nelson, and Pruitt, 2005). Still, in studying infant language development we are better at measuring knowledge than at measuring learning. This is not unusual in developmental psychology (e.g., Siegler, 2000). Our attempts at measuring learning, as opposed to knowledge, tend to take the form of training experiments in which some phonetic item, word, or pattern is presented over a short period (measured in seconds or minutes) with maximum density (most or all items being relevant to the pattern). The pattern of successes and failures is then informative about the capacities of infants in the tested age group; and patterns of correlation with real-world outcomes (like vocabulary size counts) are informative about the skills that bear on success.
288 Daniel Swingley All the same, it remains difficult to make quantitative predictions about how the variations under study bear on the course of a child’s progress toward mastery of his or her native language. To take a common example, many studies have placed infants in a learning situation where the materials are delivered either in stereotypically “infant-directed” speech, or in an “adult-directed” register. Typically, infants perform better with the infant-directed register. We conclude: something about the infant- directed register is working. What we cannot say from this is how much of a benefit it provides. Effect sizes in laboratory measures are not effect sizes outside the lab, because interventions in the lab are often not similar to real-world experience. (There are exceptions—for example, studies with natural, normal-density exposure, such as Kuhl, Tsao, and Liu, 2003.) A consequence of this is that we have theoretical frameworks, rather than models designed to make quantitative predictions. In many cases, it is difficult to place competing frameworks on a footing that allows for direct comparison. Often, frameworks differ more in their domain of prediction than on the outcomes they predict. This does not mean that frameworks are not useful; they are. To take two examples from our field’s most accomplished scholars: The PRIMIR framework (Werker and Curtin, 2005) encourages consideration of the difference between phonetics and phonology, and exhorts us to be aware of the influence of task demands. These are critical reminders, and considering them helps clear up some puzzles in the literature. The Native Language Magnet: Expanded framework (Kuhl et al., 2008) proposes “native- language neural commitment” as an explanatory mechanism and encourages consideration of the neurological underpinnings of learning, knowledge, and on-line processing. These notions allow us to place a wide range of results in a common space for evaluation and consideration of a broad developmental picture. For many verbal models of this sort that conceptually integrate information from a many datasets, asking for quantitative predictions seems downright unfair, and asking which of two frameworks is correct feels like a category error. Ultimately, we would like to make quantitative predictions, and understand learning at a finer grain. Doing this will require that we characterize the experience of the child. To repeat the slogan given earlier, learning is turning experience into knowledge. But what is that experience? Often, the lack of adequately annotated datasets means that we are reading “experience” off the characterizations of language given in grammars, phonetics-lab recording studies, and idealized corpora. A problem with this is that children’s actual environments may present poverty-of-the-stimulus problems that we are not aware of, until we look (e.g., Bion, Miyazawa, Kikuchi, and Mazuka, 2013; Swingley, 2019). In many cases poverty-of-the-stimulus problems are quantitative: infants might have an in-principle “sensitivity” to feature x, but is this sensitivity good enough under day-to-day conditions? Are the conditions good enough for even perfect sensitivity to win the day? (According to the two studies just cited, even perfect measurement of vowel duration would not suffice for characterizing the phonological implications of vowel duration in Japanese, English, or Dutch.)
Infants’ Learning of Speech Sounds and Word-Forms 289 One way to achieve a better quantitative understanding of early language development is to measure connections between the language environment and linguistic outcomes more precisely. To take one example, Swingley and Humphrey (2017) examined which features of words make them most likely to be learned. We started from the Brent and Siskind corpus of child-directed speech, and parent report checklists of vocabulary among those same children. Following Brent and Siskind (2001), we used regression analysis to evaluate which aspects of words in the corpus made them most likely to be reported as understood or said by the children. For each word on the CDI (Fenson et al., 1994), and for each child’s corpus, we computed its frequency, its frequency in one- word utterances, its frequency sentence-finally, how much parents tended to speak that word with exaggerated duration, and a few other predictors such as the word’s form class and concreteness. Among the results were two key findings: first, that overall frequency was by far the strongest predictor of whether a given child would be reported to know a given word; second, frequency in one-word utterances was also a predictor (just as Brent and Siskind claimed), and was therefore not merely a proxy for other measured variables (such as elongated duration) that correlate with appearance in isolation. The point here is not so much the results, but the method: using regression, it is possible to evaluate the relative importance of several aspects of the language environment on the learning of specific items. This kind of study can provide an important counterpart to laboratory word-teaching studies that generally can evaluate only one or two variables at a time, and in an unusual corner of the frequency-of-exposure x density-of-exposure space. Similar studies are helping to differentiate theories of the infant’s capacity for word segmentation. Larsen, Cristia, and Dupoux (2017) implemented several word- segmentation algorithms that have been proposed in the literature, using the Brent and Siskind (2001) corpus as the environment, and parental report data from the WordBank repository (2017) as the outcome measure. Larsen et al. found that the models with the “best” performance in extracting words (i.e., a gold standard determined by the language) were not the models that most accurately predicted children’s word knowledge. Results like this signal the importance of using child data rather than gold-standard perfection in evaluating models of learning. For the near future the largest hurdle in making quantitative models of infant phonetic and word-form learning is the difficulty of simulating the infant’s innate similarity space for speech (Dupoux, 2018), and the lack of annotated corpora of infant-directed speech. Ideally, a computational learning model should take as its input the speech signal itself, or rather this signal represented as the transformation effected on the signal by the neonatal speech perception system. This is a difficult and unsolved engineering problem (e.g., Jansen et al., 2013). 
In principle, a reasonably accurate model of infant speech perception, supported by the many experiments that have been done to date (and, undoubtedly, by others, designed to fill in the most important gaps in our knowledge), would help us to evaluate how much of the developmental course of early language learning can be attributed to the infant’s processing of the information in the speech signal itself, and how much to other sources of information. Achieving this will require substantial effort in corpus collection and annotation, and in developing
speech-engineering tools. Ideally, this should be done in parallel with quantitative characterization of infants’ visual environments (e.g., Smith, Jayaraman, Clerkin, and Yu, 2018), since what infants see and hear obviously both contribute to word learning, and each may reinforce phonetic learning as well, via the lexicon.
13.8 Conclusions When infants experience people, they experience people talking. Parents, siblings, family friends, strangers, and even some sounding objects engage in this familiar, intricate, emotion-laden activity, which is as much a part of the infant’s early environment as smiles and milk. Infants are born recognizing the sound of speech, and very quickly, they capitalize on innate auditory abilities to characterize many aspects of the speech of their native community. In reviewing past research in this topic I have focused primarily on the questions and characterizations that have driven this work. The common narrative in which infants begin this process by clipping speech sounds into segments, statistically clustering those segments into categories according to the consonant and vowel inventories of their language, and then using those categories to build the vocabulary, appears to be too simple. The clipping is imperfect, the clustering may well depend on the precursors to the lexicon, and the early vocabulary may not be represented in terms of these discrete units, at least in part of the first year. What will this narrative look like ten years from now? New approaches may well lead to different emphases and somewhat different questions. For example, infants probably discover their first word-forms when they find that one chunk of speech is especially similar to another chunk heard recently (per, e.g., Park and Glass, 2008). Are consonants and vowels implicated in this similarity comparison at all, as they are in current infant models? If they are not implicated in newborns, then when are they, and why? Is it related to the infant’s vocal production? Word recognition is sometimes conceptualized as involving the activation of segmental categories, because language cannot be adequately explained without them (e.g., Pierrehumbert, 2016; Pisoni and Luce, 1987). But these categories cannot be recognized or learned without taking their context into account, because the specifics of their realization depend on many aspects of the context. Indeed, speakers often realize one sound by embedding a gesture toward it within another sound (Hawkins, 2010). Eventually, children must be able to interpret language not by slicing speech into segments and categorizing them, but by solving an attribution problem: for each perceivable aspect of the phonetic signal, what is its linguistic origin? Viewing speech perception as a sequence of categorization problems rather than an attribution or “blame assignment” problem may be a mistake (see Quam and Swingley, 2010, for discussion). Finally, evidence is building that infants’ own articulations help organize their interpretation of others’ speech (e.g., Vihman, 2017). We may find that the course of normal development depends critically on infants’ own somatosensory representations (e.g., Beckman and Edwards, 2000; Choi, Bruderer, and
Werker, 2019). If so, the corpora that we will need to create in order to model development adequately may come to involve not only microphones for parent and child, and head-mounted eyetrackers all around, but a scheme for measuring the infant’s own articulatory movements. Answering these questions will be a challenge, but the progress that research has made in the past several years suggests that quantitative explanations of the course of language development are reasonable goals to aim for.
Acknowledgment This work was supported by NIH grant R01-HD049681 to D. Swingley.
Chapter 14
Learning words amidst speech sound variability
Sarah C. Creel
14.1 Introduction In this chapter, I want to talk about how learners, particularly children, learn to recognize words in the face of sound variability present in any language. At a very simplified level, learners must associate word forms—regularities in the speech signal—with meanings. This assumes that the learner possesses representations both of word forms and of meanings. I am most interested in form, so when I say “word” here I am referring to form, and not to meaning (for more on meanings, see Gleitman and Trueswell, this volume). One assumption is that learners have to segment words from a continuous speech signal. Another common assumption is that, after a brief period of acquisition early in life, words are represented as sequences of speech sounds. For present purposes, by speech sound, I refer to the presumed set of speech perceptual categories in a language, such as ‘ee’ (/i/) compared to ‘ih’ (/ɪ/). (Throughout, letters in slashes are spelled in the International Phonetic Alphabet for readers who are familiar with it, but I include US English phonetic spellings throughout for those who are not.) For example, ‘ih’ and ‘ee’ in US English differ in their relative formant frequencies (vocal tract resonances) as well as relative duration. As implied by my use of the word relative, these speech sound categories do not include absolute formant frequency or speech rate information. If the learner encounters a novel sequence of speech sounds, then it must be a new word. As we will see, this assumption is not quite correct. I am skipping over the rather complex logical problem of word segmentation (though see Boll-Avetisyan, 2018, for recent coverage of this topic). I am also skipping over how one learns to produce words with variable sound patterns (though for more information on this topic see Bürki, 2018). What I am most interested in doing here is accounting for how one can learn words when the words themselves appear to vary in their sound patterns, sometimes even crossing the dividing line into another speech sound. Given
this reality, it is surprising that there has been relatively little exploration of how learning nonetheless proceeds.
14.2 Real-world settings where variability occurs Sound variability is rampant in spoken language (see Purse, Tamminga, and White, this volume). Speakers vary in their fricatives such as ‘s’ and ‘sh’ (Newman, Clouse, and Burnham, 2001), vowels (Hillenbrand, Getty, Clark, and Wheeler, 1995), and even stop consonants such as ‘d’ and ‘t’ (Allen, Miller, and DeSteno, 2003; Chodroff and Wilson, 2017; Theodore, Miller, and DeSteno, 2009). Male vs. female speech equivalencies (what male vowel formant values correspond to what female vowel formant values) vary by language, and thus presumably must be learned (Johnson, 2006). Despite this variability, enough structure occurs in one’s spoken language to allow recognition and production of sounds and words. Complicating this picture, there are numerous everyday situations where learners experience substantial variability in a word’s sounds. One such situation is phonological alternants of the same morpheme, such as the possessive marker -s, which sounds like ‘s’ in cat’s but ‘z’ in dog’s and ‘uhz’ /əz/ in mouse’s. Similarly, the ‘t’ in pat and the ‘d’ in pad are distinguishable, but in US English productions, they lose this distinction (it is neutralized) when -ing is added -patting sounds like padding. (Interestingly, 12-month- old infants may not detect a /t/in patting; see Sundara, Kim, White, and Chong, 2013.) Another example is sound change due to coarticulation, where, for example, the ‘n’ at the end of green sometimes moves closer to ‘m’ when spoken in the phrase green beans (e.g., Dilley and Pitt, 2007). Yet another common situation is casual or conversational speech. Yet another widespread source of sound variability in children’s (and adults’) input is sociolinguistic variability. Within a language community, socially linked variants of phonetic forms can pattern with gender, social class, and (in)formality of situation (see Nardy, Chevrot, and Barbu, 2013, for an overview; and references below in Section 14.3). Perhaps the best-studied case of within-word speech sound variability is accented speech. Both foreign accents and regional dialects can present sound patterns that sound to a listener like a different word. For example, a speaker from South Carolina might describe a writing implement (or a Philadelphia university) as ‘pin’ /pɪn/, while a speaker from New York City would describe it as ‘pen’ /pɛn/. A speaker from Mexico City might call the same writing implement or august university /pen/, which both of the first two speakers might identify as the English word pain /peɪn/. A related situation that is rarely considered as a contributor to input variability is the learner’s own errorful speech (Baese-Berk and Samuel, 2016; Cooper, Fecher, and Johnson, 2018). Early in the learning process, particularly in childhood, there are going to be production errors, due to motor difficulty, inaccurate perceptual representations,
294 Sarah C. Creel or both. Do these errors contribute to one’s speech representations? For example, a child may or may not have an adult-like perceptual representation of American English /ɹ/ but may produce something closer to adult /w/. Does this production alter their perceptual representations, and if not, why not? Similar questions obtain in second-language learning. I do not discuss self-speech errors further in this chapter, but I mention it here to illustrate how widespread speech sound variability is in the input. Thus, there are many well-acknowledged situations where listeners will encounter word forms with different sound patterns and should be aware of the similarity between them—as well as the contexts in which different variants occur. Given this, a learner or listener who is limited to matching exact sequences of sounds is going to have a difficult time. An expanding collection of research addresses how listeners might adapt to new listening situations, particularly unfamiliar accents or phonetic variants (e.g., Bradlow and Bent, 2008; Clarke and Garrett, 2004; Eisner and McQueen, 2005, 2006; Kraljic and Samuel, 2005, 2006; Norris, McQueen, and Cutler, 2003). Much of this work assumes lexical feedback—knowledge of what is a word and what is not—as a mechanism of learning. That is, if feesh /fiʃ/is not a word (lexical item), the speaker must be saying fish /fɪʃ/, therefore I should adjust my ‘ih’ /ɪ/category closer to ‘ee’ /i/. Yet many questions remain as to how adaptation takes place. Further, very little research asks what learning words looks like in high-variability settings. Unlike accent adaptation, lexical feedback is not available when learning new words. If we think the learner’s task is to infer the presence of a new word whenever they hear a not-yet- experienced sequence of speech sounds, and to group together word instances that contain the same sequence of speech sounds, the types of speech variability outlined above would seem to present a major difficulty to the learning process. If there is so much input variability, one might wonder how the learner ends up with speech sound categories to begin with, or what good it would do to recognize or encode them. However, the argument I put forth here is not so much about the usefulness or existence of speech sound categories; I rather argue that gradient similarity of words and sounds, without respect to category boundaries, is very important to learning words when sound variability is rampant.
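As a side note on the retuning mechanism assumed in this literature, the sketch below shows one minimal way to cash out the feesh/fish logic described above: a running estimate of a vowel category's center is nudged toward tokens that lexical knowledge has labeled. The category center, the token values, and the learning rate are all invented placeholders.

```python
# Lexically guided retuning, schematically. All numbers are invented placeholders.
ih_center_f1 = 430.0    # hypothetical F1 (Hz) center for the 'ih' /ɪ/ category
learning_rate = 0.1     # how strongly one lexically labeled token shifts the category

def retune(category_center, labeled_token, rate=learning_rate):
    """Move the category center a small step toward a token the lexicon vouched for."""
    return category_center + rate * (labeled_token - category_center)

# The talker produces something 'feesh'-like (lower F1, closer to /i/), but 'feesh'
# is not a word, so the lexicon labels the vowel as /ɪ/ (the word must be 'fish').
# Over repeated exposures the /ɪ/ category drifts toward that talker's productions.
for token_f1 in [390.0, 385.0, 395.0, 380.0]:
    ih_center_f1 = retune(ih_center_f1, token_f1)
    print(round(ih_center_f1, 1))
```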
14.3 Dimensions of word similarity: knowing when to duck Understanding word learning is to some degree a question about the elements of word forms themselves. The simple story that we tell is that words are distinct from one another because they differ in their constituent sounds. On this story, figuring out what the speech sounds of your language are is a critical step in learning, because not all sound differences indicate meaning differences. For a very young learner, this may be a rather complex problem. For example, the substantial acoustic differences between one’s
friend saying duck and one’s parent saying duck matter little to what word form it is. Yet the acoustically-subtler differences between one’s friend saying duck vs. buck, or duck vs. dock, are critical to word form identity. (This sort of similarity might explain why the author, as a young child, thought that writers of books were ‘arthurs.’) That is puzzling enough, but now imagine that a friend’s parent says dock /dɑk/ when referring to a duck. Is this a new word, perhaps a duck part, such as a wing or a beak? Or will a child recognize this as the intended sense of quacking animal? What if the friend’s parent and one’s own parent respectively refer to a new adult as Mr. Grok /grɑk/ and Mr. Gruck /grʌk/? Will this interfere with learning in the same way? One possibility is that learners respond readily to gradient similarity between word forms, irrespective of speech sound boundaries. Thus, dahk /dɑk/ is close enough to duck /dʌk/, and grok /grɑk/ is close enough to gruck /grʌk/. If we remove speech sound category boundaries as a limiting factor, though, how do we decide what range of variability or set of instances count as “the same” word? I assert that the learner is amassing detailed representations of all inputs and computes recognition by activating stored material in proportion to its similarity to the input as described in exemplar-style models (Goldinger, 1996, 1998; Hintzman, 1986; Johnson, 1997; Sumner, Kim, King, and McGowan, 2014). Another possibility is tracking central tendency and variability of particular words or particular speech sounds. However, this requires knowing how to classify the words or speech sounds to begin with, and in many cases, one may not know how to do this classification, especially early in learning. Further, I will argue that preserving fine detail in the lexicon may be crucial for permitting rapid comprehension. An additional source of information for determining what information is activated during recognition, or what to include in calculations of central tendency and variability, is to let context supervise the learning process. Essentially, some sort of context may label potential words or sounds. Perhaps a duck is physically present when various speakers say both duck /dʌk/ and dahk /dɑk/ (Smith and Yu, 2008; see also Yeung, Chen, and Werker, 2014). Perhaps ‘uh’ /ʌ/ occurs when /d_k/ is present (Feldman, Griffiths, and Morgan, 2009; Swingley and Alarcon, 2018; Thiessen, 2007). Perhaps someone produces an ambiguous sound ‘?’ which occurs in the context of other speech material where only one interpretation of the ambiguity yields a word (Kraljic and Samuel, 2005; Norris et al., 2003), such as dino?aur. Perhaps Spanish accent characteristics on one speech sound co-occur with Spanish accent characteristics on another speech sound (Gonzales and Lotto, 2013). Sociophonetic analyses of caregiver and child speech suggest quite strong contextual conditioning (roughly, formal vs. informal) on two versions of the same vowel in Scottish English (Smith, Durham, and Fortune, 2007). Lexically-specific patterns are also evident (Smith et al., 2007). Thus, other aspects of form, meaning, or situation might tell the learner what sounds or instances go together, both when words are initially stored and when they are recognized in real time.
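A minimal sketch of what "activating stored material in proportion to its similarity to the input" could look like, in the spirit of the exemplar-style models cited above. The stored traces, feature dimensions, and similarity function are all invented placeholders; the point is only that a dahk-like input partially activates stored duck traces without any prior decision about which speech sound category the input belongs to.

```python
import numpy as np

# Each stored trace pairs a crude acoustic sketch (F1, F2, duration; invented values)
# with the word it occurred in. No speech sound categories are represented.
traces = [
    {"word": "duck", "features": np.array([640.0, 1200.0, 110.0])},
    {"word": "duck", "features": np.array([620.0, 1250.0, 120.0])},
    {"word": "dock", "features": np.array([700.0, 1050.0, 140.0])},
    {"word": "deck", "features": np.array([580.0, 1750.0, 115.0])},
]

def activation(input_features, trace_features, bandwidth=120.0):
    """Graded similarity: falls off smoothly with acoustic distance, with no boundary."""
    d = np.linalg.norm(input_features - trace_features)
    return np.exp(-(d / bandwidth) ** 2)

def recognize(input_features):
    """Each candidate word is supported by the summed activation of its stored traces."""
    support = {}
    for t in traces:
        support[t["word"]] = support.get(t["word"], 0.0) + activation(input_features, t["features"])
    total = sum(support.values())
    return {w: round(s / total, 2) for w, s in support.items()}

# A 'dahk'-like input, acoustically between the stored duck and dock traces:
# duck and dock both receive substantial support; deck receives almost none.
print(recognize(np.array([670.0, 1120.0, 125.0])))
```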
Rather than storing words as sequences of speech sounds, an exemplar-style perspective in which continuous variation is stored in detail may help explain accent adaptation and learning with accent variability. For adaptation, learners might be storing new accented word forms along with contextual information, which would then
allow them to recognize those words and similar ones more rapidly when hearing them in the future. For learning words with accent variability, differently-accented variants of the same word would be similar to and would coactivate each other, depending on the degree of similarity between them. If different accents occur in separate contexts, then those contexts themselves could later focus activation. In both cases, storing detailed acoustic and contextual traces would allow learners to activate the most contextually relevant information for comprehension. Thus, when listening to a particular speaker or accent, or when listening in particular conditions, those subsets of instances would be more activated.
14.4 The evidence for gradient flexibility

I will talk about two situations suggesting strong sensitivity to gradient similarity across speech sound boundaries. The first situation is recognition of new variants of familiar forms, which on some accounts is a process of learning these variants as new word forms. The second situation is learning forms where the forms themselves are variable. My own research focuses on both young child learners (approximately 3–5 years old) and on young adults, so I discuss both age groups. A methodological note is in order. Many studies of accented speech processing use contrived accents, though quite a few use naturally occurring accented speech. Some artificial accents are simply systematic mispronunciations of sounds, such as having a speaker replace the ‘a’ sound in cat /æ/ with ‘eh’ /ɛ/, turning cat /kæt/ into ket /kɛt/. Others create a series of auditory morphs between one speech sound and another, for instance, by acoustically mixing ‘f’ and ‘s’ sounds together to create a sound partway between f and s. The major advantage of artificial accents is experimental control: by asking a speaker to systematically produce a different target speech sound, or by creating a continuum of stimuli between two endpoints, the researcher has a better idea of what is disrupting comprehension or what the listener might adapt to. In natural spoken material, one can at best make educated guesses as to the factors impeding comprehension or permitting adaptation. However, this experimental control comes at some cost to ecological validity.
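The logic of a stimulus continuum can be sketched in a few lines of Python. This is only a schematic illustration: published studies typically morph spectral properties or splice natural tokens rather than simply averaging waveforms, and the noise bursts below stand in for recorded ‘f’ and ‘s’ frication so that the example runs on its own.

import numpy as np

def make_continuum(sound_a, sound_b, n_steps=7):
    # Return an n-step series of weighted mixtures between two equal-length
    # sounds, from 100% sound_a to 100% sound_b.
    return [(1.0 - w) * sound_a + w * sound_b for w in np.linspace(0.0, 1.0, n_steps)]

rng = np.random.default_rng(0)
f_token = rng.normal(size=8000) * 0.3   # stand-in for an 'f' frication noise
s_token = rng.normal(size=8000) * 0.3   # stand-in for an 's' frication noise

continuum = make_continuum(f_token, s_token)
print(len(continuum), "steps; the middle step is the 50/50, maximally ambiguous blend")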
14.4.1 Recognizing words when the input is variable

14.4.1.1 Adult listeners comprehending and adapting to accented speech

Adult listeners adapt to accented speech, showing improved performance after listening experiences that might vary from moments (Clarke and Garrett, 2004) to hours over
multiple days (Bradlow and Bent, 2008). They also begin to produce at least some new forms after moving to a region where there is a different accent (see Nycz, 2015, for a review of second-dialect acquisition). One facet of this work that often goes unremarked upon is that recognition does not start out at zero, which suggests that spoken word recognition by partial similarity is commonly operative (see also work that looks at adults’ activations and confusions of similar-sounding words, such as Magnuson and Crinnion, this volume; Allopenna, Magnuson, and Tanenhaus, 1998; Creel and Dahan, 2010; Marslen-Wilson, Moss, and van Halen, 1996). Research on accent adaptation has tended to take one of two complementary approaches: a narrow focus that looks at a small number of speech sound changes, often just one, and a broad focus that looks at improvements in overall speech recognition as the result of exposure to naturally produced sentence material. Researchers who focus on changes to individual speech sounds tend to view the learner’s problem as adjusting to preexisting sound categories (Kleinschmidt and Jaeger, 2015). These studies indicate that presentation of words with ambiguous sounds partway between two speech sound category means (dino?aur) (Eisner and McQueen, 2005; Kraljic and Samuel, 2005, 2006; Norris et al., 2003), or even unambiguously different sounds (such as the nonsense word wetch /wɛtʃ/ in place of witch /wɪtʃ/) (Dahan, Drucker, and Scarborough, 2008; Maye, Aslin, and Tanenhaus, 2008; Weber, Di Betta, and McQueen, 2014) allow listeners to use word knowledge to retune their speech sound boundaries. How far does accent adaptation extend? In some cases, listeners generalize adaptation to new word contexts (McQueen, Cutler, and Norris, 2006; Maye et al., 2008). In other cases, the particular contrast under study does not exist in enough real words to allow a test of generalization, so generalizability is unclear (Dahan et al., 2008; Trude and Brown-Schmidt, 2012). Further, some studies find generalization of adapted speech sounds to new talkers (Kraljic and Samuel, 2006), while others do not (Eisner and McQueen, 2005; Kraljic and Samuel, 2005; Trude and Brown-Schmidt, 2012; Trude, Tremblay, and Brown-Schmidt, 2013). This is important because tracking particular talkers may be a mechanism that allows comprehenders and learners to keep track of speech sound contrasts that only some talkers use (Dahan et al., 2008; Trude and Brown-Schmidt, 2012). Perhaps generalization to new talkers depends on the particular type of speech contrast. One area where researchers report generalization to new talkers is in voice onset time (Kraljic and Samuel, 2006; though see Sumner, 2011). However, there is substantial, consistent intertalker variation in voice onset time (Allen et al., 2003; Chodroff and Wilson, 2017; Theodore et al., 2009), so it would be useful if learners could adapt to specific talkers’ voice onset time patterns rather than generalizing. Other researchers have favored a more holistic approach to assessing accent adaptation, with studies that present large numbers of sentences containing a wide range of speech sounds to look at overall recognition accuracy (Alexander and Nygaard, 2019; Bradlow and Bent, 2008; Sidaras, Alexander, and Nygaard, 2009; Tzeng, Alexander, Sidaras, and Nygaard, 2016). Given that listeners are tested on different sentences than the ones they were trained on, adaptation generalizes to new words.
These studies mostly
298 Sarah C. Creel find generalization to new talkers, provided that there were multiple talkers at adaptation. Greater benefits from multiple-talker adaptation are consistent with previous findings in second-language speech sound acquisition (Bradlow, Pisoni, Akahane- Yamada, and Tohkura, 1997; Lively, Logan, and Pisoni, 1993; Logan, Lively, and Pisoni, 1991). In some cases, improvement appears to be limited to speakers of the same accent (Alexander and Nygaard, 2019; Bradlow and Bent, 2008), though at least one study demonstrates generalization from training talkers with multiple accents to a test speaker with a previously unheard accent (Baese-Berk, Bradlow, and Wright, 2013). As Baese-Berk et al. (2013) point out, this may be due to the rarity of certain sounds in the target language (English), which speakers of multiple other languages modify in similar ways. A finding from Sebastián-Gallés and colleagues bears on this situation (Sebastián- Gallés, Echeverría, and Bosch, 2005; see also Sumner and Samuel, 2009), even though those researchers do not explicitly discuss it as accent adaptation. They find that native Catalan speakers who distinguish the Catalan sounds /e/and /ɛ/(something like English date and debt vowels) nonetheless recognize mispronounced */gəʎeðə/ (which should be /gəʎɛðə/) as a word, more often than the reverse mispronunciation, such as */finɛstɾɑ/(which should be /finestɾɑ/). Sebastián-Gallés et al. (2005) infer that these listeners have learned an additional word form from Spanish-accented speakers, who have only a single /e/-like vowel category (something like English date) in the region of Catalan’s /e/and /ɛ/, and thus produce words like galleda with something more like the Catalan /e/sound. This suggests that learning additional word form representations, or perhaps representations of sub-word components, may be a mechanism of accent adaptation. As I will discuss later, learning detailed sound representations may facilitate accented speech processing generally. What Sebastián-Gallés et al.’s (2005) findings do not really address is whether knowing the sound form /gəʎɛðə/helps one to learn the sound form /gəʎeðə/or what happens when one is learning both forms at the same time.
14.4.1.2 Child listeners comprehending and adapting to accented speech

Of course, the above findings are all from young adults, who have had a substantial amount of language experience. A small literature has begun addressing whether young children adapt to accents (for more on infant speech sound and word learning, see Demuth, this volume, and Swingley, this volume). Schmale and colleagues find that infants are able to recognize (show differential responses to) familiarized words over an accent change by 12–13 months (Schmale, Cristía, Seidl, and Johnson, 2010; Schmale and Seidl, 2009). Still, Mulak, Best, Tyler, Kitamura, and Irwin (2013) found that Australian English-learning children cannot recognize words in Jamaican-accented English until 19 months, a finding replicated by van Heugten and Johnson (2014), though the latter authors also found that exposure to the accent permitted recognition. However, Swingley and Aslin (2002) found partial recognition of mispronounced words in 14–15-month-old children, suggesting that under some circumstances children under 19 months are sensitive to accent-like variants (though those authors’ point was more that children can detect the difference between better vs. worse pronunciations). White and
Learning words amidst speech sound variability 299 Aslin (2011) reported that 19-month-olds who hear a raised vowel in words like sock (so that it sounds like sack /sæk/) while seeing a picture of a sock are shortly thereafter able to visually fixate a picture of a dog upon hearing dag /dæg/. This implies learning of a particular accent feature as well as generalization to new words. Van Heugten and colleagues (van Heugten and Johnson, 2016; van Heugten, Krieger, and Johnson, 2016) find that Canadian-English-learning toddlers simply understand Australian and Scottish English accents by 25–28 months both with and without exposure to those accents in the lab, as measured by visual fixations to pictures named in the accent. This suggests that by the age of 2 years or so, children are skilled at recognizing similarities of accented words to their own word representations. Further research extends to children aged 3 years and older. This literature tends to use different recognition measures than infant studies: verbal report of what was heard (word repetition) rather than infant measures such as looking-while-listening or preferential looking to sound source. Nathan, Wells, and Donlan (1998) found that child Londoners had difficulty understanding Scottish-accented words, as measured by their spoken word repetition accuracy. Bent and colleagues (Bent, 2014; Bent and Atagi, 2015; Holt and Bent, 2017), who also used verbal report measures, have more recently reported that children between ages 4 and 7 years have more difficulty than adults at repeating accented words heard in noise, and show improvement with age. This seems somewhat at odds with the above infant and toddler research suggesting good comprehension of accented speech by 2 years or so. These apparent discrepancies may reflect differences in the methods typically used to test infants (picture recognition, essentially a two-to four-choice task) and those used to test older children (word repetition, essentially an open-ended task). Creel (2012) tested 3–5-year-olds’ comprehension of accented speech but used a visual-world eye tracking experiment rather than word repetition. Children saw four pictures at a time and heard one of them labeled, as their eye movements to pictures were tracked. The use of visual fixations to a small set of response alternatives (pictures) is more similar to research on toddler accent comprehension. Some of the labels that children heard contained accent-like mispronunciations of familiar words modeled on the artificial accent used by Maye et al. (2008), such as fesh /fɛʃ/ for fish. Contrasting with the research discussed above (Bent, 2014; Bent and Atagi, 2015; Bent and Holt, 2018; Holt and Bent, 2017; Nathan, Wells, and Donlan, 1998), Creel found that 3–5- year-olds were well above chance in recognizing accent-like mispronunciations, both in terms of pointing accuracy and in terms of visual fixations of the named picture. For single-feature mispronunciations, they were sometimes close to ceiling. They also rarely selected novel pictures, which is the presumed response if the child thinks they are hearing a novel word (see related work by Swingley, 2016). The differences in performance between Creel (2012) and Bent’s and Nathan’s work raise questions about why children might perform so differently. 
Is it due to visual-world experiments providing supportive context (a limited set of alternatives), or to differences in accent strength between the real accents used in Bent’s and Nathan’s work and the pseudo-accent in Creel (2012)?
300 Sarah C. Creel Creel, Rojo, and Paullada (2016) addressed this puzzle by conducting two sets of experiments which presented children with natural-accented materials in two different tasks, one more like Creel (2012) and one more like Bent and colleagues (Bent, 2014; Bent and Atagi, 2015; Holt and Bent, 2017). The first two experiments tested children’s comprehension of Spanish-accented English speech, with American-accented English control trials, again using a four-picture display and eye tracking. Pointing accuracy and looks were high for both accents in the two experiments, and sensical sentence contexts also aided comprehension. The second pair of experiments presented the same isolated words and words in sentences, but instead of asking children to point to the named picture, we asked children to repeat the word (or, for sentences, the last word of the sentence). They were markedly less successful compared to the picture-pointing experiments and showed more statistically reliable effects of the unfamiliar accent. What this study suggests is that children’s difficulty with accented speech increases as supportive context decreases. This partly resolves the discrepancies in findings across studies: studies that present more context, such as a limited set of visible response choices or supportive sentence contexts, tend to find better performance than those with less context. A remark is in order concerning adaptation in child studies. While there is evidence of adaptation in infant and toddler studies, few of my studies have shown adaptation, and studies conducted by Bent and by Nathan and colleagues do not remark on it. Assuming that those authors did not find adaptation, there are a few reasons that adaptation may not have been observed. One reason may be that children need much more exposure than was presented in those studies to adapt to the accent. Another reason there might not be much adaptation in Creel et al. (2016) is that children were hearing American-English-accented material interspersed with Spanish-English-accented material, as well as substantial talker variability. Thus, there was no single accent to adapt to, nor could they track a small number of talkers (e.g., two). Yet another reason is that children may have less knowledge to base adaptation on than adults do: If children’s word representations are weaker than adults’, they are less likely to be able to engage in lexically guided retuning. This is consistent with findings that greater vocabulary size predicts better accented speech recognition performance in children and adults (Bent, 2014; Bent, Baese-Berk, Borrie, and McKee, 2016). A final reason is that children tested in San Diego may be pre-adapted to Spanish-accented speech via community exposure. However, pre-exposure seems unlikely to account for findings from Bent and colleagues, who have used a range of accents unlikely to be prevalent in their testing location (Bent, 2014; Bent and Atagi, 2015; Bent and Holt, 2018; Holt and Bent, 2017).
14.4.1.3 Accented speech: summary of recognition and adaptation

A range of studies assess comprehension of accented and mispronounced speech. Adult and child learners can leverage existing word knowledge to recognize accented variants of words and, at least for adults, to improve comprehension of that accent. Questions remain as to how far accent adaptation generalizes—to new sounds, words, speakers?
—and what mechanisms permit adaptation. I return to these questions later, but first, I turn to a second area of inquiry.
14.4.2 Learning words when the input is variable

Another area of research explores how learners cope with variable input. In this case, there is no teaching signal or labeling function like that provided by lexical feedback: for all the language learner knows, wetch is a word as much as witch is. There is still context, in that two forms (say, Mr. Gruck /grʌk/ and Mr. Grok /grɑk/) may both occur in the presence of a particular referent (say, a novel adult). Learning with variable input has been addressed in language evolution, particularly in the domains of syntactic creolization (Hudson Kam and Newport, 2005; Hudson Kam and Newport, 2009) and simulated language evolution (Kirby, Tamariz, Cornish, and Smith, 2008; Scott-Phillips and Kirby, 2010; Smith and Wonnacott, 2010). Within psycholinguistics, variable input has been explored much less in the phonetic domain other than studies of bimodal vs. unimodal distributional perceptual learning (Maye, Werker, and Gerken, 2002; Pająk and Levy, 2011). An interesting line of research in sociolinguistics addresses children’s production of variable forms (Roberts, 1997, 2004; Smith et al., 2007; Smith, Durham, and Richards, 2013; see review in Nardy et al., 2013), though for the most part it does not directly address comprehension or rate of learning with vs. without variability (though see Miller and Schmitt, 2012).
14.4.2.1 Input variability: natural experiments A few studies have examined natural experiments, cases where children are learning in multiple accents. Floccia, Delle Luche, Durrant, Butler, and Goslin (2012) tested 20- month-olds learning English from their parents and community. Some parents were non-rhotic speakers, that is, they omitted coda (word-final) r’s from syllables, while the community were largely rhotic speakers, who do not omit r’s. Results suggested that children recognized words more readily in the community accent. This is interesting and somewhat surprising in that it implies that some of the children who were tested had difficulty understanding some words in their own parent’s accent. In a different study, Buckler, Oczak-Arsic, Siddiqui, and Johnson (2017) found that 24-month- olds who were exposed to multiple accents of English recognized Canadian-English- accented words more slowly than mono-accentual (Canadian-English) children did, though the groups performed similarly by 34 months. While there are multiple ways to interpret this finding, one interpretation is that exposure to more than one accent may make comprehension in each accent more effortful. A follow-up study to Creel et al.’s (2016) accented speech comprehension study examined a similar question to Buckler et al. (2017) in older children, ages 3 to 5 years. Creel (2016) looked at English monolingual and Spanish-English bilingual children’s comprehension of Spanish-accented English. To create a challenging listening situation, words were tokens that were difficult for adults to identify in an open-ended transcription
302 Sarah C. Creel task (average accuracy 12%). This study made the assumption that bilingual children would have more exposure than monolingual children to Spanish-accented English as well as some exposure to American-accented English, and might thus perform better than monolinguals on Spanish-accented English. Children saw four pictures at a time and heard a word or sentence that mentioned one of them, either in Spanish-accented or US-accented English. The result was somewhat unanticipated: children exposed to both Spanish and English in the home showed lower picture-pointing accuracy and fewer visual fixations relative to monolinguals for both US accents and Spanish accents. This may indicate differences in English vocabulary size between the two groups but does not suggest that bilingual children have an accented speech comprehension advantage from learning Spanish-accented variants of words. It is also consistent with Buckler et al.’s (2017) findings that dual-accent exposure may slow word recognition. All of the situations described above consider cases where the variability of interest occurs between speakers. Miller and Schmitt (2012) examined a case where variability occurs within, rather than across, speakers. They reported that children took longer to learn to produce a Spanish plural marker -s when acquiring a dialect where speakers sometimes omit the marker (Chilean Spanish) than a dialect where speakers do not omit it (Mexican Spanish). This production finding is consistent with the comprehension studies described above in showing slowed acquisition when variable forms are present. However, it is important to contextualize this with evidence that some sound variation that is partly conditioned by social, phonological, or grammatical context is learnable (e.g., Roberts, 1997, 2004; Smith et al., 2007; Smith, Durham, and Richards, 2013), in that adults’ variable usage patterns show up in children’s productions. In short, cross-accent and within-accent variability may provide a challenge in very early childhood, but variability in children’s input is widespread and they begin to produce that variability themselves during the preschool years.1
1 As Roberts (2004) notes, it is somewhat challenging to look at children’s productions of variable forms because child speech itself is variable—that is, a child may drop a coda consonant because adults around them do, or because the child is still learning the motor control to produce that consonant.

14.4.2.2 Sound pattern input variability: lab experiments with adults

A second line of studies explores novel word learning with word-internal sound pattern variability in the lab. To my knowledge, my own lab is the only one to address this exact phenomenon to date (though see White, Peperkamp, Kirk, and Morgan, 2008; Witteman, Bardhan, Weber, and McQueen, 2014 for related work). Three sets of experiments are relevant (Creel, 2014; Frye, 2018; Muench and Creel, 2013). The basic logic of each of these studies is as follows. If speech sounds are truly critical to separating words from each other, then learners should be poor at learning dual labels for things, even if those labels are similar. However, if this sort of accent-like variability is not terribly detrimental to learning, then learners should perform well. The first experiment in Muench and Creel (2013) investigated this phenomenon. The experiment asked whether adults could learn words from two different artificially accented speakers. The accent difference used was the front-vowel shift from Maye et al. (2008): the words deege, div, deg, dazz (/didʒ/, /dɪv/, /dɛg/, /dæz/) in one accent were pronounced in the other accent as didge, dev, dag, dahz (/dɪdʒ/, /dɛv/, /dæg/, /dɑz/). Each accent was produced by one of two (male or female) speakers, meaning that accent variation was conditioned on the speaker. Listeners learned words for 16 novel pictures. Two-accent learners (the same picture is labeled both deege /didʒ/ and didge /dɪdʒ/) were compared to learners who heard only a single accent (a picture is only labeled deege /didʒ/). Learning in the two-accent condition was not difficult and appeared to affect accuracy mainly when accents created cross-accent competitors, such as didge /dɪdʒ/ in one accent appearing as the competitor picture to div /dɪv/ in the other accent. The second experiment in Muench and Creel (2013) expanded on the first by testing a variety of artificial accents: single-accent; two-accent (as in the first experiment); farther-apart vowels (deege /didʒ/ and dedge /dɛdʒ/); coda (word-final) consonant variation (deege /didʒ/ and deev /div/); and dissimilar words (deege /didʒ/ and vig /vɪg/). The dissimilar-words condition, which used the same 32 words as the two-accent condition in a different word-to-picture mapping, controlled for the possibility that adult learners are simply really good at learning any two different words for each picture. The two-accent condition again showed learning almost as good as the single-accent condition, while the dissimilar-word condition was quite difficult: it took learners much longer to reach the same accuracy level during training, and they showed lower test-trial accuracy. The two conditions with more-distant segment changes (farther-apart vowels and variable coda consonants) fell in between the two-accent and the dissimilar-word condition. Both experiments tested learners of both monolingual and bilingual language backgrounds. The idea was that bilinguals might excel because (a) they are likely to be more accustomed to accent variability from being around second-language speakers, and (b) they are more likely than English monolinguals to speak a language that does not distinguish the sounds that differentiate dual labels. However, the first experiment showed no bilingual advantages, and bilingual learners in the second experiment were more negatively affected by moderate pronunciation differences than monolinguals or highly English-dominant bilinguals. Overall, this study suggests that accent-like variability has fairly weak consequences for word learning in adults. Frye (2018) replicated and extended these findings. As in Muench and Creel (2013), learners learned under single-accent, dual-accent, or dissimilar-word conditions. Unlike Muench and Creel (2013), all words were produced by a single talker, so the accent variation occurred within a single talker. Further, the dual-accent conditions used a new set of vowel pairings based on tense/lax distinctions rather than vowel shifts (such as beesh /biʃ/ and bish /bɪʃ/), and tested consonant voicing differences (such as deev /div/ and teev /tiv/) that were more systematic than the different-consonant condition in Muench and Creel (2013). To sum up a complex set of results, the findings of Muench and Creel (2013) held for the new vowel pairings and for consonant voicing differences in both onset and coda positions.
That is, adults were quite good at learning beesh /biʃ/ and bish /bɪʃ/as labels for the same picture, and at learning beesh /biʃ/ and peesh /piʃ/ as labels for the same picture. This was again compared to a condition where learners had
304 Sarah C. Creel to learn dissimilar labels for the same picture, such as zoof /zuf/ and feff /fɛf/, which was much more difficult. Results also extended to phonetically-Spanish-like novel words, which contained starker vowel differences due to Spanish’s five-vowel system (for example, /belu/and /bilu/). As in the earlier Muench and Creel (2013) study, Frye found similar effects for English monolingual and English-Spanish bilingual learners, but again, there was no clear advantage for bilinguals who would presumably be more accustomed to variable accents. Spanish-English bilinguals performed less accurately than English monolinguals on English-like novel words, but slightly (nonsignificantly) better on Spanish-like novel words. However, these accuracy differences were across the board, not restricted to the dual-accent conditions. This hints at possible own-accent advantages, but does not suggest any advantage for previous exposure to accent variability or having one language that does not distinguish the sound categories of the two artificial accents. In any case, Frye’s work suggests that noticeable differences between dual labels in both consonants and vowels have only mild effects on word learning, even when a single talker produces different forms in free variation.
14.4.2.3 Speech sound variability: lab experiments with children

A relevant question is whether these results extend to younger learners. All of the above studies were run with college student (adult) learners, who have many skills they can marshal in learning novel vocabularies. The accepted wisdom is that children have trouble learning additional words for the same object (Golinkoff, Mervis, and Hirsh-Pasek, 1994; Markman and Wachtel, 1988; though see, e.g., Matthews, Lieven, and Tomasello, 2010, for evidence that children know numerous synonyms). This observation, combined with the assumption that learners interpret a speech sound change as a meaning change, suggests that children would find dual-accent learning very confusing. However, recall that mispronunciation studies (Creel, 2012b; Swingley, 2016) suggest that young children are fairly willing to treat a mild mispronunciation as a similar word. There is also evidence that young children (ages 4–5 years, perhaps older) are less adept than adults are at distinguishing very similar native-language speech sounds (Gerken, Murphy, and Aslin, 1995; Hazan and Barrett, 2000; Treiman, Broderick, Tincoff, and Rodriguez, 1998). In the learning situation tested here, this might be an advantage: children’s relative insensitivity to speech sound variation might make them quite good at learning with minimal speech sound variability. The actual outcome seems to be somewhere in between poor and perfect performance, as one might expect on a gradient sensitivity account. Creel (2014), in a series of five experiments, found that children who learned two words (g_f and k_b with either the vowel ‘ee’ /i/ or ‘ih’ /ɪ/) were affected by accent variability. Children showed lower accuracy if they learned in one “accent” (e.g., with picture 1 being geef /gif/, and picture 2 keeb /kib/) and were then presented with a different “accent” (giff /gɪf/, kib /kɪb/). Learners also showed some loss in accuracy if they learned one accent per speaker and then the speakers switched accents. However, in that series of studies, the two words themselves were also similar in onset position (‘g’ and ‘k’ are quite similar, both velar
Learning words amidst speech sound variability 305 stop consonants), which may have increased difficulty by introducing phonological competition between the two words. This is a different source of difficulty than within- word label inconsistency. Frye (2018) followed up this work by testing children using the set of 32 words from Frye’s experiments with adults (see also Frye and Creel, under review). This made it possible to test a large number of minimal pairs across children (16 word pairs with tense vs. lax vowels and 16 pairs with consonant voicing differences), which speaks to generalizability. Each individual child learned two words in a single-label condition, and another two words in a dual-label (two-accent) condition, with order counterbalanced across children. For 32 children, the two-accent words varied in onset consonant voicing; for the other 32, the two- accent words varied in vowel tenseness. Labels for the two tested pictures in all conditions were phonetically dissimilar from each other. For example, deev /div/ and teev /tiv/, the labels for picture 1, were dissimilar from vayfe /veɪf/ and fayfe /feɪf/, the labels for picture 2. This means that phonological competition cannot explain any heightened difficulty found in the two-accent conditions. Results showed that young children had lower accuracy for two-accent learning (70%) than one-accent learning (78%), but were above chance in both cases. (Interestingly, counter to the finding that consonants are more critical to word identity (Nazzi, 2005; Nespor, Peña, and Mehler, 2003), vowel variability was numerically more difficult to ignore than consonant variability.) A follow-up experiment by Frye and Creel suggests that children find it even more difficult to learn dissimilar labels (deev /div/ and tayge /teɪdʒ/as labels for picture 1). In any case, children appear affected by accent-like variability in word learning, but they learn to some degree nonetheless.
14.4.2.4 Summary of learning with variable sound structure

Learning with variability appears to present comparable challenges to children and adults. It is sometimes measurably more difficult than learning with a single accent, but is much easier than learning two unrelated words for the same picture. This suggests that speech sound category variability may not be an insurmountable obstacle to language learning. Still, this work has yet to be extended to pairs of real-world accents, other than the previously mentioned natural experiments. One might think that children and second-language learners would have an advantage over adults and native learners because children and second-language learners tend to have less well-defined speech sound category boundaries. However, at least some evidence discussed above suggests that learners with less experience in a language do not have an advantage, and may have more difficulty, in learning variable words than learners with more experience in the phonetics and phonology of the to-be-learned words.
14.5 A brief summary of variability

I have discussed two situations where listeners must deal with speech sound variability: recognizing familiar words with accent-like characteristics and learning new words with accent-like variability. In both situations, listeners show increased difficulty when accent-like variability is present, but the degree of difficulty depends on a variety of factors: age, familiarity with the language, vocabulary knowledge, supportive sentential (semantic or discourse) context, and supportive visual context. Even young children (2–5-year-olds) mostly recognize mispronounced or accented versions of familiar words as the words themselves, at least when response alternatives are constrained by visual or sentential context (Creel, 2012a; Creel et al., 2016; Swingley, 2016). One facet of this research that might seem counterintuitive is that listeners who presumably have weaker perceptual sensitivity—children and second-language speakers—appear more susceptible to disruption from sound variability. Children are more affected than adults by accent variability (Bent, 2014), and second-language speakers sometimes show more difficulty than native speakers in learning words with variable speech sounds (Frye, 2018; Muench and Creel, 2013). The reason that this is surprising is that one might think that perceivers with broader perceptual tuning would be relatively better at recognizing near misses. Yet it may be that less-precise perceptual tuning reflects representations that are less rich than those of adults and native speakers. This representational-richness account also fits with evidence that children and second-language speakers have heightened difficulty comprehending speech in non-ideal listening conditions (children: Elliott, 1979; Fallon, Trehub, and Schneider, 2000; Fallon, Trehub, and Schneider, 2002; second-language speakers: Scharenborg, Coumans, and van Hout, 2018). On an exemplar account, there will be a less densely populated exemplar space earlier in learning and, as a result, a weaker response from activated exemplars. One might argue that a more densely populated exemplar space and stronger echo provide more top-down perceptual feedback to facilitate recognition in adverse listening conditions, as well as more sensitivity to incomplete similarity structure during word learning. A related possibility is that adaptation itself may be a process of learning similar-but-new word form representations. Thus, learners who already have rich perceptual representations of many words (as in the case of adults, and native speakers) would have more source material for creating new word form representations.
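The claim that a sparser exemplar space yields a weaker response can be illustrated with a toy calculation. This is a sketch only: the feature vectors, noise levels, and trace counts are invented, and activation is a simple exponential function of distance rather than the cubed similarity of Hintzman’s MINERVA 2. The point is just that summing graded activations over many stored traces produces a stronger echo for a degraded probe than summing over a few.

import numpy as np

rng = np.random.default_rng(1)
target = np.array([0.2, 0.8, 0.1])   # "true" form of a word (invented features)

def echo_strength(n_traces, trace_noise=0.15, probe_noise=0.3):
    # Store n noisy experiences of the word, then probe with a degraded token
    # and sum the graded activations of all stored traces.
    traces = target + rng.normal(0.0, trace_noise, size=(n_traces, 3))
    probe = target + rng.normal(0.0, probe_noise, size=3)
    return sum(float(np.exp(-np.linalg.norm(t - probe))) for t in traces)

for n in (3, 30, 300):
    print(n, "traces -> echo strength", round(echo_strength(n), 2))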
14.6 Areas ripe for future inquiry 14.6.1 Is all accent variability created equal? One area of uncertainty is the extent to which studies of mispronunciation comprehension and continua of ambiguous speech sounds map onto comprehension of real accents. Mispronunciations can resemble some accent properties, especially when an accented variant (say, accented ‘ih’ /ɪ/) lands squarely within the region of a different native-language sound (‘ee’ /i/). “Ambiguous midpoint” studies also represent some accent properties, such as accented ‘ih’ /ɪ/being shifted toward ‘ee’ /i/spectrally but with ‘ih’ /ɪ/-like duration. One way in which these two types of accents may differ is that there is likely more variability across speakers of real accents than occurs in artificial accent manipulations (though see next paragraph). Still, findings from simulated accents and real accents seem to resemble each other qualitatively, which is reassuring. Nonetheless, more work can be done to connect simulated and true accents. Perhaps a larger challenge is connecting studies of single-word comprehension to comprehension of running speech, the latter requiring segmentation and syntactic analysis, which may both be affected by accent characteristics. Related questions arise about the relationship between regional accent and non-native accents. A widespread notion is that regional accent variability is processed more easily, or at least differently, than non-native accents. From my perspective, there is not enough evidence to say. As with the relationship between artificial accents and real accents, this probably depends on both the strength of the accent and the familiarity of the accent. A mild Spanish accent where the vowel in fish is moved slightly toward ‘ee’ /i/may be more readily comprehended than an Australian accent where the vowel in fish most definitely sounds like ‘ee’ /i/. Still, one line of thinking is that an Australian accent in English (native) may have more ‘signal’ than a Spanish accent in English (foreign) in the sense that the variability is more structured. A native Spanish speaker may produce English sheep-vowel and ship-vowel words with less-different central tendencies than an Australian English speaker. They may also produce sounds with higher variability because the sounds of their second language (here, English) are pulled toward the sounds of their first language (here, Spanish). Both factors would tend to make the sounds more confusable with each other. In addition, even if a single speaker is consistent with themselves, speakers as a group may show more variability than a group of native speakers because different individuals will approximate native-speaker norms to a greater or lesser extent. Thus, differences in both central tendency and variability in principle would allow more exact adaptation or learning of native accent properties than non-native accent properties. However, some new evidence contradicts this variability story in illustrating cases where native speakers are at least as variable as non- native speakers (Vaughn, Baese-Berk, and Idemaru, 2019). This strongly suggests a need
for researchers to closely examine the nature of variability in actual second-language speech, both within and across speakers.
14.6.2 Learning with variability: what is best for comprehension? One might ask if it is better for a learner in the long run to learn one version and then be flexible in recognition later, or to learn variant forms from the beginning. This essentially is the question of whether one can learn from noisy input, as in creole language formation, or even a second-language learning setting with a diversity of first languages. Some researchers have made arguments that children regularize to such an extent that input noise should not matter much (Singleton and Newport, 2004). However, this research largely concerns syntax rather than word or sound learning and assumes that one form is more probable than the rest. If children regularize in word form learning, what do they regularize to—the central tendency across two variants (such as halfway between ‘ee’ /i/and ‘ih’ /ɪ/), the more frequent form, the form that more speakers produce, the form that sounds more like them, the higher status form (see Sumner et al., 2014)? Effects may also depend on whether variability is systematic or random. If it is systematic, that is, conditioned on some sort of context, then learners might discover where each sort of variability occurs: before nasal consonants, in adolescent women, in informal settings (see, e.g., Miller and Schmitt, 2012; Roberts, 1997, 2002; Smith et al., 2007, 2013). However, if two forms appear to be in free variation, or if learners are unable to represent context, there seems to be a greater danger of contrast loss. Thus, in that situation, it may be preferable for learners to learn a single accent prior to generalizing to another one. This is consistent with comprehension findings in children exposed to multiple accents (Buckler et al., 2017; Creel, 2016), though note that sociophonetic work (e.g., Miller and Schmitt, 2012; Roberts, 1997, 2002; Smith et al., 2007, 2013) suggests that slightly older children are capable of producing variants in their own speech.
14.6.3 What happens when communicating with multiply-accented talkers at once?

This parallels the question of learning from multiple accents but has been much less considered in accented speech comprehension. Much of this work assumes that the listener is hearing a single accent at a time, but this is often not the case: consider diverse faculty senate meetings, international conferences, or conference calls at multinational companies. One possibility is that listeners can adapt to multiple different accents and switch between them. However, the largest number of accents explored in adaptation studies is two, typically with one speaker of each accent. Trude and Brown-Schmidt’s (2012) extension of Dahan et al. (2008) suggests that learners can track two accents from
two dissimilar talkers (male and female). Relatedly, we know that listeners can keep track of two sets of speaker-specific voice onset time patterns (Allen and Miller, 2004; Theodore et al., 2009). It is less clear whether one can track more than two speakers independently of each other, particularly when speakers are similar (see Vitevitch, 2003, on change deafness for voices). It is also unclear whether it takes effort to switch among accents. Another possible solution to comprehending in multiple accent situations is simply to relax all speech sound boundaries, a possibility which Schmale, Seidl, and Cristia (2015) refer to as general expansion. Thus, any close-enough variant will be acceptable from any talker, regardless of whether accent-related alterations have been observed. A strong version of this hypothesis, though, would create substantial homophony, which would slow down comprehension even of native speech by activating many spurious matching words that could have been ruled out by finer-grained (non-expanded) processing. A third solution would be not to adapt at all, and simply recognize words via partial similarity, with concomitantly slowed comprehension for any accented speakers. However, given that adult listeners often do adapt to single speakers, it is unclear how they would know to not adapt when hearing multiple speakers. A final possibility invokes detailed exemplar-style knowledge. Rather than changing all of one’s recognition presets from moment to moment, careering wildly from accent to accent, listeners might directly access detailed word form representations or speech traces that contain accent detail. Detailed word and sound representations would allow recognition of other cues besides the clearly changed sound itself that might aid the listener in telling apart different accents. For instance, hearing dahk /dɑk/ with a prevoiced dentalized /d/ (as a Spanish speaker might talk) might activate previously-experienced Spanish-accented word forms and the associated duck meaning, while non-prevoiced dahk /dɑk/ (as an English speaker might talk) would activate English-like forms that map to a ‘dock’ meaning (see Gonzales and Lotto, 2013; Ju and Luce, 2004, for language-specific activation of similar words). Activating all of these detailed, previously learned representations—without directly activating cognitive representations of the talkers themselves—might permit listeners to recognize a range of accents in close succession without incurring switch costs.
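This last possibility can be illustrated with a small extension of the earlier exemplar sketch. The features, tokens, and numbers are hypothetical, invented only to show the mechanism: a subphonemic cue (prevoicing on the initial stop) stored with each trace pulls an otherwise ambiguous /dɑk/ token toward previously heard Spanish-accented duck exemplars when the cue is present, and toward native-English dock exemplars when it is not, with no explicit accent switching.

import numpy as np

# Each stored trace: (meaning, [vowel feature 1, vowel feature 2, prevoicing]).
traces = [
    ("duck", np.array([0.30, 0.70, 0.0])),   # native-English duck token
    ("duck", np.array([0.55, 0.50, 1.0])),   # Spanish-accented duck: shifted vowel, prevoiced /d/
    ("dock", np.array([0.60, 0.40, 0.0])),   # native-English dock token
]

def support(probe):
    # Summed graded activation per meaning, as in the earlier sketch.
    out = {}
    for meaning, trace in traces:
        out[meaning] = out.get(meaning, 0.0) + float(np.exp(-2 * np.linalg.norm(trace - probe)))
    return out

# The same ambiguous 'dahk' vowel, with vs. without prevoicing on the /d/:
print("prevoiced ->", support(np.array([0.57, 0.45, 1.0])))   # duck receives more support
print("plain     ->", support(np.array([0.57, 0.45, 0.0])))   # dock receives more support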
Part IIB

MEANING

Chapter 15

How Learners Move From Sound To Morphology

Katherine Demuth
15.1 Introduction

Researchers have often approached the language learning problem from a modular perspective, examining children’s acquisition of syntax, semantics, phonology, or morphology. However, recent research suggests that learners use multiple types of information from different domains to construct their early grammars. This chapter reviews findings at the phonology/morphology interface, showing that the acquisition of grammatical morphology cannot be addressed without also understanding more about the development of the phonological system, how this interacts with the realization of lexical and prosodic words, how grammatical items themselves are phonologically and prosodically instantiated, and the frequency with which these occur in the ambient language (see also Embick, Creemers, and Davies, this volume). This approach helps make sense of the long-standing problems of variability in children’s use of grammatical morphemes in obligatory contexts, providing a framework for making predictions about the course of morphological development, both within and across languages. How do children learn the grammar of language? In particular, how do they come to learn the difference between lexical items and the smaller grammatical parts of language—including function words such as articles, pronouns, auxiliaries, and inflectional morphemes such as plurals and agreement/tense marking? Obviously, the speech stream must first be parsed into lexical and morphological units (see also Swingley, this volume; Creel, this volume). But then, the child needs to be able to store, access, organize, and produce these when constructing a sentence of their own. It is the emergence of this production process that is the central focus of this chapter: once lexical items and grammatical morphemes are segmented, perceived, and comprehended, how does the child begin to use them when producing sentences of their own?
Becoming a competent speaker of a language is a protracted process that shows much within-speaker variability, even for the same grammatical morpheme, which may inconsistently appear in obligatory contexts. This process has therefore been of significant theoretical interest for decades. Some have suggested that the overall order of acquisition of grammatical morphemes is due to semantic complexity (e.g., Brown, 1973), but this does not address the issue of why a given morpheme may sometimes be used, and sometimes not. Others have attempted to explain this variable use as evidence of an incomplete syntax, where grammatical morphemes will “randomly” appear until the requisite syntactic structures have “matured” (e.g., Radford, 1990). However, if one actually looks at corpora of children’s early speech, and/or designs experiments to examine when and where grammatical morphemes are actually produced, it quickly becomes apparent that there is nothing random about their use. Rather, the child’s phonological system is driving where and how grammatical morphemes will be most likely to appear (e.g., Gerken, 1996). Some of what determines when a child will tend to use a particular grammatical morpheme is a function of the prosodic context in which it appears. Although most English function words are written with spaces before and after (e.g., I like the bread), these short, unstressed grammatical items tend to prosodify as part of a disyllabic foot in spoken language when the previous word is monosyllabic (e.g., I [like the] bread), though not when the preceding word is disyllabic (e.g., I [packaged] the bread) (e.g., Selkirk, 1996). Thus, learners must learn to parse ‘the’ as the same unit, despite its different phonetic/phonological and rhythmic realizations across different prosodic environments. The phonetic form that a function word or inflectional morpheme takes is thus often the result of phonological processes. In fact, most phonological processes take place at morphological boundaries, either at the end of words, between words, or between a word and a grammatical morpheme. For example, the plural of cat is cat+/s/, but the plural of dog is dog+/z/, and the plural of peach is peach+/əz/. This is due to the phonological processes of voicing assimilation (in the case of cats and dogs), and dissimilation (in the case of peaches), inserting a vowel when two sounds are too similar. Both are common processes in English that also occur in the formation of the present and past tense (e.g., eat+/s/ vs. dig+/z/; kick+/t/ vs. bag+/d/, catch+/əz/, pad+/əd/). How, then, does the learner come to realize that these allomorphs mean the same thing, and to produce them in the appropriate contexts? Addressing the issue of how and when grammatical morphology develops, in both perception/comprehension and production, is thus a complex issue. Traditionally, this has been treated as a syntactic problem, where learners need to learn the syntax of a language in order to know when and how different types of grammatical morphemes are used (e.g., Radford, 1990; Wexler, 1994). However, once one begins to look at the early stages of grammatical morpheme use, which start prior to the age of 2, it is necessary to consider phonological aspects of the child’s grammar as well, and interactions at the phonology/morphology interface. Much research on early language development has focused on infant speech perception, examining the acquisition of phonemic contrasts and lexical form using
How Learners Move From Sound To Morphology 315 perception/comprehension studies (see Werker, 2018, for a recent review). A growing body of research also shows that infants and toddlers are often aware of grammatical morphemes and inflections even before they can reliably produce these in their own speech (e.g., Barrière, Goyet, Kresh, Legendre, and Nazzi, 2016; Davies, Xu Rattanasone, and Demuth, 2017, 2020; Gerken and Macintosh, 1993; Marquis and Shi, 2015; Shi, 2014; Sundara, Demuth, and Kuhl, 2011; Soderstrom, White, Conwell, and Morgan, 2007). This suggests that segmentation must have already taken place, raising questions about when and how such forms come to be produced. There has also been a long tradition of examining children’s language productions using data from both experiments and spontaneous speech (e.g., Berko, 1958; Brown, 1973). The remainder of this chapter reviews more recent research examining children’s use of function words and inflectional morphology in spoken language, suggesting that children’s use of grammatical morphology is determined, in part, by the child’s developing abilities with phonological/prosodic structure. Thus, although one can examine the use of morpho-syntax in older children, at earlier stages of development (e.g., below the age of 3) it is essential to consider the development of grammatical morphology along with the child’s phonological/prosodic development as well.
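The plural allomorphy facts described above can be summarized as a small rule system. The sketch below is a simplified illustration of the regular English pattern only (the segment classes and pseudo-phonemic symbols are my own shorthand, and real analyses are stated over phonological features rather than symbol lists); it is meant to show what the learner must in effect internalize, not how children represent it.

# Regular English plural allomorph selection, keyed to a stem's final segment.
VOICELESS = {"p", "t", "k", "f", "th"}         # voiceless non-sibilant finals
SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}  # sibilant finals

def plural_allomorph(final_segment):
    if final_segment in SIBILANTS:
        return "əz"   # dissimilation: a vowel separates two adjacent sibilant-like sounds
    if final_segment in VOICELESS:
        return "s"    # voicing assimilation to a voiceless final consonant
    return "z"        # elsewhere: voiced consonants and vowels

for stem, final in [("cat", "t"), ("dog", "g"), ("peach", "ch")]:
    print(stem, "+ /" + plural_allomorph(final) + "/")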
15.2 The early use of function words

In contrast to open-class lexical items (where new words can be added to the lexicon through borrowing and/or innovation), function words constitute a small, closed-class set of items that carry grammatical meaning, such as tense/aspect, agreement, number, gender, case, etc. These tend to occur quite frequently, with some kind of function word appearing in most sentences. Words that are frequent tend to undergo processes of shortening (e.g., Zipf, 1935). That is, given that they are somewhat predictable (as in I saw __ dog), and are high in frequency, they tend to have less phonological content. Thus, most function words across languages tend to be monosyllabic, and in a language like English, tend to be unstressed. However, when children begin to speak, they tend to preserve stressed (longer) syllables, such as those found in English content words. This means that short, unstressed function words that cannot be prosodified along with a content word are often omitted in children’s early utterances, or produced as a phonologically reduced prosodic placeholder, such as a “filler syllable” (e.g., a want a milk for ‘I want the milk’) (e.g., Peters, 1983). However, scholars and parents alike have long noted that young children are variable in their use vs. omission of function words. This led researchers like Radford (1990) to suggest that young learners did not yet have the more complex grammatical structures needed to license the use of grammatical function words. However, it had also long been realized that some of this variable use of grammatical morphology was sensitive to the prosodic context in which these function morphemes appear. For example, Demuth (1994) showed that 2-year-olds tend to produce noun class
316 Katherine Demuth prefixes in the Bantu language Sesotho when the nominal stem is monosyllabic (mo- tho ‘person’), but tend to omit the same morpheme when it is prefixed to a nominal stem that is disyllabic ((mo)-sadi ‘woman’). That is, the prefix is produced when the resulting word constitutes a disyllabic foot of phonological structure, but is omitted if the combination would result in a trisyllabic word (i.e., more than a foot): thus, the grammatical morpheme (i.e., the noun class prefix) is omitted, resulting in the ill- formed bare nominal stem -sadi. This type of variable production of the same grammatical morpheme thus seems to be phonologically (or prosodically) conditioned, rather than syntactically conditioned. That is, the prosodic context of a given morpheme (e.g., how it interacts with the syllable and foot structure of the words around it) plays a critical role in determining if young children will actually use that grammatical morpheme in that context. These early findings from Sesotho have been shown to generalize across languages and grammatical morphemes, where prosodic (or rhythmic/metrical) structure is critical for understanding where in a sentence a grammatical morpheme will be most likely to appear in a child’s early speech. We review some of the crosslinguistic findings on children’s variable use of articles below.
15.2.1 Prosodic licensing of function words in English As mentioned above, articles in English can prosodify with the previous word if it is monosyllabic, forming a disyllabic foot, or prosodic word (PW) (e.g., [catch the]). However, if the previous word is already disyllabic (e.g., [catches] the), the article is left “unfooted,” and is more likely to be omitted in children’s early speech. This was shown by Gerken (1996) in a series of elicited imitation experiments with children aged 2;3 years. Some researchers wondered if this was merely an artifact of the task, which might have set up rhythmic expectations. However, this seemed unlikely, given that children this age also truncate unfooted syllables from monomorphemic words like banana to nana, creating a disyllabic foot in this case as well (Demuth, 1996a). Thus, disyllabic (trochaic, Strong-weak (Sw)) feet play an important role in children’s early language development, at least for languages like English. This may be due, in part, to the fact that most words in English begin with a stressed syllable, whether monosyllabic (e.g., cat) or disyllabic (children) (e.g., Cutler and Norris, 1988; Demuth, 1996b). To test the generality of Gerken’s (1996) findings, we therefore examined children’s spontaneous speech productions to see if their variable use of articles could be accounted for by these prosodic context effects outside the laboratory as well. Using longitudinal data from the Providence Corpus (Demuth, Culbertson, and Alter, 2006), which included fortnightly hour-long mother-child recordings for six children from 1– 3 years, all phonemically transcribed with audio files linked (see CHILDES database; MacWhinney, 2000), we found the same patterns as that reported in Gerken (1996), with articles occurring as part of a disyllabic foot being produced at significantly higher rates earlier than those occurring in an unfooted prosodic context (e.g., Mary [likes
How Learners Move From Sound To Morphology 317 the] dog vs. Tommy’s [rolling] __ball) (Demuth and McCullough, 2009). Furthermore, these patterns persisted for several months, from roughly 1;6 years until 2 or so, until the child had acquired the more complex prosodic structures needed to include unfooted function words into their developing prosodic grammar. It thus appears that Gerken’s (1996) findings were not due to task effects, but capture important generalizations about children’s early phonological grammars, and how these develop over time. That is, these constraints on early productions appear to be prosodic, not syntactic. Thus, only once children have access to higher levels of prosodic structure, such as that outlined by the Prosodic Hierarchy (e.g., Nespor and Vogel, 1986; Selkirk, 1984), can they begin to include unfooted syllables in their early speech. We now had strikingly similar evidence of prosodic interactions with early grammatical morpheme use from two typologically and prosodically very different types of languages, the morphologically rich Bantu language Sesotho, and the morphologically more impoverished Germanic language English, including evidence from both experiments and longitudinal spontaneous speech corpora. Furthermore, English prosodifies articles to the left—to the preceding monosyllabic word when possible— whereas Sesotho prosodifies noun class prefixes to the right (to the following lexical item), suggesting that these findings might generalize to other languages as well. This then raised questions about the extent to which one might be able to make predictions, given the prosodic structure of another language, about the course of grammatical morpheme development, and in which prosodic contexts grammatical morphemes might be most likely to first appear in children’s speech.
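The footedness logic behind these findings can be made concrete with a small sketch. This is purely illustrative and is not the coding scheme used in the studies cited above: the syllable counts and example items are supplied by hand, and the binary classification stands in for the graded production rates reported in the corpus work.

```python
# Illustrative sketch of prosodic licensing for English articles: an article
# can be footed with a preceding monosyllabic word ([catch the]), but is left
# unfooted after a disyllabic word ([catches] the). Syllable counts here are
# hand-supplied; a real analysis would work over syllabified transcripts.

def article_status(preceding_syllables: int) -> str:
    """Classify an article by whether it can join the preceding word in a foot."""
    return "footed" if preceding_syllables == 1 else "unfooted"

examples = [("catch", 1), ("likes", 1), ("catches", 2), ("rolling", 2)]

for word, syllables in examples:
    status = article_status(syllables)
    prediction = "produced early" if status == "footed" else "often omitted at first"
    print(f"[{word} the]: {status} -> {prediction}")
```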
15.2.2 Prosodic licensing of function words in French, Spanish, and Italian To address the above question, we also collected the Lyon Corpus (Demuth and Tremblay, 2008), a French corpus of longitudinal acquisition data that paralleled the English/Providence Corpus (also archived as part of the CHILDES database; MacWhinney, 2000). French is a Romance language, with phrase-final lengthening. It is thus prosodically quite different from both English and Sesotho: English has penultimate lexical stress (cat, table, tomato), whereas Sesotho has penultimate phrasal lengthening (mo-tho 'person', mo-sadi 'woman'). Both result in Sw disyllabic feet. In contrast, French has phrase-final lengthening (chapeau 'hat'), resulting in an iambic weak-Strong (wS) prosodic structure. This raised many questions about how French-speaking children might produce their early articles, and if these would also be prosodically conditioned in some way. We hypothesized that, even though disyllabic feet are not thought to play a central role in French phonology (e.g., Scullen, 1997), French-speaking children might initially produce articles with monosyllabic lexical items, resulting in a disyllabic (iambic) foot, and then gradually begin to use articles with disyllabic words, slowly beginning to produce articles that fall outside the foot.
Our hypothesis was confirmed: in an examination of two children from the Lyon Corpus below the age of 2, we found similar patterns to that found in English, with articles (and other determiners) more likely to be produced when these occurred as part of a disyllabic prosodic word (e.g., le chat 'the cat') compared to a trisyllabic prosodic word (e.g., la couronne 'the crown') (Demuth and Tremblay, 2008). Note that in French, as in Sesotho, the grammatical morpheme prosodifies to the right, with the following word. Thus, when the following French lexical item was already disyllabic, the determiner was less likely to be produced until after the age of 2 (see Bassano, Maillochon, and Mottet, 2008 for similar findings). Furthermore, as found in English, many of these early determiners took the form of a "filler syllable" (e.g., Veneziano and Sinclair, 2000). But both English and French have many monosyllabic words in the input children hear and use (80% and 45%, respectively) (Roark and Demuth, 2000). In contrast, languages like Spanish and Italian have only a few monosyllables (i.e., forms like sí 'yes' and no! 'no'), but mostly longer disyllabic, trisyllabic, and quadrisyllabic words (Roark and Demuth, 2000). What would the course of article acquisition look like in those languages? Perhaps the use of articles would be even more delayed, awaiting the acquisition of even more complex prosodic structures? An examination of Spanish acquisition data suggests that this is not necessarily the case. Rather, in a language where there is a greater abundance of multisyllabic words, children below the age of 2 tend to produce more complex prosodic word structures earlier than in English. Thus, even though some Spanish-speaking children may truncate early words like muñeca 'doll' to meca around 1;6–1;8 years (Gennari and Demuth, 1997), they quickly begin to incorporate articles, producing trisyllabic prosodic words like ameca for four-syllable targets like la muñeca 'the doll' before the age of 2. That is, in languages like Spanish, children quickly start producing larger, multisyllabic prosodic words, and incorporating articles, even at the cost of omitting segmental material from the lexical item itself. This is further illustrated in the longitudinal study of two Spanish-speaking children (Demuth, Patrolia, Song, and Masapollo, 2012). One child, a girl, clearly showed this pattern, very quickly producing both articles and prosodic words of more than a disyllabic foot at 1;8–1;9 years (a beza (la cabeza) 'the head'; dinero 'money'). The second child, a boy, showed a slightly different pattern, with no word truncation, producing trisyllabic words like caballo 'horse' at the age of 1;10, but only beginning to produce articles once he was 2;1 (un besito 'a little kiss'). There are two possible explanations for the boy's later development of articles. The first is that he did not have the syntax or semantics required for the use of articles before the age of 2, and that their first appearance then came after his lexical items were prosodically well formed, an indication of access to more complex prosodic structure. The second possible explanation for this different pattern of development is that he did have the syntax and semantics needed to produce articles earlier, but was unwilling to truncate lexical items to incorporate the article into his still expanding prosodic system.
Although we suspect that the explanation for his behavior is probably the former (i.e., a boy with a later developing grammar), further study of more Spanish-speaking children would help shed light on this issue to determine if the predominant pattern is similar
How Learners Move From Sound To Morphology 319 to that of the (possibly precocious) girls who showed earlier use of articles in Gennari and Demuth (1997) and in Demuth et al. (2012), or not. This type of individual variation is thus very useful for better understanding the nature of the acquisition process. Critically, however, both children showed earlier acquisition of three-syllable prosodic words than is typically found in English, where words such as banana are often truncated until around 2;6 years, and unfooted articles are often omitted. Findings similar to those of the Spanish-speaking girls above have also been reported for an Italian-speaking girl of the same age (Giusti and Gozzi, 2006). As Italian is another language with many polysyllabic words, similar in this respect to Spanish, we predicted similar results. In sum, it appears that there are constraints on the shape of children’s early prosodic words, and that this interacts with their use of grammatical morphology. Language- specific prosodic characteristics of both the function words themselves and of the ambient lexicon combine to influence how and when children begin to incorporate grammatical morphology into their early speech. Those function words that can be prosodically licensed, for example, as part of a disyllabic foot in English, Sesotho, and French, tend to be produced earlier than those that are not. For languages that have larger, more prosodically complex words, as in the case of Spanish and Italian, the size of prosodic words quickly becomes larger than just a disyllabic foot, with the rapid incorporation of articles to form a wSw prosodic word, sometimes at the cost of omitting segmental material (syllables) from the lexical item itself. The Prosodic Licensing Hypothesis (Demuth, 2014) thus provides a framework for exploring these issues crosslinguistically, and for making predictions about how and when which grammatical morphemes, in which prosodic contexts, will be likely to be produced first. It also helps provide a means for better understanding individual variation as children’s lexical and grammatical development proceeds. These early prosodic constraints do not last forever. Around 2;6–3 years they begin to disappear, at least in short sentences. This appears to be due, in part, to increasing access to more complex/higher level prosodic structures, at the level of the phonological phrase, intonational phrase, and utterance, with much research still to be done at these higher levels of structure. Persistent variable use of grammatical morphemes later in development may arise in syntactically more complex constructions (presumably with greater demand on cognitive processing/working memory; see Valian, 1991), and/or other learning challenges (e.g., hearing loss: Titterington, Henry, Krämer, Toner, and Stevenson, 2006). However, some inflectional morphemes are still being acquired until the early school years, even in typically developing children. We discuss these below.
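As a rough illustration of how such crosslinguistic predictions might be stated, the sketch below encodes two simplified assumptions drawn from the discussion above: the direction in which the grammatical morpheme prosodifies (leftward in English, rightward in French and Sesotho), and whether the morpheme plus a monosyllabic host fits within a disyllabic foot. The items and the binary prediction are simplifications for illustration only, not the Prosodic Licensing Hypothesis in full.

```python
# A toy statement of the prosodic-licensing prediction sketched in the text:
# a grammatical morpheme is expected to appear early when it can be footed
# with a monosyllabic host word; the direction of prosodification determines
# which neighbouring word serves as the host. Purely illustrative.

DIRECTION = {
    "English": "left",    # article leans on the preceding word: [catch the]
    "French": "right",    # article leans on the following word: [le chat]
    "Sesotho": "right",   # noun class prefix leans on the stem: [mo-tho]
}

def predicted_early(host_syllables: int) -> bool:
    # morpheme + monosyllabic host = disyllabic foot
    return host_syllables == 1

cases = [
    ("English", "catch", 1), ("English", "catches", 2),
    ("French", "chat", 1), ("French", "couronne", 2),
    ("Sesotho", "tho", 1), ("Sesotho", "sadi", 2),
]

for lang, host, syllables in cases:
    outcome = "early production" if predicted_early(syllables) else "omission likely at first"
    print(f"{lang}: host '{host}' ({syllables} syll., prosodified {DIRECTION[lang]}ward) -> {outcome}")
```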
15.3 The acquisition of inflectional morphology In the above, we focused on function words that tend to constitute a syllable, such as articles. Although these forms are unstressed, and perhaps less perceptually salient than
320 Katherine Demuth stressed syllables, the general consensus has been that they are perceived, but simply omitted in early speech, especially when they do not occur as part of a rhythmic unit such as a disyllabic foot (e.g., Gerken and Macintosh, 1993; Gerken, 1996). But what about grammatical morphemes that are smaller than a syllable, that is, only a segment? This is the case in many nominal and verbal inflections in a language like English, such as the plural (cat+s) or the present/third person singular subject-verb agreement (she eat+s). Brown (1973) was one of the first to document children’s longitudinal development of grammatical morphemes in the corpus data of Adam, Eve, and Sarah. He found that, although there were individual differences in the timing of acquisition (with some children more precocious than others), all had the same order of acquisition, with certain morphemes acquired before others. In particular, he showed that progressive –ing was the earliest inflectional morpheme acquired, followed then by the plural, and later by the present (third person singular) and past tense, with the segmental forms (e.g., cat+s) generally being produced before their syllabic counterparts (e.g., peach+es). This gradual acquisition (and variable omission) of inflectional morphemes raised many of the same issues that had been noted for articles above: why was use of these forms variable for a given child, and what accounted for this variable use? As in the case of the articles, we again presumed that inflectional morpheme use would not be random but would be systematic in some respect. Although the overall course of development identified by Brown (1973) might be influenced by semantics, the variable use of a given morpheme (and its allomorphic variants) must surely be governed by other factors. Why, for example, would segmental allomorphs of the plural, as in cat+s, be acquired so much earlier than the syllabic allomorph as in peach+es? Surely, this was not determined by either the syntax or the semantics. Perhaps aspects of phonology played a role.
15.3.1 Prosodic context effects on the production of inflectional morphology The discussion so far has dealt with grammatical morphemes that precede a noun. In contrast, many inflectional morphemes may suffix to a verb or noun. Here too, there appear to be phonological effects on comprehension and production. For example, Marshall and van der Lely (2007) found that older children with specific language impairment (SLI) (now known as developmental language disorder (DLD)) had more problems producing the past tense morpheme when the verb ended in a complex consonant cluster (e.g., danced) than when it occurred in a less complex singleton (e.g., cried). Drawing on this observation, we wondered if typically developing toddlers would exhibit the same patterns during the acquisition of the present tense/third person singular. Again, we examined longitudinal data from the Providence Corpus, and found that this was indeed the case: 2-year-olds were much more likely to produce third person singular -s when it occurred as a single coda consonant (e.g., see+s) compared to when it occurred as part
How Learners Move From Sound To Morphology 321 of a consonant cluster (e.g., cook+s) (Song, Sundara, and Demuth, 2009). However, Song et al. also found that these children were much more likely to produce these inflectional morphemes when the verb occurred utterance finally, rather than utterance medially. Given that English utterance-final syllables tend to be lengthened—a process known as “phrase-final lengthening” (Lehiste, 1973)—it was thought that the increased duration/time at the end of the utterance might facilitate production of these segmental morphemes. In contrast, utterance medially, one had to focus quickly on planning the production of the next word: perhaps there was not enough time for young children to produce all segments/morphemes under such utterance medial conditions, and/or the morpheme itself would be masked by the following word, and thus harder to perceive. Indeed, in both the corpus data of children’s spontaneous speech between the ages of 1;6–3 years, as well as in our follow-up elicited imitation tasks with children up to the age of 2;3 years, children were more likely to produce third person singular -s when it occurred utterance finally (Song et al., 2009). It therefore appears that both simpler phonotactic (syllable) structure and the increased duration of the morpheme utterance finally enhanced the likelihood of the morpheme actually being produced.
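The kind of tabulation behind these findings can be sketched as follows. The coded tokens below are invented; in the actual analyses, each obligatory third person singular context in the transcripts is coded for coda complexity (singleton vs. cluster), utterance position (medial vs. final), and whether -s was in fact produced.

```python
# Sketch of tabulating 3rd-person-singular -s production by prosodic context.
# The token records are hypothetical and only illustrate the bookkeeping.

from collections import defaultdict

tokens = [
    # (coda type, utterance position, -s produced?)
    ("singleton", "final", True), ("singleton", "final", True),
    ("singleton", "medial", True), ("singleton", "medial", False),
    ("cluster", "final", True), ("cluster", "final", False),
    ("cluster", "medial", False), ("cluster", "medial", False),
]

counts = defaultdict(lambda: [0, 0])  # (coda, position) -> [produced, total]
for coda, position, produced in tokens:
    cell = counts[(coda, position)]
    cell[1] += 1
    if produced:
        cell[0] += 1

for (coda, position), (produced, total) in sorted(counts.items()):
    print(f"{coda:9s} {position:6s}: {produced}/{total} = {produced/total:.0%} -s produced")
```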
15.3.2 Prosodic context effects in perception If utterance medial inflectional morphemes are shorter in duration than those that occur utterance finally, this raised the possibility that they might also be less perceptually salient. If so, one might expect that utterance medial inflectional morphemes would not be as easily detected when missing in a listening task. Conversely, such morphemes should also be more perceptually salient when occurring utterance finally, where they are longer in duration. We therefore tested this expectation with both 22-and 27-month- olds, comparing listening times to grammatical vs. ungrammatical sentences, where the verb occurred in either utterance medial or utterance-final position (e.g., She *cry/cries now vs. Now she *cry/cries). We predicted that the ungrammatical (uninflected) form would be harder to perceive utterance medially, and this was indeed the case; children in both age groups showed looking time differences between the grammatical vs. ungrammatical forms utterance finally, but not utterance medially. Furthermore, there was a correlation between the same children’s perception and production behavior: those few children who showed perceptual sensitivity to the presence vs. absence of third person singular -s utterance medially also exhibited better production of the morpheme utterance medially (Sundara, Demuth, and Kuhl, 2011). This finding suggests a much tighter connection between the perception and production abilities of young children than is typically assumed, calling for much more research on the link between perception and production in language development in the same children. Researchers tend to look at either issues of perception or production, but information about both is critical for building a comprehensive model of how language is acquired. Finding the same patterns in both perception and production in the same children on the same day also provides converging evidence for these results, replicating
the finding of the phrase-final advantage for young learners. We call this phrase-final advantage the positional effect—noting that it is a type of Prosodic Licensing. Thus, inflectional morphemes occurring utterance finally are also prosodically licensed, occurring in a position where they are more likely to be both perceived and produced. Once again, this appears to account for much of the variable appearance of inflectional morphemes in children's early productions, lasting until around the age of 4 or so in children's short/simple spontaneous utterances.
15.3.3 Generalization of prosodic context effects to other inflectional morphemes In the above we have discussed prosodic context effects on the perception and production of third person singular -s. This raises the possibility that similar effects might be found for other inflectional morphemes. We therefore looked at the plural, an earlier acquired inflectional morpheme, and found similar positional effects: 2-year-olds were again much more likely to produce plural morphemes utterance finally compared to utterance medially (Theodore, Demuth, and Shattuck-Hufnagel, 2015). One should note, however, that not all English plural morphemes are /-s/. English actually has three plural allomorphs, the segmental allomorphs /-s/ and /-z/, and the syllabic allomorph /-əz/ (e.g., cat+s, dog+z, bus+es). Although the segmental plurals are acquired around the age of 2 in both perception (Davies, Xu Rattanasone, and Demuth, 2017) and production (Brown, 1973), the syllabic plural is only understood by the age of 3 or later (Davies, Xu Rattanasone, and Demuth, 2020; Davies, Xu Rattanasone, Schembri, and Demuth, 2019), and is delayed in the production of novel words until as late as 7 years, as shown in Berko's (1958) ground-breaking wug task. Young children also exhibit challenges with the use of plural morphemes utterance medially (Mealings, Cox, and Demuth, 2013), and in longer utterances (Mealings and Demuth, 2014). One might expect the syllabic plural (e.g., bus+es) to be more perceptually salient than segmental inflections (cat+s), and therefore acquired earlier. Yet the addition of this extra syllable appears to come at a cost. This could be due to (1) the addition of another (unstressed) syllable with concomitant fricative articulatory challenges (Mealings et al., 2013); (2) a mis-parsing of the sibilant-final form of the singular as already being plural (e.g., bus, rose) (Berko, 1958); and/or (3) the low frequency of the syllabic plural in the input children hear (only 5% of plural types/tokens) (Davies et al., 2017). Brown (1973) found that there is a similar delay in the use of these syllabic allomorphs in verbal tense inflections as well (e.g., third person singular (e.g., wash+es), and past tense (e.g., wait+ed)). Although the acquisition of all these inflectional morphemes is delayed in the speech of children with SLI/DLD, the use of the syllabic morphemes is particularly challenging, in both comprehension and production (e.g., Tomas, Demuth, and Petocz, 2017).
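The conditioning of the three plural allomorphs just mentioned can be stated as a small rule over the stem-final sound. The phoneme classes below are simplified and the transcriptions rough, but the logic is the standard one: sibilant-final stems take syllabic /-əz/, other voiceless-final stems take /-s/, and the rest take /-z/.

```python
# Selection of the English plural allomorph from the final sound of the stem.
# Phoneme sets are abbreviated for illustration.

SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(final_sound: str) -> str:
    if final_sound in SIBILANTS:
        return "/-əz/"   # bus -> buses, rose -> roses (syllabic)
    if final_sound in VOICELESS:
        return "/-s/"    # cat -> cats
    return "/-z/"        # dog -> dogs (voiced finals and vowels)

for word, final in [("cat", "t"), ("dog", "g"), ("bus", "s"), ("peach", "tʃ")]:
    print(word, "+ PLURAL ->", plural_allomorph(final))
```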
Thus, prosodic context effects include not only positional effects within the utterance, but also phonotactics—the way in which a given inflectional morpheme or allomorph is phonologically instantiated in terms of syllable/word structure. This in turn can influence the timing of acquisition and the likelihood with which it will be used or omitted. However, this can also interact with how often a particular morpheme or allomorph is actually encountered, with later acquisition of those inflectional morphemes that the child hears less often. Teasing apart the contributions of frequency versus phonotactics is complicated and may require crosslinguistic evidence to understand more fully.
15.3.4 Generalization of prosodic context effects to other languages Thus, just as language-specific differences are found in the timing of function word acquisition, the prosodic shape of words, syllable-structure phonotactics, whether a morpheme occurs in a stressed syllable or not, and how frequently it is encountered may all play a role in when an inflectional morpheme will be well comprehended and/or produced. Thus, in Spanish, which has fewer coda consonants than English, the acquisition of coda consonants, including plural -s, is reported to appear only around the age of 2;3—later than in languages like English that have a higher frequency of word-final consonants and clusters (Lleó, 2003). Of course, Spanish has many more multisyllabic words than English, with predominantly penultimate stress. This means that most Spanish plural morphemes also appear in an unstressed syllable, whereas the predominance of monosyllables in English means that the segmental plurals occur in a stressed syllable (e.g., muñeca+s vs. doll+s). Just like phrase-final syllables, stressed syllables are longer and acoustically more prominent, making them easier to perceive and produce. This may also contribute to the earlier use of the segmental plural in children's English compared to Spanish (Lleó, 2003). The predominant word order of a language may also influence how and when different inflectional morphemes are acquired, both within and between languages. English is a predominantly subject-verb-object (SVO) language. This means that verbs, and thus third person singular -s (and the past tense), are most likely to occur utterance medially, whereas nouns, and therefore plural morphemes, are most likely to occur utterance finally. Thus, in addition to being more frequent than verbal morphology, plural morphology may be more perceptually salient as a function of occurring more often at the end of an utterance in English, facilitating earlier acquisition (e.g., Hsieh, Leonard, and Swanson, 1999). It has long been known that children tend to preserve stressed syllables in their early productions, with unstressed syllables that are also unfooted being more likely to be omitted. Thus, the first syllable of a word like banana tends to be omitted, whereas the last syllable is preserved (see Demuth, 1996a). However, some inflectional morphemes may actually be stressed or occur in a prominent/enhanced syllable, depending on the structure
324 Katherine Demuth of the language. We have already mentioned above that this is the case with noun class prefixes in Sesotho, where the penultimate syllable is lengthened, probably playing a role in the inclusion of noun class prefixes as part of a disyllabic foot (e.g., mo-tho ‘person’). It is therefore of significant interest to find that stressed inflectional morphemes at the ends of morphologically complex verbs in Quiché Mayan are the forms that tend to be preserved in children’s early speech (Pye, 1980), and that the stressed (but not the unstressed) determiners in Swedish are the most likely to appear in children’s early speech (Peters and Strömqvist, 1996). This suggests that many of the early patterns of morpheme inclusion or omission in children’s early speech may be due not only to interactions with language-specific prosodic word structures, positional effects, and phonotactics but also to issues of lexical/phrasal stress and/or prominence. All of these issues need to be taken into account in a more explanatory model of early grammatical morpheme acquisition, helping to explain which morphemes will tend to be produced first, and in which prosodic contexts. In sum, the Prosodic Licensing Hypothesis (Demuth, 2014) can be used to make predictions about where and when inflectional morphemes may also be most likely to appear in children’s early speech. But this cannot be done in a vacuum. Rather, an excellent understanding of the prosodic phonology of a language, as well as the distribution of different types of morphemes, allomorphs, and lexical words, is all needed to better understand the course of early morphological development, and how this is predicted to vary across languages (see Lleó and Demuth, 1999 for further discussion).
15.4 Word (and morpheme) segmentation As noted in Section 15.1, one of the challenges for the language learner is to parse the speech stream into words (see Morgan and Demuth, 1996 for review). Thus, all of the above discussion assumes that the child has already figured out what constitutes a lexical item/content word, and what is a function word/inflectional morpheme. Much research has been devoted to this area of inquiry, involving studies of infant speech perception (Soderstrom, White, Conwell, and Morgan, 2007), artificial grammar learning (Newport, 2016), and computational modeling (Johnson, Christophe, Demuth, and Dupoux, 2014). Since lexical items/content words are an open class, there are many of them, with new words being learned every day. In contrast, function words and inflectional morphemes constitute a closed class: there are only a few (compared to lexical items), but they occur quite frequently. These therefore might be thought to be easier to learn, and earlier acquired. Recall that many infant speech perception studies have shown some early sensitivity to function words and inflectional morphemes, even if the full meaning is not yet comprehended (e.g., Barrière, Goyet, Kresh, Legendre, and Nazzi, 2016; Davies, Xu Rattanasone, and Demuth, 2017; Gerken and Macintosh, 1993; Marquis
and Shi, 2015; Shi, 2014; Sundara, Demuth, and Kuhl, 2011; Soderstrom, White, Conwell, and Morgan, 2007). Interestingly, computational modeling suggests that learners solve these problems together, segmenting lexical from functional material at the same time (e.g., Johnson et al., 2014). This appears to happen quite early in the acquisition process, with 2-year-olds already making articulatory distinctions between the production of the word-final monomorphemic vs. bimorphemic consonant cluster /ks/ (e.g., box vs. rocks) (Song, Demuth, Shattuck-Hufnagel, and Ménard, 2013). Thus, learners are parsing at least some grammatical morphemes quite early, including some that might otherwise appear to be ambiguous. This helps to explain why children often produce the bare/uninflected forms that they do. That is, without having segmented the morphology from the surrounding prosodic structure, children would not variably produce the grammatical morphemes that have been discussed above. This attests to the fact that children are skilled learners who know much about language, even when their speech productions are not yet like that of adults. Future computational modeling of this process, combined with experimental data, will, hopefully, shed further light on how this is actually achieved.
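The segmentation problem itself can be illustrated with a deliberately simple heuristic. The models cited above (e.g., Johnson et al., 2014) are far more sophisticated, jointly inferring lexical and functional material; the toy sketch below only shows the basic idea that boundaries can be posited where transitions between adjacent units are unpredictable. The miniature corpus and the 0.5 threshold are invented for illustration.

```python
# Toy segmentation heuristic: posit a word boundary where the forward
# transitional probability between adjacent syllables is low. This is only a
# demonstration of the problem, not the cited Bayesian models.

from collections import Counter

utterances = [["the", "dog", "gy"], ["the", "kit", "ty"], ["the", "ba", "by"],
              ["a", "dog", "gy"], ["a", "ba", "by"], ["the", "kit", "ty"]]

bigrams, unigrams = Counter(), Counter()
for utt in utterances:
    unigrams.update(utt)
    bigrams.update(zip(utt, utt[1:]))

def transition_prob(a: str, b: str) -> float:
    return bigrams[(a, b)] / unigrams[a]

test = ["the", "dog", "gy"]
for a, b in zip(test, test[1:]):
    tp = transition_prob(a, b)
    print(f"{a} -> {b}: TP = {tp:.2f}", "boundary" if tp < 0.5 else "cohere")
```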
15.5 Discussion This chapter has provided an overview of some of the early processes involved in moving from sound to morphology in children’s early language. Much of this takes place in segmentation and perception/comprehension before the age of 2—suggesting at least some awareness of the relevant syntax and semantics by then. Converging evidence from a range of perception, comprehension, production, and computational studies suggests that this aspect of language learning is robust, permitting the acquisition of a wide range of typologically diverse languages. Language-specific differences in the rates of acquisition will vary. But understanding how and why language learning proceeds as it does can benefit from taking a less modular approach, allowing for interactions between different levels of structure. This does not necessarily mean that all aspects of grammatical morphology are learned by an early age: just as the lexicon continues to grow and develop, with subtle aspects of meaning still to be learned, so too the ability to “put morphology to use” during both the production of novel words (e.g., Berko, 1958) and sentence processing (Trueswell, Sekerina, Hill, and Logrip, 1999) may be an ongoing process. Learning to speak fluently also takes time, and can be disrupted not only by the more commonly known issues such as lexical frequency and phonotactic probabilities (e.g., Storkel, 2011) but also by issues of phonological and prosodic context, especially before the age of 3. These processes may be even further delayed in children with language learning challenges, such as those with SLI/DLD (Deevy, Leonard, and Marchman, 2017; Tomas et al., 2017) or hearing loss (Koehlinger, Owen Van Horne, Oleson, McCreery, and Moeller, 2015; Davies, Xu Rattanasone, Davis, and Demuth, 2020). All must develop
a mental lexicon that allows for easy lexical access, putting words (and their associated morphemes) to use in a productive manner during both language processing and production. Much is still not understood about the processes that underlie this human feat. It is hoped that this chapter will provide the basis for future interdisciplinary and crosslinguistic research in this exciting field.
Chapter 16
Systematicity and Arbitrariness in Language: Saussurean rhapsody
Charles Yang*
In fact, the whole system of language is based on the irrational principle of the arbitrariness of the sign, which would lead to the worst sort of complication if applied without restriction. But the mind contrives to introduce a principle of order and regularity into certain parts of the mass of signs, and this is the role of relative motivation.... Since the mechanism is but a partial correction of a system that is by nature chaotic, however, we adopt the viewpoint imposed by the very nature of language and study it as it limits arbitrariness. Ferdinand de Saussure, Cours de linguistique générale (CLG; Saussure, 1916, p. 133)
16.1 Beyond arbitrariness The original title of this chapter was Productivity and the Lexicon but that would probably strike one as an oxymoron. The lexicon, at least in the tradition widely recognized as Saussurean, is a depository of arbitrariness, the very opposite of productivity. As Di Sciullo and Williams (1987, p. 3) vividly put it, the lexicon is "like a prison—it contains only the lawless, and the only thing that its inmates have in common is lawlessness."
* I thank Bob Berwick, Lila Gleitman, Annika Heuser, Barbara Lust, John Trueswell, Hector Vazquez, Virginia Valian, and two anonymous reviewers for helpful discussions and comments.
328 Charles Yang Strictly speaking, of course, the lexicon is not always arbitrary. As Saussure duly noted, along with some of the most prominent contributors to linguistics from Jespersen (1922) to Bloomfield (1933), and from Harris (1951) to Chomsky (1957), sound symbolism can be observed to varying degrees in all languages. For example, English words with the initial consonant cluster /gl/strongly convey the meaning of light, vision, and brightness—glitter, gleam, glory, etc.—which was readily confirmed when I met a co- editor of the present volume over 20 years ago.1 But the more serious problem with the principle of arbitrariness, especially when over-emphasized, is that it lends to an incomplete reading of Saussure. As he made abundantly clear throughout CLG, the principle of arbitrariness, an observation that borders on banality, is important not so much as a fundamental property of language, but as a fundamental property of language for language to overcome. On my (whig-history) reading, Saussure advocated a radical, and deeply psychological, perspective on words, language, and cognition. Psychologically our thought—apart from its expression in words—is only a shapeless and indistinct mass. Philosophers and linguists have always agreed in recognizing that without the help of signs we would be unable to make a clear-cut, consistent distinction between two ideas. Without language, thought is a vague, unchartered nebula. (CLG, pp. 111–112)
A principle of order and regularity, as quoted in the epigram, serves to limit such arbitrariness. As a point of example, Saussure contrasted the French numerals vingt '20' and dix-neuf '19' (CLG, p. 131). The former is purely idiosyncratic and the sound-meaning pair is arbitrary. By contrast, the latter clearly has a compositional structure as both dix and neuf recur in other numerals such as dix-huit '18' and trente-neuf '39'. The "relative motivation" between dix-neuf and dix-huit/trente-neuf is referred to by Saussure as "associative." It is nevertheless clear, especially from many other examples in Saussure's discussion that primarily draws from English, French, and Latin morphology (e.g., enseignement-enseigner-enseignons, painful-delightful-frightful), that such associative relations refer to higher-order rules and abstractions over words. "Elements of sound and thought combine," language makes the infinite use of finite means, the Humboldtian notion brought to modern prominence through generative grammar and Chomsky. This chapter is a discussion of productivity on the Saussurean theme: How the formal systematicity enables language to go beyond and overcome the arbitrariness of the lexicon. A special focus will be placed on learning: to go beyond arbitrariness, 1
There are 19 such words (more precisely, stems) with non-negligible frequencies: glad, glass, glove, glory, glue, glow, globe, glance, glimpse, glamour, glitter, glitch, glare, gloat, gland, gloss, gloom, glee, and glum, and I count 5, underlined, as not supporting the noted semantic field. For what it’s worth, 13 items out of 19 just about support a generalization according to the Tolerance Principle; see Section 16.3.2. Of course, such a sound-meaning mapping, if genuine, can at best guide the English speaker into the right semantic ballpark when encountering a novel word gleit: the residual, and arbitrary details still have to be filled out.
Systematicity and Arbitrariness In Language 329 children must discover, and subsequently generalize, the language specific regularities that reside in their finite linguistic experience. Like Saussure, our discussion will mostly focus on morphology or word formation; see also Demuth (this volume). However, I will also suggest that an appropriate theory of productivity should be applicable to form-meaning mappings on all linguistic levels, including sound symbolism alluded to earlier (fn. 1) as well as syntax-semantics correspondences, with potentially interesting implications for how language acquisition shapes children’s conceptual development. In Section 16.2, I review some widely used behavioral tests of productivity, along with the associated methodological complications that may give rise to conflicting findings. In Section 16.3, I provide a cross-linguistic survey of productivity in children’s morphology. The evidence is unequivocal: productivity is a categorical phenomenon, as traditionally held, despite an assortment of pleas for gradience in the recent literature. In Section 16.4, I discuss how productivity may be detected by the child learner in a psychologically plausible setting. The developmental evidence suggests that an appropriate learning model must be similarly categorical in nature. In Section 16.5, I offer some initial thoughts on how a learning-theoretic approach to productivity impacts the theory of language and cognition. What can be effectively learned from data needn’t be built into Universal Grammar, and a learning model that gives rise to qualitative changes in the child’s grammar may also help understand how language may play a causal role in children’s conceptual development. In what follows, I will assume that the child has acquired a reasonable vocabulary so that a formal system of productivity can be established; see Gleitman and Trueswell (this volume). These two processes of acquisition are logically independent but are clearly intertwined. For example, Brown’s seminal study (1957) shows that the form in which a word is used—to sib, a sib, sibbing—can guide children to focus on special aspects of word meanings. Similarly, according to the theory of syntactic bootstrapping (Gleitman, 1990; Lidz, this volume), children exploit syntactic regularities such as word order and case marking, which must be derived from their existing vocabulary, to deduce the semantic properties of novel words. Even an incomplete grasp of productivity can contribute to the acquisition of vocabulary. For instance, if children have learned that determiners and nouns productively combine, they may infer that a novel word following a determiner is likely a noun (Shi and Melançon, 2010; see Dye, Kedar, and Lust, 2019 for a recent review). The acquisition of the initial vocabulary appears slow, which in part has to do with the very high degree of referential ambiguity in the linguistic input (Chomsky, 1959; Quine, 1960; Landau and Gleitman, 1985; Gillette, Gleitman, Gleitman, and Lederer, 1999; Trueswell, Medina, Hafri, and Gleitman, 2016) and the considerable computational challenges it poses (Siskind, 1996; Yu and Smith, 2007; Frank, Goodman, and Tenenbaum, 2009; Fazly, Alishahi, and Stevenson, 2010; Stevens, Gleitman, Trueswell, and Yang, 2017). 
The rapid rise in children’s vocabulary during later stages is likely a consequence of having established a formal system of productivity (e.g., Bates, Bretherton, and Snyder, 1988; Fenson, Dale, Reznick et al., 1994; Anisfeld, Rosenberg, Hoberman, and Gasparini, 1998; Hoff and Naigles, 2002; Gleitman, Cassidy, Nappa, Papafragou, and Trueswell, 2005).
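Before moving on, the tally in footnote 1 can be checked numerically. The sketch below assumes the threshold formula theta_N = N / ln N associated with Yang's Tolerance Principle (a generalization over N items survives if its exceptions do not exceed this threshold) and takes the footnote's count of 13 supporting items out of 19 at face value.

```python
# Numerical check of footnote 1's Tolerance Principle remark, assuming the
# threshold theta_N = N / ln(N): exceptions at or below the threshold leave
# the generalization intact.

from math import log

N = 19                 # gl- stems with non-negligible frequency
supporting = 13        # items counted as fitting the light/vision reading
exceptions = N - supporting

theta = N / log(N)
print(f"theta_{N} = {theta:.2f}, exceptions = {exceptions}")
print("generalization tolerated" if exceptions <= theta else "generalization fails")
```

With 6 exceptions against a threshold of roughly 6.45, the gl- pattern indeed "just about" survives, as the footnote puts it.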
16.2 Tests for productivity The simplest assessment of productivity is the celebrated Wug test as it directly measures the linguistic capacity to generalize. In a landmark study, Berko (1958) introduced young children (age range 4–7) to novel words including nouns, verbs, and other categories, and recorded their responses. (1)
This is a WUG. Here is another one. These are two ___. WUGS.
It is important to note that children are far from perfect in their Wug performance (Berko, 1958, p. 160). For instance, in the regular past inflection of novel verbs, first graders successfully produced rick-ricked and spow-spowed in as few as 25% of the test cases. But we must remember that most if not all English-learning children acquire the -d rule by three (Marcus, Pinker, and Ullman, 1992), as indicated by their spontaneous use of over-regularized forms (e.g., fall-falled). Thus there seems to be a gap of at least three years between acquiring the regular past tense and using it in a specific experimental design (e.g., Huttenlocher, 1964; Anisfeld and Tucker, 1967): failure to pass the Wug test does not imply imperfect knowledge of morphology. The picture gets murkier for words that do not follow productive rules, with the English irregular verbs as the paradigm example. Many irregular verbs are residue of historically productive processes; as such, some may exhibit a certain degree of similarity. For instance, some irregular verbs ending in /ɪŋ/form past tense by changing the vowel to /æ/(e.g., sing-sang, ring-rang) or /ʌ/ (e.g., swing-swung, cling-clung). These similarities are at best partial: for instance, wing, ping, ding, and so forth, actually take -d, and there are irregular verbs that have nothing in common yet undergo the same process to form past tense: bring, buy, catch, fight, teach, think, and seek all replace the rime with /ɔt/.2 Nevertheless, these similarities are the primary motivation for analogical account of morphology and associative accounts of morphological acquisition (Rumelhart and McClelland, 1986; Pinker and Prince, 1988). Children, however, are oblivious to the pull of analogy. The pattern is already clear in the original Wug study although it has been overshadowed by the better-known result on productive rules. Berko found that when children failed to produce spowed for spow, 2 The English irregular past tense formation has been referred to as product-oriented schema (e.g., Bybee and Slobin, 1982) as the verbs are characterized by undergoing the same structural change: which verb can undergo the change must be lexically learned (Yang, 2002). This is to be contrasted with source- oriented schema where the structural change automatically applies if some structural condition is met, for example, -d applies to a word if it is a verb. We will retain the more traditional non-productive vs. productive distinction in the current discussion.
they produced no response at all rather than, say, spew, which would follow the analogy of know-knew, grow-grew, blow-blew, etc. In addition, Berko systematically investigated the role of analogy with very irregular-like stimuli. Children were presented with novel verbs such as bing and gling that are strongly similar to existing irregular verbs (sing-sang, sting-stung, etc.), and thus with great potential for analogical extension. (2)
This is a man who knows how to GLING. He's GLINGING. (Picture of a man exercising.) He did the same thing yesterday. What did he do yesterday? Yesterday he ___.
Children were overwhelmingly conservative in their responses. Only one of the 86 children in Berko's study supplied the analogical form bang and another produced glang. Ivimey (1975) replicated Berko's results with the same stimuli: virtually no children analogized the real or the pseudo-irregulars. Graves and Koziol (1971) tested children's responses to real irregular nouns such as foot and mouse—which were indeed over-regularized—but did not report results on goot and touse. These findings are strongly consistent with children's production data from many cross-linguistic acquisition studies (Section 16.3.2). The paucity of analogical forms and the abundance of rule-based generalizations stand in contrast with claims about the probabilistic and gradient nature of productivity and grammar (Keller, 2000; Hay and Baayen, 2005; Bresnan and Nikitina, 2009). In the words of McClelland and Bybee (2007, p. 439), "there is no dichotomous distinction between productive and unproductive phenomena; rather, there are only degrees of productivity." The appearance of gradience, however, seems methodological in nature. This is already evident in Berko's original study. While almost no child produced forms such as bang and glang, adults "clearly felt the pull of the irregular pattern, and 50% of them said bang or bung for the past tense of bing, while 75% made gling into glang or glung in the past" (Berko, 1958, p. 165). Similarly, Ivimey (1975) found that adults' tendency to use the irregular forms increased as a function of age and especially the level of education. Indeed, most case studies that support the claim of gradience were carried out with adult subjects (e.g., Bybee and Moder, 1983; Marcus, Brinkmann, Clahsen, Wiese, and Pinker, 1995; Clahsen, 1999; Hahn and Nakisa, 2000; Hayes, Zuraw, Siptár, and Londe, 2009; Becker, Ketrez, and Nevins, 2011). Some, however, have design flaws. For instance, Bybee and Moder (1983) and Hahn and Nakisa (2000) presented the subjects with questionnaires consisting of real irregular and pseudo-irregular words; the facilitatory effect is left uncontrolled for. Furthermore, while children clearly regarded the nonce words as if they were being taught new English words (Berko, 1958, p. 157), it is not clear how adult subjects approach the Wug test, as they may try to uncover the purpose of the task as in many experimental settings (Anderson, 1980). As Schütze (2005) notes, the instructions provided (to adults) generally do not rule out what he calls the dictionary
scenario, in which the subject approaches the test item as an obscure word in their native language and thus attempts to offer a reasonable guess of what the inflected form would be. This may have the effect of leading the subject to deviate from their "normal" linguistic behavior. It turns out that subjects frequently reported the Wug task as some kind of joke that required bizarre analogies (Derwing and Baker, 1977). Compounding the matter further is the use of rating tasks in the assessment of productivity (Kim, Pinker, Prince, and Prasada, 1991; Prasada and Pinker, 1993; Ramscar, 2002; Albright and Hayes, 2003; Ambridge, 2010). But these tasks, which are gradient in nature, would almost guarantee a gradient conception of productivity. In general, participants are inclined to spread responses over whatever range they are given (Parducci and Perrett, 1971). For instance, the classic study of Armstrong, Gleitman, and Gleitman (1983) finds that gradient judgment is obtained even for uncontroversially categorical concepts such as "even number." Still, virtually all studies have found that regularized forms are rated higher than irregularized forms in head-to-head comparisons. For example, in one of the few rating studies with children in the original Wug test age group (Ambridge, 2010: Appendix), only one (drit) out of 40 verbs tested had the regular form dritted rated slightly lower than the irregular form drit. This somewhat critical review is only meant to suggest that the tests for productivity, like all experimental methods, must be properly interpreted and may not provide a direct window into language and productivity.3 There may be several factors contributing to the mixed results reviewed so far. First, as suggested earlier, it is possible that adults and children approach experimental tasks differently. For instance, in artificial language studies, children generally apply categorical rules while adults tend to match the statistical distribution of variable forms (e.g., Hudson Kam and Newport, 2005; Schuler, Yang, and Newport, 2016). Similar differences have been observed between children and adults in behavioral tasks in other domains of cognition (Stevenson and Weir, 1959; Weir, 1964; Derks and Paclisanu, 1967). Whatever the nature of these differences is, it is important to note that adults in naturalistic settings are capable of learning categorical rules as in their second language however arduously (White and Genesee, 1996). Second, the previous conflicting findings may be due to the failure to distinguish productive from unproductive processes. In the former, such as the English past tense -d, rules always apply. In the latter, such as the English irregular past tense, the absence of a productive rule forces (adult) subjects to resort to analogical similarity under experimental conditions—only up to a point, as the discussion of morphological gaps in Section 16.3.3 will make clear. Such an interpretation of the behavioral results is similar to the findings
3 Similarly, I have primarily focused on children's production data as evidence for productivity. Other tests are possible but all come with caveats. The most obvious alternative is comprehension. For instance, Figueroa and Gerken (2019) find with a preferential listening task that 16-month-olds can distinguish nonce words from high-frequency and likely familiar nouns when both are suffixed with -d, for example, fimmed and trucked. Thus, children at this age know that verbs and nouns are distributionally different with respect to the -d ending/suffix. But this falls short of productivity, which requires that all verbs can appear with -d, sometimes with the irregulars as collateral damage.
in the study of categorization—linguistic or otherwise—that abstract categories may co-exist with prototypical exemplars (e.g., Armstrong et al., 1983; Pierrehumbert, 2001; Murphy, 2004). Again, clear evidence comes from actual languages. For example, despite adult subjects' willingness to supply and accept irregularized forms, the English language has not added a new irregular verb in a very long time: see Anderwald (2009) for comprehensive discussion of the historical record. All changes in the English verb system in the past few centuries have been the familiar pattern of irregular verbs drifting to the regular class (Anderwald, 2013; Ringe and Yang, 2020). Indeed, the English language of the recent past has presented two clear opportunities for the extension of irregular forms. The verbs bing (from the Microsoft search engine) and bling (to display ostentatious jewelry) both take -d and are fully regular, despite fitting the most favorable condition for irregularization (Bybee and Moder, 1983; Albright and Hayes, 2003).
16.3 Productivity in child language In this section, we provide a review of the quantitative findings of children’s morphology across languages. A categorical notion of productivity is strongly supported. We once again begin with the well-known case of English past tense.
16.3.1 Emergence An idealized depiction of English past tense acquisition is given in Figure 16.1 based on the longitudinal data of “Adam,” a child from Brown’s (1973) study and adapted from Pinker (1995 p. 116); see also Maslen, Theakston, Lieven, and Tomasello (2004). Adam’s irregular verb acquisition follows the well-known U-shaped learning curve (Marcus et al., 1992), a developmental pattern first discussed by Ervin and Miller (1963), Brown (1973), and MacWhinney (1975). Prior to 2;11, every irregular verb was used correctly before Adam succumbed to over-regularization and produced the first irregular marking error: What dat feeled like? Because feeled is inconsistent with adult input, it must have been the child’s spontaneous creation. The onset of over-regularization, then, is taken to be the emergence of productivity: the suffix -d in this specific case. Over- regularization errors persist from this point onward and will gradually diminish, due to the cumulative exposure to the irregular forms in the input that will override the productive rule over time. The precise nature of how the irregular verbs are learned and stored is a matter of considerable debate (Rumelhart and McClelland, 1986: Pinker and Prince, 1988: Clahsen, 1999: Yang, 2002) that will not concern us here. The development of the regular verb past tense marking is also of significant interest although this is often not considered in the same context. An important early observation is due to Ervin (1964): for some regular verbs, children do not mark them at all (and instead use the stem form) in obligatory past tense context until after the appearance
[Figure 16.1 here: a line plot of percentage correct (y-axis, 0 to 100) against age in years (x-axis, 2 to 4), with separate curves for irregular verbs when marked and for regular verbs when necessary.]
Figure 16.1 The developmental trajectory of irregular and regular past tense marking. Irregular verbs, when marked, are initially perfect before children start to over-regularize them, which indicates the emergence of -d productivity. The regular verbs are initially very inconsistently marked with -d in obligatory past tense contexts and become much more consistent at the same time when -d becomes productive.
of over-regularization of irregular verbs. Similarly, Marcus et al.'s (1992, Chapter VI) longitudinal study shows that before the onset of productivity, children marked both regular and irregular verbs inconsistently in obligatory contexts, producing numerous examples such as I see bird just now and I walk home yesterday. Past tense marking rose significantly across the board, for regular and irregular verbs alike, after the -d suffix became productive (with over-regularization errors). As Kuczaj (1977, p. 593) remarks, "Apparently, once the child has gained stable control of the regular past tense rule, he will not allow a generic verb form to express 'pastness,' which eliminates errors such as go, eat, and find, but results in errors like goed, eated, and finded, as well as wented, ated, and founded." A reasonable interpretation of the developmental change is that in the early stage, the child's regular verbs are lexically specific (i.e., Tomasello, 2003) and do not go beyond the scope of the input. Once they discover that -d is formally productive, they realize that tense marking is an obligatory feature of their language and proceed to mark it across the board for all verbs. In other words, the child's grammar becomes infinite and systematic after they detect formal productivity that resides in their finite, and necessarily arbitrary, vocabulary. Such an inductive step is obviously critical for the acquisition of language; it may also have important implications for children's conceptual development and change that interacts with language as we discuss in Section 16.5.2; see Landau (this volume) for general discussion.
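The dynamic just described, with a productive default that is gradually overridden for individual irregulars as input accumulates, can be illustrated with a toy simulation. This is not any of the specific models debated in the literature; the retrieval function and all numbers are invented solely to show how over-regularization can diminish with cumulative exposure once -d serves as the default.

```python
# Toy illustration: irregular past forms are retrieved from memory, and
# retrieval failures fall back on the productive -d default, yielding
# over-regularizations ("feeled") that become rarer as exposure accumulates.

import random

random.seed(1)
IRREGULARS = {"feel": "felt", "go": "went", "eat": "ate"}

def past_tense(verb: str, exposures: int) -> str:
    """Retrieve the stored irregular if memory is strong enough, else apply -d."""
    retrieval_prob = exposures / (exposures + 5)   # strengthens with input
    if verb in IRREGULARS and random.random() < retrieval_prob:
        return IRREGULARS[verb]
    return verb + "ed"                             # productive default

for exposures in (1, 5, 20, 100):
    forms = [past_tense("feel", exposures) for _ in range(1000)]
    rate = forms.count("feeled") / len(forms)
    print(f"{exposures:3d} exposures: over-regularization rate = {rate:.0%}")
```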
The emergence of productivity is a critical milestone in child language, and has rightly become a major focus of research from many theoretical perspectives. These transient moments are no doubt real. There is a moment when the child overgeneralizes a rule for the very first time. And for the acquisition of language-specific rules, all learning theories require some accumulation of examples: the child cannot conclude -d is productive for all verbs upon hearing only one instance. But these moments are nevertheless difficult to capture in "real time" as the record of individual children's longitudinal development is never complete. There is also a good deal of individual variation in the development of productivity noted from the very earliest studies of morphological acquisition (Cazden, 1968; De Villiers and De Villiers, 1973; Brown, 1973). For instance, while Adam produced the first instance of over-regularization at 2;11, Eve's first overuse of -d came at 1;10 (it falled in the briefcase). Brian, a child in the longitudinal study of Maslen et al. (2004), over-regularized at 2;5. Abe, another child with an extensive longitudinal record (Kuczaj, 1977), over-regularized during the first recording session at 2;3 (he falled) and his regular past tense marking in obligatory context is already very high (Marcus et al., 1992). A theory of productivity acquisition must leave room for such extensive individual differences even though, in general, all children eventually converge on the same set of rules; we discuss these matters further in Section 16.4.1.
16.3.2 Irregulars and regulars across languages Once a productive rule emerges, children acquire a systematic aspect of their language while occasionally over-extending its application. The rates and duration of such errors vary from case to case, and from child to child. Earlier corpus studies suggest that 4.2% of the English irregular verbs children produce are over-regularized (Marcus et al., 1992; Pinker, 1995), but later studies have found somewhat higher rates. For instance, Yang (2002) reports a rate of 10% based on approximately 7,000 tokens from four large longitudinal corpora in the public domain (see also Maratsos, 2000), and Maslen et al. (2004) find that 7.8% out of about 1,300 tokens produced from a single child are over- regularized. The robustness of over-regularization is uncontroversial but the literature has often given an impression that the irregular forms are also frequently extended. As discussed in Section 16.2, this may be the result of task effects in various behavioral tests for productivity. The evidence from children’s naturalistic production, by contrast, unambiguously points to a categorical distinction between regular and irregular processes. The acquisition of English again provides most detailed evidence. The frequent reference to analogical errors such as bite-bote, wipe-wope, think-thunk, etc. (Bowerman, 1982; Bybee, 1985; Pinker and Prince, 1988; Pinker, 1999; Ambridge, Kidd, Rowland, and Theakston, 2015) seems anecdotal: not a single instance can be found in the over four million words produced by English-learning children in the CHILDES database (MacWhinney, 2000). The most comprehensive empirical study of analogical errors (Xu and Pinker, 1995) dub these “weird past tense errors” on the basis of their rarity. Xu and Pinker examined over 20,000 past tense tokens produced by nine children, and
only 40 weird errors (0.2%) were identified. Similarly, an elicitation study of typically developing and SLI children also finds that when prompted for past tense of irregular verbs, children frequently over-regularize or leave the verb unmarked, but the stem changes characteristic of irregular verbs are very rare for both groups (Marchman, Wulfeck, and Weismer, 1999). The drastically different rates of over-regularization and over-irregularization suggest that there is a (near) categorical distinction with respect to productivity between productive rules and irregular rules. When children make mistakes, they almost always employ a default or productive form (e.g., thinked) or omit the appropriate form altogether: they almost never substitute with an inappropriate form. The categorical distinction between productive and unproductive morphological processes is strongly confirmed across languages. For example, in a study that targets the German agreement affixes -st (2sg) and -t (3sg), Clahsen and Penke (1992) find that while children supply an agreement affix in obligatory context only 83% of the time, almost all the errors are those of omission. When the child does produce an agreement affix, it is almost always the appropriate one (over 98% of the time); inappropriate use (e.g., substituting a -t for -st) is virtually absent. Similar patterns can be observed in the acquisition of Italian. In a cross-sectional study (Caprin and Guasti, 2009, p. 31), children in all age groups use a diverse and consistent range of tensed forms. Furthermore, the use of person and number agreement is essentially error free throughout, reaching an overall correct percentage of 97.5%, consistent with previous reports (Guasti, 1993; Pizzuto and Caselli, 1994). Children's impressive command of agreement is most clearly seen in the acquisition of languages with considerable morphological complexities. In a study of morphosyntactic acquisition in Xhosa (Gxilishe, de Villiers, de Villiers et al., 2007), children are found to gradually expand the use of subject agreement across both verbs and noun classes. The rate of marking in obligatory contexts as well as the diversity of the morphological contexts themselves steadily increased. In a process best described as probabilistic, the children often alternate between marking a verb root in one instance and leaving it bare in another, very much like the use/omission alternation pattern reviewed earlier. Crucially, virtually all agreement errors are those of omission: 139 out of 143 or 97.2% to be precise. Substitution errors are again very rare, confirming previous research on languages with similarly complex morphology (Demuth, 2003; Deen, 2005), including polysynthetic languages such as Inuktitut (Allen, 1996). We turn now to two case studies that focus more specifically on the contrast between regular and irregular morphologies in children's naturalistic speech. This type of evidence has been accumulating in the literature on the dual-route approach to morphology (Pinker, 1999; Clahsen, 1999), for which a categorical distinction between regular and irregular processes is of central importance. The results are again unambiguous. The German participle system consists of a productive default -t suffix (fragen-gefragt 'ask-asked'), as well as an unpredictable set of irregulars taking -n (stehlen-gestohlen 'steal-stolen') (Wiese, 1996).
In a series of studies, Clahsen and colleagues (Clahsen and Rothweiler, 1993; Weyerts and Clahsen, 1994; Clahsen, 1999) find that children across all age groups overapply the -t suffix to the irregulars, where the reverse usage is virtually
Systematicity and Arbitrariness In Language 337 absent. Their longitudinal data contains 116 incorrect participle endings, out of which 108 are -t errors (*gekommt instead of gekommen ‘come’, i.e., over-regularization). The rest are irregularization errors such as *geschneien for geschneit (snowed). According to the authors, the overall rate of -t regularization is 10% of all usage, which suggests that the -n irregularization rate is merely 0.75% (based on the eight -n errors compared to 108 -t errors). The acquisition of German past participles, therefore, is quite analogous to that of English past tense reviewed earlier (Xu and Pinker, 1995): both point to the productive asymmetry between regulars and irregulars. The inflection of Spanish verbs provides a complex but highly informative case for exploring productivity in child language. In Spanish, stems generally consist of theme vowels and roots, which are then combined with affixes for inflection. For instance, a finite form of the verb hablar (to talk) is habl-a-ba-ais, which represents the root (habl ‘speak’), the theme vowel (a), the past tense (ba) and the second-person plural (ais). The irregularity in Spanish inflection comes in two broad classes concerning the stem and the suffix respectively. There are some 30 verbs that are highly irregular with the insertion of a velar stop in certain inflections. The majority of irregulars undergo an alternation known as diphthongization, a process that is not limited to verbal morphology (Harris, 1969; Eddington, 1996). However, which verbs undergo diphthongization still must learned on an individual basis: It is possible to find minimal pairs such as contar (‘to count’) and montar (‘to mount’) where the former contains the diphthong (cuento) but the latter does not (monto). And there are a few common verbs that show both diphthongization and velar insertion in some forms. Although inflectional irregularities in Spanish mostly concern the stem, the suffixes are affected as well. For the stem querer ‘to want,’ for instance, the 1sg past tense is quise, which involves the stem change noted earlier but also takes an irregular suffix rather than the regular suffix, which would have resulted in quisí. The suffix in the 3sg past tense puso ‘she/he/it put’ is -o and the regular suffix would have formed *pusió. Clahsen, Aveledo, and Roca (2002) analyzed the verbal inflections of 15 Spanish- speaking children and found strong evidence for a categorical distinction between the regular and irregular inflections. (3)
a. The irregulars: children produced a total of 3,614 irregular verb tokens, out of which 168 (4.6%) are incorrect either in stem formation or suffixation.
i. Of the 120 stem-formation errors (see below), 116 are over-regularizations and only one is an analogical irregularization.
ii. Of the 133 suffixation errors, 132 are over-regularizations, with no occurrence of irregularization.
b. The regulars: children produced 2,073 regular verb tokens, only two of which involve the inappropriate use of irregular suffixes.
Collectively, then, the rate of analogical irregularization is only 0.001% for all verbs, and also 0.001% for the irregulars: again, orders of magnitude lower than the rate of over-regularization errors. Clahsen et al.’s study does not include errors regarding
diphthongs; all the stem-formation errors are failures to use a diphthong when required; it thus does not rule out the possibility of “mis”-diphthongization—for example, the child produces the [ie] alternation when the correct diphthong is [ue]. To address this issue, Mayol (2007) provides a finer-grained investigation of inflectional errors, focusing more specifically on the types of stem-formation errors and their underlying causes. The speech transcripts of six Spanish-learning children, almost 2,000 tokens in all, fail to yield a single misuse of diphthongization.
16.3.3 When productivity fails
The Wug test puts the unbounded creativity of language in the spotlight, but we would be remiss to leave out the corners of grammar where productivity breaks down. In a classic paper, Halle (1973) draws attention to morphological “gaps,” the absence of inflected words for no apparent reason. For instance, there are about 70 verbs in Russian that lack an acceptable first-person singular nonpast form (data from Halle, 1973 and Sims, 2006).
(4) *lažu ‘I climb’
*pobežu (or *pobeždu) ‘I conquer’
*deržu ‘I talk rudely’
*muču ‘I stir up’
*erunžu ‘I behave foolishly’
There is nothing in the phonology or semantics of these words that could plausibly account for their illicit status, yet native speakers regard them as ill formed. Such unexpected failure of productivity is, in fact, widely attested in the world’s languages (see Baerman, Corbett, and Brown (2010) and Fanselow and Féry (2002) for surveys). Even the relatively impoverished morphology of English contains gaps. For example, most speakers are not sure about the past tense form of the verb forgo (forgoed or forwent) or the past participle of the verb stride (strided or stridden) (Pullum and Wilson, 1977). Similarly, in most dialects of English, the negative clitic -n’t cannot be contracted to the auxiliary/modal verbs am (*I amn’t) and may (*you mayn’t) in the manner of haven’t, needn’t, and don’t (Zwicky and Pullum, 1983). In other words, children learning these English dialects must fail to learn a productive rule for these words in order to acquire the dialects correctly. Finally, as Halle (1973) notes, gaps show that there is no inherent productivity distinction between inflectional and derivational morphology: that the former tends to be productive and the latter tends not to must be captured by the language learning rather than some architectural property of the grammar. The failure of morphological productivity has only received scant attention in the acquisition literature. In an important study, Dąbrowska (2001) shows that in Polish, masculine nouns in the genitive singular either take an -a or -u suffix. A corpus analysis of child-directed Polish shows that -a is the numerical majority, covering over
Systematicity and Arbitrariness In Language 339 65% of the nouns, but fails to become the productive default; see Section 16.4.3 for details. Loanwords, for example, take on -a and -u in unpredictable fashion. And native speakers have to resort to brute-force memorization of noun-specific suffix, a process that extends well into teenager years (Dąbrowska, 2005), as the choice of the suffix seems arbitrary (Mausch, 2003; Westfal, 1956). The Polish genitive system, and the phenomenon of gaps more generally, pose considerable challenges to many leading theories of language and language learning. It suggests that child learners should not presuppose the existence of a default rule—and look for it—as presumed by the dual-route model of morphology (Clahsen, 1999; Pinker, 1999). Likewise, the absence of a default also poses challenges to competition-based theoretical frameworks such as Distributed Morphology (Halle and Marantz, 1993) and Optimality Theory (Prince and Smolensky, 2004) in which a winning form is expected to emerge. In summary, we have seen cross-linguistic evidence that when learning morphology, children draw a categorical distinction between productive and unproductive processes. These results conform with the traditional notion of productivity (Bloch, 1947; Nida, 1949; Botha, 1969; Bauer, 1983), and are at odds with the recent claim of gradience along a continuum of productivity (Jackendoff, 1996; McClelland and Bybee, 2007). That is not to say that productivity holds uniformly across individual speakers in all cases: a rule may be categorically productive for some speakers but categorical unproductive for others. We have already seen examples of this: at the age of 2;6, for instance, -d was categorically unproductive for Adam but categorically productive for Eve. But if the individuals’ productivity measures are pooled (as is frequently the case in experimental research), or the productivity test is inherently gradient (Section 16.2), it may give rise to the impression that productivity lies on a continuum. The categorical distinction between productive and unproductive processes, extensively supported across languages, should not be surprising: similarity-based analogies, which children steadfastly avoid, would not give rise to a stable and usable language shared by a community of speakers. Returning to our Saussurean theme, the emergence of productivity allows the child to form generalizations beyond their finite, and to a great extent, arbitrary experience. Furthermore, productivity must overcome another layer of arbitrariness: the irregular verbs, which derive from historical change but form essentially an arbitrary list, must be overcome for children to derive the -d rule. How, then, do children pick out just the productive rules in their language? Fortunately, the acquisition evidence reviewed this section places severe constraints on the space of possible models.
16.4 The learnability of productivity 16.4.1 Developmental constraints on learnability The preceding section can be viewed as a descriptive study of productivity from a developmental perspective: the child knows A at age X but B at age X+Y. The learnability of
340 Charles Yang productivity aims to provide a more complete explanation (Chomsky, 1965): What kind of learning mechanisms, acting on what kind of linguistic data, can facilitate the transition from A to B during the time span of Y? A description of productivity thus provides a set of design specifications for a learning-theoretic account, for which the psychological nature of language acquisition places additional constraints. For example, a descriptive account of productivity can be pursued with whatever methodological tools at the scientist’s disposal such as those reviewed in Section 16.2 and 16.3: corpus statistics, behavioral studies, observations of language change over time, and so forth. But children learning languages are much more resource-limited. They have no access to information such as “This is an irregular verb” or “That’s a productive rule,” or indeed behavioral or statistical correlates of productivity (e.g., that the reaction time of processing productive generally does not show whole- word frequency effects; Ullman, Corkin, Coppola et al., 1997; Clahsen, 1999). Rather, they are presented with a finite set of words that undergo an assortment of morphological processes: they must be able to determine which ones can, and cannot, generalize beyond the input. Finally, a learning theory must has plasticity to leave room for individual variation in language development, as reviewed by Potter and Lew-Williams (this volume), while ensuring the final outcome to be generally uniform across individuals. To illustrate this point, consider the first verbs produced by the six children in the Providence Corpus (MacWhinney, 2000). The biweekly recording sessions started at 1;0 and therefore provide a reasonable approximation of the children’s vocabulary growth. Using the morphological annotation in the corpus, I extracted the first 150 unique verb stems produced by these children in longitudinal order, divided into six sets with a size increment of 25 verbs, and the Jaccard similarities are computed for children’s verb vocabulary.4 Figure 16.2 reports the mean value and standard deviation across the 15 pairwise comparisons for the six stages. The individual differences across children are significant, especially during the early stages. We may never be able to predict exactly which word a child learns, but a theory of productivity must be able to “normalize” such differences to ensure an essentially uniform outcome of language learning, which has been abundantly documented in the quantitative study of language variation and change within speech communities (Labov, 1972; Labov, Ash, and Boberg, 2006). There has been no shortage of learning models of productivity. The so-called past tense debate, which has engendered much of the research on morphological productivity and its acquisition, was ignited by Rumelhart and McClelland’s (1986) connectionist network to model English past tense. At a high level, the network model was able to reproduce the U-shaped learning curve in child language development, that is, an initial stage of conservatism followed by the emergence of productivity that results in over- regularization. However, the Rumelhart and McClelland model makes unpredictable
4 For each pair of children, the Jaccard similarity is the ratio between the cardinality of their vocabulary intersection and the cardinality of their vocabulary union.
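Footnote 4’s measure is easy to make concrete. The sketch below is a minimal Python illustration; the two five-verb lists are invented stand-ins for the 25-verb increments extracted from the Providence Corpus, not actual corpus data.

def jaccard(vocab_a, vocab_b):
    # Size of the shared vocabulary divided by the size of the combined vocabulary.
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)

# Invented verb lists standing in for two children's early verb vocabularies.
child_1 = ["go", "eat", "want", "see", "put"]
child_2 = ["go", "eat", "want", "read", "fall"]
print(round(jaccard(child_1, child_2), 2))  # 0.43: three shared verbs out of seven distinct ones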
Figure 16.2 Children’s early verb vocabularies differ considerably across individuals. Data from the Providence Corpus. (The figure plots the Jaccard similarity index, on a 0.0–0.8 scale, against the first 25 to 150 verbs produced.)
errors that are unattested in child learning when trying to handle regular forms, such as membled for the past tense of mail (Pinker and Prince, 1988). While the specifics of these problems have been addressed with further improvements to this class of models (Plunkett and Juola, 1999), the core tension between productive and unproductive morphological processes remains. The categoricity of productivity in child language acquisition documented in the previous sections has almost completely escaped the attention of learning research. While every effort to model morphological learning has (rightly) focused on the over- regularization phenomenon, the absence of over-irregularization in child language has never been taken into account, save a brief comment in Marcus (1995, p. 277) that connectionist models produce similar rates for both. Indeed, the failure to distinguish productive and unproductive patterns has been a persistent defect of computational models of morphological learning. For example, a recent computational study by O’Donnell (2015) tests several (nonconnectionist) models of past tense learning. Most models are reasonably successful at passing the Wug test for regular verbs, but they all produce a high number of analogical forms on the basis of the existing irregulars: the best overall model produces 10% of over-irregularization patterns on novel items, which is orders of magnitude higher than the over-irregularization rate by human children (Xu and Pinker, 1995). The reason for these failures appears to be the inclusion of token frequency of words in the modeling effort. Because irregular verbs in English are quite frequent, they can lead the model to assign/reserve a large probability mass to the irregular morphological processes, in effect matching the frequencies in the input. As a result, irregular forms are analogized when models are presented with nonce words, especially those similar to existing irregulars, in contrast with the virtual absence of analogical errors by children.
342 Charles Yang Nevertheless, important insights have emerged from connectionist models, which actually converge with ideas from linguistic theorizing. It was long recognized that productivity is the result of high, and indeed, dominant coverage of (unique) words, or types. For instance, a classic text states that “(a) form which is statistically predominant is also likely to be productive for new combinations” (Nida, 1949, p. 45), and the regular suffix -d in English is explicitly identified with statistical majority (Bloch, 1947). In the first comprehensive study of morphology in generative grammar, Aronoff (1976, p. 36) quantifies the productivity of a word-formation rule (WFR) as follows: “we count up the number of words which we feel could occur as the output of a given WFR (which we can do by counting the number of possible bases for the rule), count up the number of actually occurring words formed by that rule, take the ratio of the two and compare this with the same ratio for another WFR.” Early psychological research on children’s morphology contains similar proposals (Ivimey, 1975; MacWhinney, 1978), and that is in fact also how Rumelhart and McClelland (1986) modeled English past tense. They first presented the network with a small sample of ten verbs with a dominant majority of eight irregular verbs, before a sudden influx of a much larger sample of 420 verbs with a dominant majority of 334 regular verbs. The legitimacy of these modeling assumptions aside (see Pinker and Prince, 1988 for discussion), the accumulation of a critical mass of regular verbs to overwhelm the irregulars is the driving force for the emergence of productivity (Marchman and Bates, 1994). A statistical majority (of types) will surely identify -d as the productive rule for English past tense: there are over 100 irregular verbs, somewhat fewer in regular circulation, and there are thousands more that take -d. But this strategy fails as a general principle for productivity learning. On the one hand, there are morphological gaps such as the Polish masculine genitive reviewed earlier: the suffix -a is by far the majority (65%) yet it fails to achieve productivity. On the other, there are cases such as the German noun plural system, where the suffix -s is clearly productive—along with at least some of the other suffixes—while not covering only a tiny proportion of nouns in the language (see Section 16.4.3). The Tolerance Principle is a model that formalizes the traditional insights into a unified solution for productivity.
16.4.2 The Tolerance Principle Learning a language requires discovering rules that generalize beyond a finite sample of data. The Tolerance Principle (Yang, 2005, 2016b; henceforth TP) is a theory of how such generalizations are formed. Specifically,
(5) Let a rule R be defined over a set of N items. R is productive if and only if e, the number of items not supporting R, does not exceed θN:

e ≤ θN = N / ln N
If e exceeds θN, then the learner will learn these items by lexically specific means and not generalize beyond them: that is, R is unproductive. Here I use the term “rule” as a theory-neutral term to denote a function that maps an input item to an output item. The function may be partial, as it must be for the case of morphological gaps reviewed in Section 16.3.3. The TP builds on the intuition in the discussion of learning models earlier, that rules must “earn” productivity by virtue of being applicable to a sufficiently large number of the candidates they are eligible for. If there are ten examples and all but one (9/10) support a rule, generalization ought to ensue. But no one in their right mind would extend a rule on the basis of 2/10: the learner should just memorize the two supporting examples. Productivity is a calibration of regularities and exceptions—crucially with respect to word types rather than tokens. The reader is referred to Yang (2016b) for the empirical motivation and formal analysis behind the TP: essentially, the TP specifies when a rule can be regarded as good enough for generalization. Table 16.1 provides some sample values of N and the associated threshold values θN. Note that θN decreases quite sharply as a proportion of N, which suggests that rules defined over a smaller vocabulary can tolerate relatively more exceptions, and are thus easier to learn. This has interesting consequences for language development and provides a theoretical underpinning for the idea of “less is more” (Newport, 1990; Elman, 1993; Yang, 2018), which will not be pursued here.

Table 16.1 The maximum number of exceptions for a productive rule over N items

N        θN      %
10       4       40
20       6       30
50       12      24
100      21      21
200      37      19
500      80      16
1,000    144     14
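The threshold in (5) involves nothing more than N/ln N, so the sample values in Table 16.1 can be recomputed directly. The following sketch illustrates only that arithmetic (rounding down, as in the table); it is not an implementation of any published learning model.

import math

def theta(n):
    # Maximum number of exceptions a rule over n items can tolerate and remain productive.
    return n / math.log(n)

def is_productive(n, e):
    # The Tolerance Principle in (5): productive iff e <= theta(n).
    return e <= theta(n)

for n in (10, 20, 50, 100, 200, 500, 1000):
    print(n, math.floor(theta(n)))  # reproduces the theta_N column of Table 16.1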
In contrast to most if not all learning models in linguistics and psychology (e.g., Shepard, 1987; Anderson, 1991; Nosofsky, Palmeri, and McKinley, 1994; Tenenbaum and Griffiths, 2001), the TP is parameter free. Productivity is determined by two input values (i.e., N and e) that are word counts in an individual learner’s vocabulary, and a categorical prediction is made without the need for parameter tuning or curve fitting. While the TP is a claim about all language learners, it clearly allows room for variation in the transient stages of language acquisition as well as in the stable grammars of individual speakers. The relationship between N and e, which may change during the course of
344 Charles Yang language acquisition, determines the status of the rule. If e is very low as a proportion of N, then children may rapidly conclude that a rule is productive. Otherwise, a protracted stage of conservatism may ensue, which may be followed by the sudden onset of productivity. It is also possible that no rule ever reaches the productivity threshold; gaps and other phenomena of ineffability arise. To apply the TP, it is important to obtain reliable estimates of N and e, ideally at the individual level. This is possible in the controlled setting of artificial language learning, as the items children are exposed to are under the complete control of the researcher. For instance, Schuler et al. (2016) taught children the labels for nine objects. In one condition, five of the nouns share a plural suffix and the other four have individually specific suffixes. In another condition, the split was three and six. Children were given a novel label in the singular in a Wug-like test afterwards: generalization was observed in the 5/4 condition but not in the 3/6 generalization, as the TP predicts that nine items can only tolerate four exceptions. In a further variation (Schuler, 2017), children were given a vocabulary test to assess their N and e, which subsequently confirmed personalized predictions of the Tolerance Principle. The TP has also been shown to be effective for even younger subjects on a passive language task (Koulaguina and Shi, 2019; Emond and Shi, 2021) where again the vocabulary size and composition can be carefully controlled. In general, however, vocabulary estimation of language learners is difficult. The challenge is even more formidable when a child-directed speech corpus is not available. Nevertheless, there are several mitigating factors such that the data poverty problem is not debilitating. On the one hand, child language, at least the core aspects of the grammar such as morphology and word order, is acquired very early, at an age where the vocabulary size is at most just over 1,000 items (Hart and Risley, 1995, Szagun, Steinbrink, Franik, and Stumper, 2006). On the other, there is converging evidence that lexical frequency can help to provide a reasonable approximation of vocabulary. Nagy and Anderson (1984), for instance, estimate that most English speakers know words above a certain frequency threshold (about once per million). Developmentally, it is also known that children’s vocabulary acquisition correlates with word frequencies in child-directed speech (Goodman, Dale, and Li, 2008), especially for open-class words, the primary arena of rule productivity. These two considerations suggest that the bulk of young children’s vocabulary can be found in the most frequent words in the language. For instance, the Chicago Corpus, a large longitudinal study of vocabulary acquisition by 62 children (Rowe and Goldin-Meadow, 2009; listed as an appendix of Carlson, Sonderegger, and Bane, 2014), produced a list of words assessed to be available to most English-learning children prior to 50 months. The vast majority can be found in the top 1,000 most frequent words in child-directed speech (MacWhinney, 2000). Finally, the calibration of productivity under the TP deals with the type frequency of words, and more specifically, the proportion of e relative to N. It is immaterial exactly what these words are, or how frequently they appear—assuming, of course, they are frequent enough to be learned by young children. 
Therefore, while different learners necessarily know different words, for obvious and non-obvious reasons (Figure 16.2) their grammars may still be the same if the relative proportions of N and e fall on the
same side of productivity. As Kodner (2019) shows in a cross-linguistic, cross-genre study, the vocabulary necessary for children’s rule learning can be effectively bootstrapped from adult corpora; the simulated samples almost always result in the same rules. The TP can thus apply, at least as a suggestive model of rule learning, even in the absence of precise vocabulary data from child learners.
16.4.3 Applying the Tolerance Principle
The application of the TP is a typical example of the hypothesis testing approach to language learning (Chomsky, 1965; Wexler and Culicover, 1980; Berwick, 1985; Trueswell et al., 2013). The learner forms a hypothesis, that is, a rule R, and then tests its productivity numerically with the quantities N and e. If R is productive, the learner will generalize it; otherwise, the learner will attempt to formulate a different rule and the process repeats. It is important to emphasize that the formation of a hypothesis and its subsequent evaluation are in principle two independent processes. For concreteness, consider linguistic rules of the form:
R: IF A THEN B (6)
where A provides the structural description for R, which, if met, triggers the application of B, the structural change. Quite generally, learning can be framed as a search problem that identifies the structural descriptions of the items that undergo a specific structural change. All explicit learning models in the study of language (e.g., Chomsky, 1955; Ivimey, 1975; Berwick, 1985; Skousen, Lonsdale, and Parkinson, 2002; Albright and Hayes, 2003), as well as models from adjacent fields such as artificial intelligence (Mitchell, 1982; Cohen, 1995; Yip and Sussman, 1997; Daelemans, Zavrel, and van der Sloot, 2009) and cognitive psychology (Medin and Schaffer, 1978; Feldman, 2000; Osherson and Smith, 1981), converge on a shared insight: inductive learning must proceed conservatively, drawing minimal generalizations from the data. As an illustration, we trace the inductive process using the Yip-Sussman model (Yip and Sussman, 1997, extended by Molnar, 2001) on the familiar example of English past tense. A TP-based learning model that handles English past tense, German noun plurals, and other morphological processes has been implemented (Belth, Payne, Beser, Kodner, and Yang, in press). The learner constructs rules as mapping relations between input (stem) and output (past tense). Both input and output are represented as a linear sequence of phonemes specified by their distinctive features in the Yip-Sussman model, but we will use English orthography for ease of presentation. The operation of the model is presented in Figure 16.3, with a sequence of input words that becomes incrementally available to the learner. The model forms ever more inclusive generalizations over the phonological properties of the verbs that take -d. Eventually, it concludes that -d has no restrictions whatever (*’s all around in Figure 16.3): the generality of a rule is directly related to the diversity of the words it applies to.
Generalization-based models as in Figure 16.3 do not have the capacity to distinguish productive from unproductive rules. Indeed, when executed on English, they produce the rules in (7) in addition to the -d rule:
(7) a. Rime → ɔt / ___ (e.g., think-thought, catch-caught, buy-bought, bring-brought, seek-sought, fight-fought, teach-taught)
b. d → t / en ___ (e.g., bend-bent, lend-lent, spend-spent)
c. ɪ → æ / ___ ŋ (e.g., sing-sang, ring-rang)
These rules, however, are not constrained for productivity. For example, the high degree of heterogeneity among the words in (7a) results in a rule that places no restrictions on its application, such that any verb could take on “ought” for its past tense, which is clearly illicit. Similarly, rule (7b) would replace /d/ with /t/ for verbs with the rime /end/, erroneously turning blend into blent, mend into ment, and end into ent, which no English speaker does. The TP effectively places a cap on the productivity of rules. The rules in (7) have open-ended scopes of application, but they would be almost immediately assessed as unproductive: the number of items that follow the rules is a tiny fraction of those that fit the structural description but do not follow them. It is especially interesting to study rules such as (7c), namely the verbs that end in /ɪŋ/ (‘-ing’), the only kind that adults, and only under experimental conditions, feel tempted to irregularize (Bybee and Moder, 1983; Albright and Hayes, 2003).
Figure 16.3 The learning of the regular rule (-d). Adapted from Molnar (2001). (The figure shows the successive states of the rule as each verb comes in: for walk do -d; for talk do -d; for *alk do -d; for bake do -d; for *k do -d; for kill do -d; for * do -d.)
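The incremental collapse of conditions depicted in Figure 16.3 can be mimicked with a deliberately crude sketch. The real Yip-Sussman model operates over distinctive features; here the forms are rough, invented phonemic spellings, and the only operation is to keep the word-final material shared by the current rule and the new verb, replacing everything else with a wildcard.

def merge(pattern, word):
    # Keep the shared final material of the current pattern and the new word;
    # everything to its left is replaced by the wildcard '*'.
    core = pattern.lstrip("*")
    i = 0
    while i < min(len(core), len(word)) and core[-1 - i] == word[-1 - i]:
        i += 1
    return "*" + core[len(core) - i:]

# Rough phonemic spellings for walk, talk, bake, kill (invented for illustration).
pattern = "wak"
for verb in ["tak", "bek", "kil"]:
    pattern = merge(pattern, verb)
    print(verb, "->", pattern)   # *ak, then *k, then * (no restriction at all)

With the chapter’s orthographic forms the intermediate stages would be *alk and *k, as in Figure 16.3; the endpoint, a rule with no structural restrictions, is the same.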
There are only 14 verbs in English that fall into this category:
(8) a. ‘ought’: bring (1)
b. ɪ → æ / ___ ŋ: sing, ring, spring (3)
c. ɪ → ʌ / ___ ŋ: swing, string, sting, fling, cling, sling, wring (7)
d. Regular: wing, zing, ding (3)
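The counts in (8) are all the TP needs; the last two checks below use the vocabulary figures cited in the next paragraph (76 irregulars among the 200 most frequent past-tense verbs, and roughly 650 regulars against some 100 irregulars later on). A minimal sketch of the arithmetic:

import math

def theta(n):
    return n / math.log(n)

# The fourteen '-ing' verbs in (8): even the largest class (8c, seven members)
# leaves seven exceptions, more than theta_14 (about 5).
print(7 <= theta(14))     # False

# The -d rule among the 200 most frequent past-tense verbs: 76 irregular exceptions.
print(76 <= theta(200))   # False: theta_200 is only about 37

# The -d rule once some 650 regulars sit alongside roughly 100 irregulars.
print(100 <= theta(750))  # True: theta_750 is about 113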
None of these patterns, including the largest class (8c), is numerically large enough to achieve productivity because the maximum number of exceptions is only θ14 = 5. It is clear, however, that the TP leaves room for individual and dialect variation (Herman and Herman, 2014). If a learner happens to receive input data in which one of these patterns commands an overwhelming majority—as determined by the numerical relationship between N and e—then that speaker can achieve productivity for patterns that are not productive in the “standard” English variety. The -d suffix, by contrast, can easily clear the threshold: against some 100 irregular verbs, some 650 regular verbs are sufficient to establish productivity. However, the numerical condition of the TP (Table 16.1) suggests that this will be a protracted development. Of the top 200 verbs inflected in the past tense (MacWhinney, 2000), 76 are irregulars. Because θ200 is only 37, it follows that children who know the 200 most frequent verbs cannot recognize the productivity of -d despite its statistical predominance. Productivity can only result when the number of regular verbs thoroughly overwhelms that of the irregulars; the TP provides a precise measure of the critical mass necessary for productivity suggested in the previous literature (Ivimey, 1975; Rumelhart and McClelland, 1986; Marchman and Bates, 1994). The individual differences in the emergence of productivity (e.g., Adam and Eve, reviewed in Section 16.3.2) appear to reflect the children’s vocabulary size (Yang, 2016b: Section 4.1.2). The TP can be applied fairly mechanically, provided that the researcher can identify plausible generalizations from the data on the basis of independently motivated developmental assumptions while quantifying the vocabulary counts that follow or defy such generalizations. It is intended to operate as a productivity calculation at all linguistic levels. For instance, phonological alternations, typically described as allophonic rules, may also have exceptions. A well-known example is the Philadelphia “short-a” system (Labov, 1989). It tenses /æ/ in a well-defined set of morpho-phonological contexts, but there are lexically specific exceptions. While /æ/ is lax before voiced stops, mad, bad, and glad tense, and while /æ/ productively tenses before tautosyllabic front nasals, three high-frequency irregular past tense forms, that is, ran, swam, and began, do not. The rules and their exceptions are all reliably acquired by native Philadelphia speakers (Roberts and Labov, 1995), and the TP can be applied to account for how the distribution of /æ/ is acquired and how it may change over time as the input experience changes (Sneller, Fruehwald, and Yang, 2019). When a rule defined over a set of words fails to reach productivity, the set can be partitioned further and productive patterns may be detected within the resulting subsets. From a cross-linguistic perspective, such recursive application of the TP is
348 Charles Yang likely the norm: morphological systems in general have “nested” patterns where productivity is defined over complementary distributions of words along some structural dimension (e.g., phonological, semantic, gender, and what is descriptively known as conjugations and declensions), as the simple default-plus-exception case of English past tense being an anomaly. As an illustrative example, consider the distribution of the noun plural suffixes in German: -s, -n, -e, -er, and -ø (the null suffix). Notably, the -s suffix covers only a small minority of nouns. Table 16.2 provides the statistics based on some 450 highly frequent noun plurals from child-directed German speech. The distribution of the suffixes is similar to data from larger corpora (e.g., Clahsen, 1999). This is a significant fact, suggesting that grammatical rules acquired from a child-sized vocabulary would generalize to larger data sets. An application of the TP clearly will not identify a productive suffix in Table 16.2 since none is anywhere near requisite threshold. Yang (2016b) reports several detailed case studies of this type: when children fail to discover a productive rule over a set of lexical items, they will subdivide the set along some suitable dimension and apply the TP recursively; see Belth et al. (2021) for a full implementation. For the German plural system, the relevant dimensions are gender (Mills, 1986) and the phonology of the final syllable (e.g., Wiese, 1996), which children appear to acquire in conjunction (Szagun, Stumper, Sondag, and Franik et al., 2007). Applying to the TP to the subdivided classes of nouns in Table 16.2 produces the correct results. The net effect is that almost all nouns are predictably accounted for by the four suffixes: each suffix will still have exceptions but the number of them fall under their respective tolerance threshold. For example, 146 out of the 166 feminine nouns take the -n suffix, easily clearing the threshold (134): the remaining 20 would be memorized as exceptions to the feminine rule, which children indeed occasionally over-regularize with -n (Gawlitzek-Maiwald, 1994; Elsen, 2002). The remaining non-feminine nouns, however, still do not productively support any of the suffixes, but further partitioning by gender (masculine vs. neuter) and the phonological condition on the final syllable yields productive rules for -ø, -e, and -er, although each rule has exceptions that must be rote- memorized. This removes almost all nouns from consideration when it comes to the -s suffix, which has no structural restrictions on the noun and thus becomes the default. Finally, recall that the TP tolerates a relatively low level of exceptions (as indicated by the logarithmic function; see Table 16.1). This accounts for the protracted development of productive rules with many and especially high frequency exceptions (e.g., -d) and will also detect the absence of productivity when none of the alternations meets the threshold. As noted in Section 16.3.3, the phenomenon of morphological gaps is difficult to account for by acquisition and theoretical models that assume the existence of a productive default. It similarly poses challenges for learning models especially those probabilistic in nature (e.g., Rumelhart and McClelland, 1986; Skousen, 1989; Albright and Hayes, 2003; Tenenbaum, Kent, Griffiths, and Goodman, 2011; O’Donnell, 2015), that produce the most favored output or sample from alternatives. 
Consider again the Polish masculine genitive singular (GEN.SG) suffixes, where neither -a nor -u is productive and the learner must resort to rote learning (Section 16.3.3). In contrast, the genitive plural (GEN.PL) for masculines is unproblematic: the default suffix
is -ów, with a small number of exceptional nouns taking -i/-y. Children acquiring Polish make very few errors in the genitive singular (GEN.SG), and frequently overextend -ów in the GEN.PL, just as this description leads us to expect (Dąbrowska, 2001). Applied to noun stems found in child-directed Polish, the TP provides a straightforward account of both patterns, summarized in Table 16.3; see Gorman and Yang (2018) for details. For the singulars, neither -a nor -u can be productive because the maximum number of exceptions is 187 (1353 / ln 1353). The absence of a productive suffix offers no opportunity for over-regularization, and children’s performance on both suffixes is consistently high. For the plural, however, the productivity of the -ów suffix is unperturbed by the presence of 61 exceptions, which fall far below the threshold of 95 (612 / ln 612). It thus serves as the attractor for over-regularization: to wit, -i/-y nouns have the highest error rates by far, even though they have a higher average frequency in child-directed Polish.

Table 16.2 Distribution of noun plural suffixes for highly frequent nouns in child-directed German

Suffix   Types   Percentage
–ø       87      18.9
–e       156     34.1
–er      30      6.5
–n       172     37.5
–s       13      2.8
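The failure of any suffix over the flat set in Table 16.2, and the payoff of partitioning by gender described earlier (146 of the 166 feminine nouns taking -n), can both be checked directly. A minimal sketch of the arithmetic only, not of the learning model:

import math

def is_productive(followers, exceptions):
    n = followers + exceptions
    return exceptions <= n / math.log(n)

# Flat application over all 458 noun types in Table 16.2: even the largest class
# (-n, 172 types) would need to tolerate 286 exceptions, far above theta_458 (about 75).
print(is_productive(172, 458 - 172))  # False

# After partitioning by gender: 146 of the 166 feminine nouns take -n,
# and the 20 exceptions fall well under theta_166 (about 32).
print(is_productive(146, 20))         # True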
16.5 From form to meaning The conception of productivity in this review, and the learning-theoretic approach to it, take the position that the formal system in language is established by distributional means from the input (Harris, 1951; Chomsky, 1955, 1957). If so, we may begin to contemplate how it figures into the traditional conception of language and cognition. On the one hand, a learning model capable of capturing linguistic generalizations from the input data may change the perspective on explanatory adequacy (Chomsky, 1965). A primary motivation for domain-specific principles of Universal Grammar is to constrain the child’s hypothesis space and eliminate logically possible but empirically unattested options. But such principles may be dispensable if the learning model can prevent wild hypotheses all by itself. Children would not even attempt to generalize a property from two verbs to 20 or 100; see the discussion of (7). And crazy rules—e.g., only verbs ending in nasals can passive—would be instantly rejected even if they were entertained by the learner. The net result would be a simpler theory of Universal Grammar (Berwick and
Chomsky, 2017). On the other hand, pursuing the Saussurean theme that the formal correspondences between linguistic form and meaning bring order to thought, it is possible that the acquisition of productivity, which crucially depends on experience (e.g., N and e), may result in differential trajectories in children’s conceptual development across cultures—as a result of their language (Yang, 2016a). In this final section, I will briefly discuss one case study from each direction.
16.5.1 Productivity and linguistic theory
Consider a well-known problem in the acquisition of argument structure and syntax-semantic mapping, the English double-object construction (Baker, 1979):
(9) a. John told the students a story.
b. *John said the students a story.
c. John offered the students a book.
d. *John donated the students a book.
e. John promised the students a pizza.
f. *John delivered the students a pizza.
A substantial linguistic literature is concerned with providing precise characterizations of these verbs and their structural properties, to the extent that distributional patterns such as those in (9) can be predicted; see Harley and Miyagawa (2017) for a recent review. Likewise, the language acquisition literature (e.g., Pinker, 1989; Levin, 1993) assumes (innate) mapping relations between the semantic properties of verbs and their syntactic manifestations; see Jackendoff (this volume) and Alexiadou (this volume) for an overview of lexical semantics and its relation to syntax. The verbs in (9) indeed fall into the semantic class “caused-possession” (Green, 1974), but that is clearly only a necessary condition and not a sufficient one. In fact, Levin (1993) lists nearly 250 English caused-possession verbs, but fewer than half participate in the double-object construction, with a high degree of lexical idiosyncrasy as illustrated in (9), and presumably
Table 16.3 Distributions of genitive suffixes on Polish masculine nouns, the productivity predictions of the Tolerance Principle, average frequency (mean number of tokens per million words), and children's error rates

         Suffix   Types        Productive?   Avg. freq.   Child error rate
gen.sg:  –a       837 (62%)    No            7.2          1.28%
         –u       516 (38%)    No            8.8          0.24%
gen.pl:  –ów      551 (90%)    Yes           6.5          0.41%
         –y/–i    61 (10%)     No            11.4         15.53%
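The Productive? column of Table 16.3 follows from the type counts alone, and the thresholds of 187 and 95 cited earlier fall out of the same computation. A minimal check:

import math

# (suffix, types following, types not following) from Table 16.3.
cases = [("gen.sg -a", 837, 516), ("gen.sg -u", 516, 837),
         ("gen.pl -ów", 551, 61), ("gen.pl -y/-i", 61, 551)]

for label, followers, exceptions in cases:
    n = followers + exceptions
    verdict = "Yes" if exceptions <= n / math.log(n) else "No"
    print(label, verdict)   # No, No, Yes, No, as in the table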
Systematicity and Arbitrariness In Language 351 individual speaker variation as well.5 Moreover, while all languages appear to have the semantic class of “caused-possession,” their syntactic distributions differ widely. In Korean, for example, the equivalent of the double-object construction is restricted to a handful of verbs (Jung and Miyagawa, 2004), and there are languages such as Chamorro that disallow the double-object construction altogether (Chung, 1998). Whatever innate syntax-semantic mapping is available, learning from language specific and lexically idiosyncratic data is inevitable. While children must be able to grasp the concept of caused possession and form verb semantic classes, it does not seem necessary to assume that they also need to have any prior expectation of the syntactic construction associated with these verbs. In fact, a minimalist distributional learning model of hypothesis formation and testing can be proposed: (10)
a. Observation: A child learner observes a set of verbs V1, V2, . . . , VM that participate in the syntactic construction “V NP NP.”
b. Hypothesis formation: The learner proceeds to inductively identify a semantic class C over the verbs V1, V2, . . . , VM.
c. Hypothesis testing: The learner identifies the total number of verbs (N, N ≥ M) in their vocabulary that belong in the semantic class C.
   c1. If (N − M) < θN, then the learner extends the use of double-objects to all members of C.
   c2. Otherwise, the learner lexicalizes the M verbs as allowing double-objects but will not extend the construction to any other item.
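Steps (10b–c) lend themselves to a very small sketch. The classification into the semantic class C is taken as given here (it is the hard part); the function below only implements the numerical test in (10c), and the counts in the call are the child-directed figures reported in the next paragraph.

import math

def evaluate_construction(attested, class_size):
    # (10c1)/(10c2): extend the construction to the whole class C only if the
    # unattested members (N - M) can be tolerated as exceptions under the TP.
    exceptions = class_size - attested
    if exceptions <= class_size / math.log(class_size):
        return "extend to all members of C"
    return "lexicalize the attested verbs only"

# 38 of 49 caused-possession verbs attested in the double-object frame (see below).
print(evaluate_construction(38, 49))   # extend: 11 unattested verbs fall below theta_49 (about 12.6)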
The model in (10) can be deployed on corpora representative of language acquisition data. A five-million-word corpus of child-directed English (MacWhinney, 2000) contains 42 verbs that appear in the double-object frame “V NP NP” (Observation; 10a). Of these, 38 have the clear semantic property of caused possession: the four exceptions are the performative verbs call, consider, name, and pronounce, which nevertheless fall below the tolerance threshold (42∕ ln 42 = 11). These generalizations can in principle be derived automatically by models such as the Yip-Sussman learner (Figure 16.3) if the semantics of verbs is decomposed into more primitive feature representations as in the structuralist tradition of componential analysis (e.g., Nida, 1975) and its modern descendants (Fillmore, 1968; Dowty, 1979, 1991; Jackendoff, 1990). Thus, the hypothesis formation step (10b) succeeds as the learner may conclude that the caused possession verb class is a necessary condition for the double-object construction. In the hypothesis testing step (10c), the learner would evaluate all of the verbs in their vocabulary with the semantics of caused possession—49 from the corpus—to see if the subset actually attested in the construction (i.e., 38) is sufficient for generalization: just about, as 37 is required (49−49∕ ln 49). In other words, a reasonable sample of English input 5
In the history of English, the syntactic distributions of these verbs have changed considerably even though their meanings have remained stable (e.g., Visser, 1963; Mitchell, 1985).
data supports a productive correspondence between the semantics of caused possession and the syntax of the double-object construction. The well-documented errors in child English (Bowerman, 1982; Gropen, Pinker, Hollander, Goldberg, and Wilson, 1989) such as I said her no, Shall I whisper something, I am going to deliver you some milk, etc. are thus expected. To get a more complete picture of the English speaker’s knowledge of the construction and to account for the distributional patterns in (9), we need to go beyond the child-directed sample and approximate the linguistic input of a typical speaker. Bootstrapping off CHILDES into larger corpora (see Yang, 2016b, p. 207ff for details), we can obtain a list of dative verbs, sorted by frequency, that provides important insight into how productivity changes as a function of the learner’s vocabulary. Table 16.4 shows that if a child only learns from the most frequent dative verbs, such as those found in CHILDES, the double-object construction will be deemed productive, because the vast majority of these verbs—sufficiently many as assessed by the TP—will be attested in the construction. Indeed, the semantics of the most frequent verbs have very salient manifestations in language use that involve the physical transfer of objects (e.g., give, hand, bring, throw) or the transmission of information (e.g., tell, ask, read), which may help the child to quickly form the semantic generalization of caused possession as described in the hypothesis formation step in (10b). However, as the learner’s vocabulary expands, the construction will no longer meet the productivity threshold: the hypothesis testing step in (10c) eventually fails. The learner will then retreat from the over-generalization and lexically memorize those verbs that do participate in the construction: “I said you something” will then disappear. Under the TP approach, the learner has only one hypothesis at any time: the grammar is either productive or not. Thus, the conundrum that arises from choosing between an unproductive hypothesis and a productive hypothesis is not an issue, sidestepping the Subset Problem and the ineffective use of indirect negative evidence (Yang, 2017).

Table 16.4 Caused-possession verbs and their expected distribution in the double-object construction

Top N   Yes   No    θN    Productive?
10      9     1     4     Yes
20      17    3     7     Yes
30      26    4     9     Yes
40      30    10    11    Yes
50      34    16    13    No
60      39    21    15    No
70      43    27    16    No
80      46    34    18    No
92      50    42    20    No
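The last column of Table 16.4 can be recomputed from the attestation counts alone; a minimal sketch:

import math

# (Top N, number attested in the double-object frame) pairs from Table 16.4.
rows = [(10, 9), (20, 17), (30, 26), (40, 30), (50, 34),
        (60, 39), (70, 43), (80, 46), (92, 50)]

for n, attested in rows:
    productive = (n - attested) <= n / math.log(n)
    print(n, "Yes" if productive else "No")   # Yes for the top 10-40 verbs, No from 50 on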
Systematicity and Arbitrariness In Language 353 If this line of approach is correct, then children can directly acquire language specific rules that grammaticalize conceptual categories and relations such as animacy, causation, intentionality, quantity, and so forth. These categories and relations are likely innately available to the pre-linguistic infants (Leslie and Keeble, 1987; Gergely, Nádasdy, Csibra, and Bíró, 1995; Kotovsky and Baillargeon, 1998; Woodward, 1998; Feigenson, Dehaene, and Spelke, 2004; Sommerville, Woodward, and Needham, 2005; Spelke and Kinzler, 2007; Carey, 2009); they are also invoked as primitives as in the decompositional representation of linguistic meaning (Fillmore, 1968; Dowty, 1991), to which generalization models such as Figure 16.3 and principles of productivity assessment such as the TP can apply. Importantly, recall that children acquire the core syntax-semantics rules very early (Brown, 1973; Golinkoff, Hirsh-Pasek, Cauley, and Gordon, 1987; Naigles, 1990); it must have been done on a relatively smaller vocabulary, under which the TP would be especially robust and effective. The result would be a distributional learning theory of how the formal system for syntactic bootstrapping (Gleitman, 1990) becomes available to children learning their specific languages. This reduces the need for an innate, universal, and highly domain-specific theory of syntax and semantics mappings, which in any case would be difficult to maintain given the level of lexical idiosyncrasy across languages (Pinker, 1989; Borer, 1994; Levin and Rappaport Hovav, 1995).
16.5.2 Productivity and conceptual development As a final example, consider the development of number concepts in children (Gelman and Gallistel, 1978; Dehaene, 2011). Despite the apparent connection between natural language and natural number (Chomsky, 1988), and in contrast to the very early acquisition of language, children develop number knowledge in a protracted fashion (Feigenson et al., 2004; Carey, 2009). While very small numbers can be understood innately (Wynn, 1992) or by direct experience (e.g., counting objects; Mix, Huttenlocher, and Levine, 2002), the concept of large numbers can only be obtained via generalization: specifically, the Successor Function, that every number is followed by another number exactly one greater. The impact of language in children’s understanding of number has long been noted (Hale, 1975; Miller, Smith, Zhu, and Zhang, 1995; Fuson and Kwon, 1991; Siegler and Mu, 2008; Gordon, 2004; Pica, Lemer, Izard, and Dehaene, 2004) although the underlying causal mechanism has remained obscure. The Successor Function is a semantic relation that holds between two number concepts. If, by hypothesis, semantic systematicity can only follow from formal systematicity, then the Successor Function can only be acquired when the child establishes the systematic formal relation between successive placeholder expressions (i.e., numerals) that represent the concepts. This is strongly analogous to the acquisition of linguistic rules that have exceptions. As discussed in Figure 16.1, children’s consistency of tense marking in obligatory context increases significantly after the very first instance of over-regularization, that is, the discovery that -d is formally productive. Prior
354 Charles Yang to that point, their knowledge of tense marking, and perhaps the notion of “pastness,” appears to be lexically specific (Ervin, 1964; Kuczaj, 1977; Marcus et al., 1992). As previously observed (Fuson, 1988), English numerals have 17 expressions (1–13, 15, 20, 30, 50) that require some kind of rote memorization as they do not conform to the transparent structure of digit in combination with teens and decades. To discover the productive rules for counting, the child must learn a sufficiently large number of “regular” numerals to overcome the 17 exceptions. The problem of learning to count, then, reduces to a problem of productivity in language acquisition. According to the Tolerance Principle, English-learning children need to count to at least 73 to learn the English numeral rules as N = 73 is the small N such that N∕ ln N = 17. This prediction receives support from the study of children’s counting sequences (Fuson, 1988): once English-learning children can count to the 70s, they generally continue all the way to 100 where counting tasks typically conclude (Fuson, Richards, and Briars, 1982). Moreover, it appears that productive counting is necessary for generalizing the successor relation that holds of the small numbers. For instance, only children who can count past 80 show systematic knowledge of the Successor Function (Cheung, Rubenson, and Barner, 2017) although no theoretical reason was given for this observation. The productivity based approach to number makes a strong claim across languages, as the TP makes precise predictions about the tipping point at which counting becomes productive for specific languages: significant change in children’s understanding of the Successor Function should follow. A recent study (Yang, Lei, and Lee, 2019) provides a direct test with Cantonese. The Cantonese numeral system, like the Chinese system widely adopted in East Asian languages, have a very transparent counting system. Only 12 numerals are idiosyncractic and require rote memorization: 1–10, and two linear patterns of teens (i.e., ten-digit) and decades (i.e., digit-ten). By hypothesis, children only need to count past 46 to understand the Successor Function. Indeed, unproductive counters (i.e., those who could not count past 50) performed at chance on a binary outcome task (Sarnecka and Carey, 2008) that assesses their understanding of the Successor Function. By contrast, productive counters who counted past 50 performed at nearly 90% across a wide range of numbers. Such a quantal change in the productivity of counting and the understanding of number is very similar to the English past tense: again, the formal productivity of -d is critical for the development of tense. It turns out that Cantonese-learning children show an understanding of the Successor Function over a full year earlier than English-learning children: the cognitive advantage derives from nothing but a simpler linguistic system.
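The predicted tipping points for counting follow mechanically from the number of rote-memorized numerals; the figures of 17 (English) and 12 (Cantonese) idiosyncratic forms are those given above. A minimal sketch:

import math

def productive_counting_point(rote_forms):
    # Smallest N such that the rote-memorized numerals can be tolerated as
    # exceptions to the counting rules, i.e., rote_forms <= N / ln N.
    n = 2
    while rote_forms > n / math.log(n):
        n += 1
    return n

print(productive_counting_point(17))  # 73: English, with 17 unpredictable numerals
print(productive_counting_point(12))  # 46: Cantonese, with 12 unpredictable numerals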
16.6 Summary This chapter reassesses some very traditional ideas about the nature of language— Saussurean arbitrariness and Humboltian infinity—in light of progress in the study of linguistic productivity, language acquisition, and cognitive science. The formal nature
Systematicity and Arbitrariness In Language 355 of the learning-theoretic approach should not overshadow the fact that the calibration of productivity is ultimately a psychological process, which interacts with other components of cognition and perception that we share with our biological relatives (Hauser, Chomsky, and Fitch, 2002; Berwick and Chomsky, 2017). Challenging problems lie ahead. We need to further develop formal and quantitative learning models that conform to the developmental evidence (Section 16.2 and 16.3). To the extent this direction of research holds promise (Section 16.4), it raises fundamental questions about the nature of language and its place in cognition (Section 16.5). In particular, the Tolerance Principle provides a causal mechanism that connects the acquisition of the vocabulary (i.e., arbitrariness) to the rise of productivity (i.e., systematicity). Saussure’s vision of language as the medium between form and meaning is remarkably consistent with the view that emerged after 60 years of generative grammar and cognitive science. He likened thought to an “indefinite plan of jumbled ideas.” Phonetic sounds are “equally vague” given their inherent variability in a continuous acoustic space. The direct correspondence between these “two shapeless masses” provides very little expressive power—like the English irregular verbs, essentially a random list of word forms. Such a correspondence is necessarily finite and highly dependent on experience. The principle of arbitrariness is merely the starting point, or a design specification, to be rescued by the “principle of order and regularity,” which, on my view, is a theory of productivity which operates formally on a quantitative basis. And that appears to be exactly Saussure’s conception of language as a science: “Linguistics then works in the borderland where the elements of sound and thought combine: their combinations produce a form, not a substance” (CLG, p. 113).
Chapter 17
Children’s Use of Syntax in Word Learning
Jeffrey Lidz
17.1 Introduction One might think that it is too obvious to mention that when it comes to language use, humans are free to say whatever they want to, or, in the case of children and politicians, whatever they can get away with. What we say is determined by our communicative goals—whether we are trying to provide information, get information, lie or mislead, issue commands, provide warnings, make promises, or simply to speculate about the future (Austin, 1962; Searle, 1969; Murray and Starr, 2020; see also Schwarz and Zehr, this volume). Our communicative goals, in turn, are determined by a host of factors relating to our beliefs and desires, including our beliefs about and desires for the people we are speaking to, in hopelessly complex ways. With this complexity in mind, it is somewhat odd that our naïve intuitions about word learning seem to emanate from the idea that speakers operate in a descriptive mode. Following this intuition, we say the word “dog” when there are dogs around, and hence a child who doesn’t yet know the meaning of this word-form merely has to look around and notice the contingency between the form and the thing. My own experiences as a parent reveal the hopelessness of this naïve theory. My children use the word “dog” in a never-ending harangue about why we do not have one. Consequently, a dog novice would have a hard time figuring out the meaning of “dog” based on observations of what’s happening in the physical world when people in my family use that word. Utterances of the word “dog” are conditioned not by dogs per se, but by thoughts of dogs and the utility of getting our interlocutors to have dog-thoughts given our communicative goals. Learning the meaning of a word, then, must involve sifting through all of that psychology. If word learners were psychics, word learning would be trivial. But given that word learners are just like the rest of us in lacking the ability to read minds, there must
Children’s Use of Syntax In Word Learning 357 be some kind of evidence that they could rely on that could at least point them in the right direction. In this chapter, we consider one kind of evidence that serves to focus child word learners in on the relevant dimensions of meaning—syntactic distribution. We will see that a word’s syntax provides evidence about its meaning (and why it does so) and that from the earliest stages of word learning, children are sensitive to that information. This “syntactic bootstrapping” theory of word learning does not solve the problem of word learning, but it does place some constraints on the learner that, when paired with other capacities, including our ability to estimate the goals underlying people’s actions and utterances, make significant contributions to our ability to acquire word meanings.
17.2 The puzzle of word learning Quine (1960) famously observed that the extralinguistic environment accompanying the use of a word leaves open the particular concepts in the mind of the speaker that condition that use. A situation in which a speaker uses a word to refer to a rabbit might be identical to one in which he intended to refer to the rabbit’s fur, the speed at which it is moving, or to the memory of a delicious stew (cf. Chomsky, 1959). Even if a speaker is talking about the here and now, and even if learners could somehow zero in on the part of the world being picked out by a novel word, recognizing, for example, that the speaker was attending to the rabbit, this reference would be consistent with a broad range of concepts with overlapping extensions (e.g., Peter Rabbit, animals, mammals, rabbits in the yard, rabbits or black holes, rabbits with more than one ear, things physically identical to rabbits, etc.; Goodman, 1955). Indeed, it was precisely this kind of indeterminacy that led Quine to reject the very idea of word meanings being constant across speakers. We can respond to these philosophical problems by noting that humans have perceptual, conceptual, and linguistic abilities that limit the space of plausible word meanings (Chomsky, 1971; Fodor, 1975). These abilities lead learners to experience the world through the same conceptual apparatus as the speakers producing those words, making meanings like “rabbit or black hole” or “rabbit in the yard” unlikely candidates for word meanings (Gleitman, 1990; Spelke, 1990; Markman, 1990; Waxman and Lidz, 2006; Carey, 2009). Moreover, these conceptual burdens may be further reduced by learners’ ability to track the goals and intentions of their interlocutors (Baldwin, 1991; Bloom, 2001; Clark and Amaral, 2010). But even granting learners these constraining capacities, the world still vastly underdetermines the concepts that learners might invoke in explaining the use of a novel word (see also Jackendoff, this volume). And, as Landau and Gleitman (1985) observed, learners somehow acquire meanings even for words whose content is absent from their sensory experience, for example when blind children learn meanings for words like look and see.
358 Jeffrey Lidz The insufficiency of the world is perhaps made most clear experimentally in the Human Simulation Paradigm (Gillette, Gleitman, Gleitman, and Lederer, 1999). In this paradigm, adults are shown videos of parent-child interactions with the sound turned off and are then asked to guess what word the parent said at a particular point in the video. Overall, participants were remarkably bad at this task. For nouns, they guessed the correct word 44% of the time on average, and for verbs, they guessed correctly only 15% of the time. One might expect that the weakness of the world as an information source could be overcome by multiple exposures, over which the learner could find a common thread (e.g., Smith, 2000). Pull on that thread and the meaning will be revealed. However, participants in these experiments actually had six attempts to make their guesses, and improvement across trials was very weak at best. This lack of improvement suggests that integrating information across occurrences does not help much. Indeed, the lack of similarity across contexts led to a significant rise in guesses like “toy” or “look” as the trials progressed. This fact could reasonably lead to the expectation that a learning procedure based in cross-situational comparison would end with every word having an extremely weak meaning so that all contexts would fall under its extension. And empirically, there is little evidence that supports the idea of cross-situational learning and lots of evidence against it (see Gleitman and Trueswell, this volume, for review). But perhaps more importantly, once we recognize that we only rarely use sentences to describe the events before our eyes (after all, the person we are talking to can see what’s going on as well as we can), the expectation that we learn the words by placing them in correspondence with the bits of the world that they refer to seems like an idea we never should have taken all that seriously in the first place.
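To see why a purely cross-situational strategy is so weak, consider a minimal sketch of an intersective cross-situational learner (in Python; the word, the situation features, and the data are invented for illustration and are not drawn from the studies discussed here). Because the situations accompanying real uses of a word rarely share observable features, intersecting them quickly leaves almost nothing behind.

```python
# A minimal, hypothetical sketch of an intersective cross-situational learner.
# Each "observation" is the set of features a learner might extract from the
# scene accompanying one use of the word "dog". Features and data are invented
# for illustration; they are not from the studies cited in this chapter.

observations = [
    {"dog-present", "park", "daytime", "people-talking"},
    {"no-dog-present", "kitchen", "child-whining", "people-talking"},  # "Why can't we get a dog?"
    {"dog-present", "tv-screen", "evening", "people-talking"},
    {"no-dog-present", "car", "daytime", "people-talking"},            # talking about a dog seen earlier
]

def intersective_meaning(obs):
    """Keep only the features shared by every situation in which the word occurred."""
    meaning = set(obs[0])
    for situation in obs[1:]:
        meaning &= situation
    return meaning

print(intersective_meaning(observations))
# -> {'people-talking'}: the only invariant is that someone was talking,
# i.e., a meaning so weak that nearly any context falls under its extension.
```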
17.3 Why syntax might help Landau and Gleitman (1985) described the language acquisition of a blind child, noting that she learned the meanings of look and see, despite lacking visual inputs. She had learned them as perception verbs, but mapped their meanings onto haptic rather than visual perception. For example, she responded to the request to “Look up!” by reaching upwards with her hands, whereas sighted children wearing a blindfold turned their heads upwards. Landau and Gleitman initially reasoned that “look” must have been used by the child’s parents when the object to be apprehended was near to the child, and hence available for manual inspection. But dividing verbs along this situational dimension did not distinguish these verbs from other verbs, such as touch or hold, that had nothing to do with perception. So instead, they hypothesized that the syntactic distribution of look compared to touch and hold drives this difference in interpretation. Look occurs in different syntactic environments than touch or hold, as shown in (1–3): (1) a. Look where I’m going. b. *Touch/hold where I’m going.
(2) a. Look at that picture. b. *Touch/hold at that picture. (3) a. Look down. b. *Touch/hold down. Perhaps the blind child in this study first used the syntactic distributions to distinguish perception verbs from other verbs (using features like compatibility with a clausal complement), and then used the near/far distinction to work out how these verbs came to be associated with manual inspection. Thus, the syntactic context provided the information that the verbs had something to do with perception, and the extralinguistic context provided further information about the kind of perception that was relevant. This kind of learning procedure, using syntax to zero in on the semantic domain and then combining this information with an understanding of the extralinguistic context, could generalize beyond blind children. Any learner armed with knowledge of how meaning projects onto syntax could use the syntax of a word to zero in on a specific semantic domain and then use the extralinguistic context to further specify the word's meaning. The extralinguistic context provided information only after the initial syntactic/semantic partition. This initial partitioning is what has come to be known as syntactic bootstrapping (Landau and Gleitman, 1985; Gleitman, 1990).
17.4 Two dimensions of syntactic bootstrapping The term “syntactic bootstrapping” actually covers two closely related ideas. The first idea was that syntax is a constraining factor in narrowing reference. Given the lack of information in a scene by itself about a verb’s meaning, the syntactic structure of a clause could direct a learner to have a particular perspective on the scene. That perspectival information would then allow the learner to identify the event labeled by the sentence and hence the meaning of the verb in that sentence. In Gleitman’s words, “the structure of the sentence that the child hears can function like a mental zoom lens that cues the aspect of the scene the speaker is describing,” (Gleitman and Gleitman, 1992). The second idea was that semantic properties of a verb’s meaning stand in a principled relation to its syntactic distribution (Fillmore, 1970; Jackendoff, 1972; Levin, 1993; see also Jackendoff, this volume). Consequently, observations of a verb used in a range of syntactic environments could help the learner identify the semantic properties of that verb’s meaning. These two ideas about syntactic bootstrapping are linked together by the fundamental idea that the less information there is in the world about a verb’s meaning, the more information will be found in the syntactic distribution (Snedeker and Gleitman, 2004). At the far extreme are mental state verbs, whose meanings are hidden from observers
360 Jeffrey Lidz because their contents reside inside the mind. For such verbs, the syntactic environment is highly informative both as evidence about the verb in a single sentence and about the verb across many different sentence contexts (Papafragou, Cassidy, and Gleitman, 2007; Hacquard and Lidz, 2019).
17.5 Syntax and event reference As noted above, the meaning of a sentence/verb reflects the perspective of the speaker on an event. Events are not defined simply by their physical properties, but rather by a perspective. For example, a scene in which a bunny runs and follows the path of an elephant ahead of him might be described in either of the two ways in (4): (4)
a. The bunny is chasing the elephant. b. The elephant is fleeing the bunny.
As physical events, chasings and fleeings are indistinguishable. Where they differ is in the perspective the speaker takes toward them. A transitive clause that identifies the agent and patient as subject and object can therefore provide a learner with the appropriate perspective on the event to make the clause and verb meanings more accessible. Hearing a sentence like (4a) prior to knowing the verb's contribution to meaning would allow the learner to adopt a perspective on the event that makes the bunny the agent. Hence, the verb in that sentence should describe the event as a chasing. By the same token, hearing (4b) would provide a perspective in which the elephant is the agent, hence a fleeing but not a chasing (Fisher, Hall, Rakowitz, and Gleitman, 1994; Nappa, Wessel, McEldoon, Gleitman, and Trueswell, 2009). In support of this theory, numerous studies have shown that infants can use the syntactic structure of a sentence as evidence about what event it describes, and consequently what the verb in that sentence means. For example, Naigles (1990) presented 25-month-olds with a novel verb in the context of a complex scene with two parts: a causal part in which a duck pushes a bunny over, and a non-causal part in which the duck and bunny each wheel their arms independently. While they watched this scene, the children heard a novel verb used either transitively (The duck is gorping the bunny) or intransitively (The duck and the bunny are gorping). Naigles then separated these two parts of the scene into two different videos, each showing only one of the two parts. She measured infants' looking preferences when they were asked to "find gorping" as a function of whether they were initially familiarized to the novel verb in a transitive or an intransitive clause. Infants who had heard the transitive looked longer at the pushing scene, and infants who heard the intransitive looked longer at the arm-wheeling scene. Infants were thus sensitive to the syntactic frame of the novel verb, inferring that gorp in a transitive frame was more likely to label the causal event, whereas gorp in an intransitive frame was more likely to label the non-causal event.
Children’s Use of Syntax In Word Learning 361 This basic finding has been reproduced in various ways, confirming that infants as young as 22 months are sensitive to transitivity, and will reliably infer that a novel transitive verb labels a causal event (Arunachalam and Waxman, 2010; Arunachalam and Dennis, 2018; Brandone, Addy, Pulverman, Golinkoff, and Hirsh-Pasek, 2006; Fisher, Gertner, Scott, and Yuan, 2010; Noble et al., 2011; Pozzan, Gleitman, and Trueswell, 2016; Yuan and Fisher, 2009; Yuan, Fisher, and Snedeker, 2012). Moreover, children are able to draw this inference on the basis of distributional information alone. Yuan and Fisher (2009) familiarized 28-month-olds with short dialogues containing novel transitive or intransitive verbs, without any informative visual context. At test, infants were then asked to identify the referent of the novel verb (e.g., Find blicking!) while viewing two candidate events, one causative and one non-causative. Infants who had heard the transitive dialogues looked longer at the causative event than infants who had heard the intransitive dialogue. This indicates that they had tracked the syntactic properties of the novel transitive verb and used those properties to draw inferences about its possible meanings, even without the support of referential context. However, beyond Naigles’ (1990) seminal study, further work has found inconsistent behavior with intransitive clauses. Infants who hear novel verbs in intransitive frames do not show a reliable preference for events intended to be viewed with one participant as opposed to two (e.g., Arunachalam and Waxman, 2010; Noble, Rowland, and Pine, 2011; Yuan et al., 2012). Several methodological explanations have been proposed to account for these variable results with intransitive clauses. First, many studies use intransitive sentences with conjoined subjects (e.g., The duck and the bunny are gorping) in order to control the number of nouns across conditions. It is possible that infants may not reliably perceive these sentences as intransitive: if they mistake the conjoined subject for two separate arguments, this might lead them to infer a causative meaning for the verb (Gertner and Fisher, 2012; Yuan et al., 2012). Alternatively, it is possible that infants do not reliably perceive the presented scenes under the intended event representation. If infants conceptualize a scene of one actor pushing another as an event of two actors playing, then they might consider the intended “two-participant” scene a good referent for a novel intransitive verb (Brandone et al., 2006; Pozzan et al., 2015).
17.5.1 Explaining the effects of transitivity: one-to-one matching vs. thematic linking Results showing that infants reached different interpretations of verbs presented in transitive vs. intransitive clauses have been used to support an influential hypothesis about how infants use the syntactic contexts that verbs appear in to draw inferences about their meanings. Under this hypothesis, infants take the noun phrases in a clause to be arguments, and expect the number of arguments in a clause to match one-to-one the number of participants in the event the clause describes (Fisher, 1996; Gleitman, 1990; Naigles, 1990; Lidz, Gleitman, and Gleitman, 2003; Fisher et al., 2010). Thus,
362 Jeffrey Lidz a transitive clause with two arguments should label an event perceived with two participants, whereas an intransitive clause with only one argument should label an event perceived with one participant. This is a potentially powerful learning strategy for infants at early stages of syntactic development because it requires very little syntactic knowledge. In order to narrow down the candidate events that a clause might refer to, infants need only to identify the number of nouns or noun phrases in the clause, and do not need to identify their thematic roles or hierarchical position in the clause. On the other hand, there are several reasons to question whether this simple matching strategy is the appropriate way to capture effects of transitivity on verb meaning. First, it is not generally true of languages that participants and arguments stand in one-to-one correspondence. Verbs that describe events that entail three participants often only require two arguments: (5) a. Jesse robbed the train. b. Bonnie stole the money. c. Clyde took the jewels. In these examples, the events seem to have three main participants (the thief, the loot, and the victim), but these verbs are simple transitives, regularly occurring in clauses with only two arguments. Thus, a child guided by a one-to-one matching strategy might be expected to learn meanings for these verbs that entail only two participants. It is worth pausing for a moment to ask how one would know whether these examples are problematic for a bootstrapping theory based in one-to-one correspondence. Is it relevant that robbings entail loot if sentences with the verb rob do not require this participant to be named? The answer to this question depends on the event concepts that such events are viewed under, independent of language. If the loot in a robbery is entailed but not foregrounded as a participant in our event representations, then the fact that rob takes two syntactic arguments is not a problem. But if the loot is foregrounded as a participant, then the one-to-one-correspondence theory would predict that infants would have difficulty acquiring such a verb, since the two-argument syntax would not align with the three-participant concept. Moreover, it is not sufficient to note that rob is a transitive verb to decide that the loot is not a participant in a conceptual representation, as this would be question begging. One cannot take transitivity as evidence for conceptual adicity and also argue for a bootstrapping theory based in one-to-one correspondence; doing so presupposes the conclusion that it argues for. Thus, one would ultimately want independent evidence for the conceptual representations in order to test whether the one-to-one correspondence theory was correct (Williams, 2015; Wellwood, He, Lidz, and Williams, 2015). Second, cross-linguistically speaking, these kinds of mismatches between participant relations and argument relations are common. As an extreme example, nearly every St’át’imcets verb can occur in an intransitive clause (Davis, 1997; Davis and Demirdache, 2000; Davis, 2010). Even verbs that seem to entail three participant roles, like the verb in (6), can occur in intransitive clauses without losing those entailments:
(6) Q̓amt kwskwim̓cxen
hit.with.projectile det.NAME
'Kwim̓cxen got beaned.'
The sentence in (6) is a basic intransitive clause, with neither null arguments nor valency-reducing morphology like passive or middle. A child expecting one-to-one correspondence between participants and arguments would thus be severely misled by (6) about the meaning of the verb in that clause (Williams, 2015). Because the one-to-one matching heuristic does not reflect a basic fact about the languages of the world, or about verbs in general, a bootstrapping theory using it as a basis will require learners to abandon it at some point in development as they acquire a theory of linking that is more richly structured. So, if the one-to-one matching heuristic is correct, we will need an additional theory detailing how it is abandoned in development, and on what basis. Further, infants seem to have rich syntactic representations of clause structure and the relation between argument position and thematic relation from as young as 16 months. Lidz, White, and Baier (2017) asked what infants know about the mapping between the syntactic position of a Noun Phrase (NP) and its thematic interpretation. They presented events in which an agent used an instrument to affect an object. For example, infants saw events in which a hand used a ruler to tap a traffic cone. While they saw these events, they heard either a simple transitive clause containing a novel noun in the direct object position (She's hitting the tam) or an intransitive clause containing a novel noun inside an instrumental Prepositional Phrase (She's hitting with the tam). After several exposures to these sentences containing the novel noun, they were then shown the two objects (i.e., the ruler and the cone) and were asked, "Which one is the tam?" Sixteen-month-olds looked more at the cone in the transitive condition and they looked more at the ruler in the Prepositional Phrase condition. These results indicate that by 16 months, infants know how to identify the thematic role of an NP based on its syntactic position. Because infants drew different conclusions about the referent of the novel NP (and hence the meaning of the novel noun) as a function of its syntactic position, we can conclude that they build syntactic representations that contain more information than simply the number of nouns in the clause. In turn, this conclusion suggests that effects of clause structure in verb learning experiments might be driven by a richer representation of subject and object and the thematic consequences of these representations. Perkins (2019) spells this idea out more fully, suggesting that the effects of transitivity could be explained by infants' initial expectations about thematic linking. Specifically, infants might begin language learning with an expectation that subjects of transitive clauses label agents and objects of transitive clauses label patients (Baker, 1988; Dowty, 1991; Fillmore, 1970; Jackendoff, 1972; Pinker, 1984). If infants could identify the subject and object in a transitive clause, they could then infer that the clause labels not just any event seen as having two participants, but one in which the referent of the subject is the agent and the referent of the object is the patient. This kind of learning mechanism would support acquisition in the same way, independent of whether the language being
364 Jeffrey Lidz acquired was like English or St’át’imcets. And it would allow for a continuous theory of the relation between argument structure and interpretation across development. The idea that early verb learning is driven by expectations about thematic linking is supported by several empirical results. Gertner, Fisher, and Eisengart (2006) tested 24-and 21-month-olds’ abilities to link syntactic position to thematic relation using a preferential looking task. Infants heard a transitive sentence (e.g., The duck is gorping the bunny) in the context of two causative scenes: one in which a duck pushed a bunny, and one in which the bunny pulled the duck. Both groups of infants looked preferentially at the scene in which the duck was the agent, indicating that they knew that the subject of a transitive clause labels the agent rather than the patient of a causal event. Furthermore, infants preferred the duck- agent and bunny-patient event even for sentences like He is gorping the bunny: here, they could only rely on the referent of the object because the subject does not identify a unique referent in the discourse. This indicates that infants knew that the object of a transitive clause labels the patient rather than the agent of a causal event. These infants were able to exploit relationships between argument position (subject vs. object) and argument roles (agent vs. patient) in order to constrain the inferences they draw about transitive verb meanings. For intransitive verbs, these relationships are more complicated: the subject of an intransitive clause can label either an agent (e.g., John baked) or a patient (e.g., The bread rose). These subclasses of intransitives also display differences in meaning: intransitives whose subject is an agent tend to label actions of that agent, whereas intransitives whose subject is a patient tend to label changes undergone by that patient (e.g., Fillmore, 1970; Levin and Hovav, 2005; Williams, 2015). Another line of work has asked whether children can draw these finer-grained inferences about verb meanings on the basis of the animacy and thematic role of the intransitive subject (Bunger and Lidz, 2004, 2008; Naigles, 1996; Scott and Fisher, 2009). For example, Bunger and Lidz (2004) familiarized 24-month-old infants to an event in which a girl bounced a ball with a tennis racquet, while they heard one of four types of linguistic input: transitive (The girl is pimming the ball), unaccusative (The ball is pimming), multiple frame (The girl is pimming the ball, The ball is pimming), or a no-word control (Hey, look at that). At test, they were asked, “Where’s pimming now?” while seeing the event broken into two parts—the girl patting the ball (but with no bouncing) or the ball bouncing on its own. In the transitive and no-word conditions, infants showed no preference at test. This suggests that the transitive clause by itself is not sufficient to identify the event as a contact event (hitting) or a change of state event (bouncing). However, in the unaccusative and multiple frame conditions, infants looked more at the bouncing event. This suggests that infants know that an intransitive clause with an inanimate subject is likely to label an event describing a change to that argument.1 Thus, infants seem to be aware of the thematic relations associated with arguments in different syntactic positions. 1 One
should be cautious here about the identification of unaccusative clauses. Obviously, unaccusative verbs can take both animate and inanimate subjects, and there is no guarantee that an intransitive clause occurring with an inanimate subject has an unaccusative verb. Nonetheless, subject animacy is a strong probabilistic cue to the classification of intransitive verbs (Scott and Fisher, 2009; Becker, 2014).
Scott and Fisher (2009) found a similar pattern. They familiarized 28-month-olds with a dialogue in which a novel verb alternated between transitive and intransitive uses. Infants either heard the intransitive with an animate subject (e.g., Matt dacked the pillow. He dacked), cueing unergativity, or an inanimate subject (e.g., Matt dacked the pillow. The pillow dacked), cueing unaccusativity. At test, infants heard the verb in a transitive frame in the context of two scenes: a caused-motion event in which a girl pushes a boy over, or a contact-activity event in which the girl dusts the boy with a feather duster. Infants who were exposed to the animate-subject intransitive dialogue preferred to look at the contact-activity event, whereas infants who were exposed to the inanimate-subject dialogue preferred to look at the caused-motion event. These infants were able to use cues to the thematic role of the intransitive subject, such as its animacy, to infer whether the novel verb labeled the action of an agent or the change undergone by a patient. Perkins (2019) further distinguished this thematic linking hypothesis from the one-to-one matching hypothesis discussed earlier by exploring whether infants would allow a three-participant event to be described with a two-argument description. As noted above, in order for such an event to be a useful probe into these alternative bootstrapping theories, it is important to have an independent measure of the number of participant relations in the event concept. So, prior to this verb learning study, Perkins, Knowlton, Williams, and Lidz (2018) explored the event representations of 10-month-olds. Infants were habituated to an event in which a woman picks up a truck in an arcing motion while a man sits idly near the truck. After habituation, infants saw one of two changes to the event: a participant change or a manner change. In the participant change, the man has his hands on the truck prior to the woman picking it up. In the manner change, the woman slides the truck into her possession rather than picking it up. The physical difference between the habituation and test events was larger in the manner change than in the participant change. Infants showed greater dishabituation to the participant change than to the manner change, suggesting that they viewed the man as a participant in the event. Perkins (2019) therefore used the event in which the woman takes the truck from the man as a three-participant event in a verb learning study with 20-month-olds. This "taking" event was labeled with a novel verb either in a transitive clause (She's gonna pim the truck) or in an intransitive clause (The truck is gonna pim). After familiarization, infants were shown two different events: one in which the girl takes the truck from the boy vs. one in which she moves the truck in the same way, but with no boy present. The critical test is the transitive condition. If children expect clausal arguments and event participants to align one-to-one, then we would predict that the infants in the transitive condition would not naturally link "pim" to the three-participant event concept under which they see the event. Instead, this theory predicts that they would take the verb to label a two-participant event concept, something like move or slide. Thus, both test events should be equally good exemplars of pimming, predicting no difference in the test condition.
If, on the other hand, infants take the subject to be the agent of the pimming event and the object to be the patient of the three-participant concept that they
naturally see the event under, then they should think that the taking event is the only exemplar of pimming, and hence look more to that event. This is indeed what happened. Infants performed in line with the predictions of the thematic linking hypothesis and against the predictions of the one-to-one matching hypothesis. In the intransitive condition, both hypotheses predict no preference between the two events, and this is what was found. The intransitive results also rule out the possibility that infants in the transitive condition simply chose the more familiar video. If that were the explanation for the transitive condition, the same pattern would have been seen in the intransitive condition, contrary to fact. In sum, infants before their second birthday show an impressive ability to use information about clause structure to provide them with a perspective on an event, and hence to zoom in on relevant features of a verb's meaning. These abilities are driven by an early appreciation of the link between grammatical relations like subject and object and the thematic relations borne by the NPs in those positions. Infants as young as 20 months are able to identify the subject and object of a clause, to appreciate the thematic relations borne by the NPs in those positions, and to use that information to identify the meaning of a novel verb in that clause.
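The contrast between the two hypotheses can be made concrete with a small sketch (Python; the clause and event encodings below are simplified stand-ins invented for illustration, not an implementation of Perkins, 2019). One-to-one matching filters candidate event concepts by participant count alone, whereas thematic linking asks only that the subject's referent be the agent and the object's referent the patient.

```python
# Hypothetical, simplified encodings of Perkins-style test items.
# A clause records its arguments; an event concept records its participants
# and who is agent/patient. All values are stand-ins for illustration only.

taking_event = {"participants": {"woman", "truck", "man"},
                "agent": "woman", "patient": "truck"}
moving_event = {"participants": {"woman", "truck"},
                "agent": "woman", "patient": "truck"}

transitive_clause = {"subject": "woman", "object": "truck"}  # "She's gonna pim the truck"

def one_to_one_match(clause, event):
    """Candidate only if clausal arguments and event participants align one-to-one."""
    n_args = len([a for a in (clause.get("subject"), clause.get("object")) if a])
    return n_args == len(event["participants"])

def thematic_linking_match(clause, event):
    """Candidate if the subject names the agent and the object names the patient."""
    return (event["agent"] == clause.get("subject")
            and event["patient"] == clause.get("object"))

for name, event in [("taking", taking_event), ("moving", moving_event)]:
    print(name,
          "one-to-one:", one_to_one_match(transitive_clause, event),
          "thematic:", thematic_linking_match(transitive_clause, event))
# One-to-one matching excludes the three-participant taking concept, forcing a
# two-participant construal that fits both test events (no predicted preference);
# thematic linking keeps the taking concept available, so an infant who sees the
# familiarization event as a taking should prefer the taking event at test.
```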
17.6 Distributional profiles, semantic features, and propositional attitude verbs To this point, we have seen that infants can use observations about a verb's syntactic environment to help identify which event in the world the verb labels. However, many sentences are not used to label events. As noted at the outset, speakers do more with language than simply describe the world around them. And even when they do describe what's going on, much of what they are talking about leaves no detectable trace in the physical world that would serve as the "referent" of the expression. The very same scene might elicit any of the following utterances: (12)
a. April is walking quickly down the street.
b. April is going home.
c. April is late.
d. April wants to get home soon.
e. I think April is late for dinner.
f. They expected April to be home already.
g. Remember when we went sailing?
The differences between these sentences have little to do with what's happening at the moment of utterance, but rather with what thoughts the physical event triggers in the speaker and what the conversational goals of the speaker happen to be in producing the sentence. Of particular interest are sentences like (12d) and (12e), with propositional attitude verbs like want and think. Propositional attitude verbs describe the contents of people's minds, such as beliefs or desires about possible states of affairs. Such verbs present a special challenge to learners because they name internal states of speakers' minds (Gleitman, 1990; Gleitman, Cassidy, Nappa, Papafragou, and Trueswell, 2005). A large body of literature has argued that young children may have difficulty acquiring attitude verbs because they lack the mental state concepts that these verbs label; in particular, children fail in certain tasks to demonstrate the ability to represent others' beliefs (the so-called developing Theory of Mind, e.g., Astington and Gopnik, 1991; Flavell, Green, and Flavell, 1990; Gopnik and Wellman, 1994; Perner, 1991). However, more recent work finds that children's failure on these tests may be due to experimental and pragmatic factors rather than immature belief representations (e.g., Hansen, 2010; He, Bolz, and Baillargeon, 2012; Helming, Strickland, and Jacob, 2014; Lewis, Hacquard, and Lidz, 2017; Onishi and Baillargeon, 2005; Rubio-Fernández and Geurts, 2012). But even if children do have the ability to represent speakers' mental states, learning which verbs label these mental states is still no trivial matter. It is difficult to tell when mental states rather than actions are under discussion: if a speaker uses a new verb, how does a child know whether the verb labels what someone is feeling or what someone is doing? In the human simulation study by Gillette et al. (1999), adults were particularly bad at identifying attitude verbs from the situations in which they were uttered; they could occasionally identify action verbs like hit, but almost never identified attitude verbs like think, know, and want. This difficulty may be caused by the low conversational salience of beliefs and desires (Papafragou, Cassidy, and Gleitman, 2007; Dudley, 2017). Papafragou et al. (2007) showed that false belief contexts increase the situational salience of beliefs and the likelihood that children and adults guess that a novel verb labels a belief. This result suggests that false belief uses would be particularly informative contexts for learning that a verb labels the belief concept. However, Dudley, Rowe, Hacquard, and Lidz (2017) show that approximately 75% of uses of think in child-directed English occur with first-person subjects in the present tense (see also Diessel and Tomasello, 2001), contexts which do not lend themselves to false belief uses. These uses are more commonly indirect assertions, whose conversational function is to proffer the content of the embedded clause, potentially hiding even further the meaning of the verb think. Similarly, the verb know is most often used in indirect questions such as "Do you know where my keys are?", where the illocutionary force of the entire sentence is to ask about the location of the keys, not about the addressee's knowledge. Attitude verbs do have a reliable syntactic signal to their meaning, however. Such verbs take full clauses as complements, whereas action verbs do not: (13)
a. Kim thought that Chris liked her. b. *Kim danced that Chris liked her.
368 Jeffrey Lidz Therefore, even though children may have difficulty identifying attitude verbs from the situational contexts in which they are used, they might be able to identify them through their syntactic distribution—specifically, by paying attention to which verbs take clausal complements (Fisher, Gleitman, and Gleitman, 1991; Gleitman et al., 2005; Papafragou et al., 2007). Furthermore, differences in the clausal complements of attitude verbs might help children tell certain attitude verbs apart from each other. Attitude verbs fall into two major classes: representational and preferential (Bolinger, 1968). The representational verbs, such as think, know, or say, express judgments of truth and present a picture of the world. The preferential verbs, such as want or demand, convey preferences about how the world ought to be. Cross-linguistically, these two classes of attitude verbs also differ in the properties of their clausal complements (Bolinger, 1968; Farkas, 1985; Giannakidou, 1997; Hooper, 1975; Villalta, 2008). In English, this difference is reflected in the tense (finiteness) of the complement. Preferential attitude verbs like want tend to occur with nonfinite complements: (14)
a. I want Jo to be at home. b. *I want that Jo is at home.
By contrast, representational attitude verbs like think tend to occur with finite complements: (15)
a. I think that Jo is at home. b. *I think Jo to be at home.
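As a rough illustration of how this complement-finiteness cue could be exploited, the sketch below (Python, with invented counts; a toy illustration, not the computational model of White et al., 2018b, discussed below) classifies a verb as representational or preferential according to the proportion of its clausal complements that are finite.

```python
# A deliberately simplified sketch of using complement finiteness as a cue to
# attitude-verb class. The counts are invented for illustration; real input
# distributions, and the features that matter, vary by language.

from collections import Counter

# Hypothetical counts of complement types observed with each verb.
complement_counts = {
    "think": Counter({"finite": 46, "nonfinite": 1}),
    "know":  Counter({"finite": 38, "nonfinite": 4}),
    "want":  Counter({"finite": 1,  "nonfinite": 52}),
    "hope":  Counter({"finite": 12, "nonfinite": 9}),
}

def classify(counts, threshold=0.5):
    """Representational if most clausal complements look like declarative main
    clauses (proxied here by finiteness, as in English); otherwise preferential."""
    finite_rate = counts["finite"] / sum(counts.values())
    return "representational" if finite_rate > threshold else "preferential"

for verb, counts in complement_counts.items():
    print(verb, classify(counts))
# think and know come out representational, want preferential; a verb like hope,
# which occurs with both complement types, is the borderline case that the
# experiments described below exploit.
```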
For syntactic bootstrapping to work in the domain of attitude verbs, subclasses of attitude verbs must have different syntactic distributions. In addition, it must be that these differences can be linked to cross-linguistically stable properties, so that it is possible for learners to link the distributional differences to those aspects of meaning that explain them. Finally, learners must be sensitive to the relevant distributional features and use them to make inferences about meaning. Each of these conditions appears to hold. White, Hacquard, and Lidz (2018a) tested whether subclasses of attitude verbs reliably show different distributional signatures. Building on a method pioneered by Fisher et al. (1991), these researchers collected two kinds of judgments. First, they collected syntactic acceptability judgments for a set of 30 attitude verbs in 19 syntactic environments, which were used to identify subclasses of verbs based on the similarity of judgments across all 19 environments. Second, they collected semantic similarity judgments in sets of three for all 30 verbs. Putting these together, they showed that the verb similarities identified in the semantic similarity task are highly predictive of the similarities identified in the syntactic acceptability judgment task. Together, these results indicate that attitude verbs with similar meanings show similar syntactic distributions. In addition, these patterns of distribution relate to a principled feature of the syntax-semantics mapping. Although languages differ in the particular features that distinguish
belief verbs from desire verbs, there is a higher-order generalization that links these classes to the syntax. Specifically, representational attitude verbs take complements that have syntactic features of declarative main clauses (Hacquard and Lidz, 2019). The specific features distinguishing declarative main clauses from other kinds of clauses vary from language to language.2 In English, these features include finite tense, lack of a complementizer, presence of subject/verb agreement, and obligatory overt subjects, among others. In German, finiteness is a weaker cue, but verb-second word order is a more reliable indicator. Similarly, in French finiteness is not a good indicator of declarative main clauses, but indicative mood is. When the features of declarative main clauses occur in embedded clauses, they are good predictors of the kind of attitude verb that embeds them. Thus, the abstract link between declarative main clauses and the complement of belief verbs appears to be cross-linguistically stable, though a more thorough investigation of the range of variation, drawing from a wider sample of languages, remains to be undertaken. The link between representational attitudes and declarative main clauses may be further explained by the pragmatic function to which declarative clauses are put. Declarative main clauses are used to make assertions, and when representational attitude verbs are used to make indirect speech acts, these are indirect assertions. Recognizing the similarity between direct and indirect assertions and the formal similarity between declarative main clauses and the complements of representational attitudes may play a key role in explaining why the abstract connections between verb meaning and complement type hold (see Hacquard and Lidz, 2019, for elaboration). In order for the link between complement type and attitude verb meaning to be useful in acquisition, though, learners must be able to identify the relevant features from their input, as they acquire attitude verbs. White et al. (2018b) built a computational model showing that it is possible to identify these features and consequently to learn which attitude verbs are belief verbs and which are desire verbs. Their model identifies the morphosyntactic features of declarative main clauses. The model then looks for these features inside complement clauses to classify the embedding verbs. This model successfully figures out how to divide attitude verbs into representational and preferential subclasses in English. Crucially, it does so not by relying on specific morphosyntactic properties, but rather via more abstract expectations about verb classes, whose expression can be discovered depending on the surface features of the language being acquired. Huang, Liao, Hacquard, and Lidz (2018) successfully extend this analysis to Mandarin, a language with a sparser set of morphosyntactic cues. Finally, children are sensitive to the relevant features in their acquisition of attitude verbs. Harrigan, Hacquard, and Lidz (2019) tested whether four-year-olds use the finite/nonfinite complement distinction as evidence about the meaning of attitude verbs. They
2 It is worth keeping in mind that the representational/preferential split does not exhaust the subclasses of attitude verbs.
There are many further subclasses, which also display systematic relations between meaning and syntactic distribution (see Hacquard and Wellwood, 2012; White, Hacquard, and Lidz, 2018a; White and Rawlins, 2019; Djarv, 2019; Wurmbrand and Lohninger, 2020).
370 Jeffrey Lidz probed four-year-olds’ understanding of hope. Hope is relatively uncommon in child- directed speech; it should thus be familiar enough to four-year-old children for them to know that it is an attitude verb, but perhaps not enough for them to be sure about its meaning. Hope is also relevant because it can occur with both finite and nonfinite complements, and also shows properties of both representational and preferential attitudes (Portner, 1992; Scheffler, 2008; Anand and Hacquard, 2013). For example, while a hope is a kind of desire, one cannot hope for things that they know to be incompatible with truth (16). (16)
a. Kim knows it is snowing but wants it to be sunny. b. *Kim knows it is snowing but hopes it is raining.
The experimental setup in Harrigan et al. made both the beliefs and desires of a puppet, Froggy, salient, and tested whether the syntactic shape of the complement influences children’s interpretation of hope. In this game, the child and one experimenter are behind an occluder, while Froggy is on the other side. In front of the child is a box with 40 wooden hearts and stars, which are either red or yellow. Color is predictive of shape: 15 of the hearts are red, five are yellow, and 15 of the stars are yellow, five are red. The child and the experimenter pull shapes out of the box to show Froggy, and every time the shape is a heart, the child gives Froggy a sticker. Froggy likes getting stickers, therefore his desire on every trial is that the shape be a heart. On each trial, before Froggy sees what the shape is, the child and the experimenter show him a “clue,” which is ambiguous as to shape but not color, by inserting a point (of the heart or the star) through an opening in the occluder (see Figure 17.1). Thus, on every trial, Froggy has both a desire about shape (he always wants the shape to be a heart), and a belief about shape (when it is red, he always guesses that it’s a heart and when it is yellow, that it’s a star). Another puppet, Booboo, whom the child is told is “silly and wants to learn how to play the game, but often gets things mixed up,”
Figure 17.1 Experimental set up of Harrigan et al. (2019). (The original figure depicts the occluder, Froggy, Booboo, the child, and the two experimenters (Exp. 1, Exp. 2), with speech bubbles: Child: "It's red, so I bet it's a heart!"; Booboo: "Froggy thinks that it's a heart!"; Exp. 2: "Booboo, you're right.")
Children’s Use of Syntax In Word Learning 371 utters test sentences about what Froggy wants (17), thinks (18), or hopes (19–20). The child’s task is to say whether Booboo is right. (17)
Froggy wants it to be a heart/star.
(18)
Froggy thinks that it’s a heart/star.
(19)
Froggy hopes to get a heart/star.
(20) Froggy hopes that it's a heart/star.
The results reproduced the traditional split in performance with think and want. Children correctly judge want sentences even when the reported desire conflicts with reality, but they tend to incorrectly reject think sentences that report false beliefs. Crucially, children's responses to hope sentences differ depending on the syntactic frame in which they are presented. With a finite complement, their responses pattern like their responses to think sentences. However, with a nonfinite complement, their responses pattern like their responses to want sentences. These results reveal that children use complement syntax to identify aspects of an attitude verb's meaning: four-year-olds treat hope as a preferential attitude verb when it takes a nonfinite complement, and a representational attitude verb when it takes a finite complement. One potential concern with this finding is that it depends on children's performance with a real verb (though one that they likely have very little experience with) and uses the false belief error as evidence of the verb's representationality. To get a more direct demonstration of the role of syntax in learning attitude meanings, Lidz et al. (2017) tested four-year-olds' understanding of a novel attitude verb, in an experiment building on Asplin (2002). In a representative story, Dad (the discoverer) leaves chicken legs on the kitchen table and goes to work. Fido (the actor) arrives hungry but can't find any food. Jimmy (the enticer) arrives and entices Fido to eat the chicken legs, puts them in Fido's bowl, and leaves. Fido decides to eat the chicken. When Dad returns, he finds his chicken legs gone and chicken bones in Fido's bowl, thereby discovering that Fido ate his chicken. The enticer has a desire (for Fido to eat the chicken), while the discoverer has a belief (that Fido ate the chicken). After each story, two puppets deliver test sentences as descriptions of the story using the same novel verb and complement syntax (finite vs. nonfinite), but with different subjects, and participants are asked to choose the puppet who said something true about the story:
(21) Finite Condition
Puppet 1: Dad gorped that Fido ate the chicken.
Puppet 2: Jimmy gorped that Fido ate the chicken.
(22) Nonfinite Condition
Puppet 1: Dad gorped Fido to eat the chicken.
Puppet 2: Jimmy gorped Fido to eat the chicken.
372 Jeffrey Lidz Preferential attitudes (e.g., enticings in this context) take nonfinite complements in English; representational attitudes (e.g., discoveries in this context) take finite complements in English. If children are sensitive to this mapping, they should pick the discoverer when they hear a sentence with a finite complement, but the enticer when they hear a sentence with a nonfinite complement. This is what we find. In the finite condition, children were significantly more likely to pick the puppet that mentioned the discoverer, but in the nonfinite condition, they were significantly more likely to pick the one that mentioned the enticer. In this section, we have seen that different classes of attitude verbs show different distributional profiles, that these distributional profiles are related to a cross-linguistically stable, and hence principled, mapping between syntax and semantics, and that children are sensitive to the relevant features of syntax in acquiring attitude verbs. More generally, we’ve seen that syntactic structure and syntactic distribution play two important roles in the acquisition of verb meanings. First, the syntactic structure of a sentence can help a learner to identify a perspective on an event, allowing them to fix the referential intentions of a speaker and to identify the meaning of an unknown verb in that sentence. Second, the syntactic distribution of a word is related to its meaning in principled ways, allowing learners to identify semantic properties of a word’s meaning by observing relevant syntactic features.
17.7 Syntactic bootstrapping beyond verbs While syntactic bootstrapping effects have been predominantly associated with verb learning, syntactic environment is a cue to word meaning across several classes. The syntactic category of a word in many cases is sufficient to restrict its possible meaning. For example, Waxman and Booth (2009) show that infants as young as 14 months treat a novel word presented as an adjective as referring to an object property but treat a novel noun as referring to an object kind. They presented infants with four objects that were alike in both their kind (horses) and an accidental property (purple). One group of infants was introduced to each of these objects with a novel noun (This is a blicket). Another group was introduced to them with a novel adjective (This is a blickish one). They were then given a choice either between two objects that shared a kind but differed in color (yellow horse vs. purple horse) or between two objects that shared a color but differed in kind (purple plate vs. purple horse). They found that 14-month-olds treated nouns as a label for object kind. When asked to find "another blicket," they chose the purple horse over the purple plate, but showed no preference in the case of two horses. And they found a different pattern with adjectives. Infants who were asked to find "another blickish one" chose the purple horse over the yellow horse, and the purple horse over the purple plate. Similarly, Fisher, Klingler, and Song (2006) show that 26-month-old children treat a novel preposition as labeling a spatial configuration, but they treat a novel noun in the same
context as labeling an object (cf. Landau and Stecker, 1990). Children in this study were shown a duck on top of a box and heard either "This is a corp" (noun condition) or "This is acorp my box" (preposition condition). They were then shown either a new duck on top of the box or a new duck and a pair of glasses on top of the box and were asked, "What else is acorp (my box)?" Children in the noun condition looked more at the new duck by itself whereas those in the preposition condition looked more at the scene with the duck and glasses. This suggests that they learned that the novel noun labeled the duck and that the novel preposition labeled the spatial configuration. Finally, Wellwood, Gagliardi, and Lidz (2014) show that 3-year-olds treat a novel superlative presented as a determiner as having a number-based meaning, but they treat a novel superlative adjective as having a property-based meaning. Here, participants were shown several scenes in which there were two sets of cows, one set by the barn and one set in the meadow. The cows by the barn were always more numerous than the cows in the meadow, and the cows by the barn had more spots on them than the cows in the meadow. As they saw these scenes, children were told either that a picky puppet liked it better when "gleebest of the cows are by the barn" (determiner condition) or that he liked it better when "the gleebest cows are by the barn" (adjective condition). Then they were asked about scenes in which the numerosity of cows and the numerosity of spots were dissociated, so that the cows by the barn were more numerous but less spotted, or less numerous and more spotted. The children in the determiner condition learned that the puppet likes it when the larger set of cows is by the barn, whereas those in the adjective condition learned that the puppet likes it when the more spotted (but less numerous) cows are by the barn.3 Again, syntactic category has a powerful influence on the kinds of meanings that learners assign to novel words. In each of these cases, patterns of extension for a novel word were determined by the syntactic category of the novel word, an illustration of how syntactic information can shape the perspective that a learner takes on a scene in learning a novel word. This effect of syntax has been studied most widely in the domain of verb learning, but the potential for learners to exploit information about syntax is present in any area of grammar where there are systematic relations between a word's syntactic (sub)category and its meaning.
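One compact way to summarize the pattern reviewed in this section is as a default mapping from syntactic category to a hypothesis space of candidate meanings. The sketch below (Python) is purely a summary device; the category labels and domain descriptions are invented shorthand for the findings above, not a model proposed in any of the studies cited.

```python
# Hypothetical summary of the category-to-meaning-domain biases reviewed above.
# Labels and domain descriptions are illustrative shorthand, not theoretical claims.

DEFAULT_HYPOTHESIS_SPACE = {
    "noun":        "object kind (e.g., horses)",
    "adjective":   "object property (e.g., purple things)",
    "preposition": "spatial configuration (e.g., on top of the box)",
    "determiner":  "number/quantity-based relation (e.g., the more numerous set)",
    "verb":        "event or state, further narrowed by clause structure",
}

def candidate_meaning_domain(category):
    """Return the default meaning domain a learner might entertain for a novel
    word, given only its syntactic category."""
    return DEFAULT_HYPOTHESIS_SPACE.get(category, "unconstrained")

print(candidate_meaning_domain("adjective"))  # -> object property (e.g., purple things)
```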
17.8 Origins of syntactic bootstrapping While the primary purpose of this chapter is to show that syntactic distribution provides information about a novel word's meaning and that children are sensitive to this information, it is important to also ask about the origins of these abilities.
3 This effect is not likely to be an analogy to most. The earliest reported success with most is at age 3;6 (Halberda, Taing, and Lidz, 2008) and many studies do not reveal knowledge of most until much later (Barner, Chow, and Yang, 2009; Papafragou and Schwarz, 2005; Sullivan, Boucher, Kiefer, Williams, and Barner, 2019). Moreover, since most is grammatical in both frames, the different meanings assigned in the different frames cannot be achieved by drawing a simple analogy to the distribution of most.
There are two important questions to ask in this domain. First, to what extent do children's abilities to use syntactic features as evidence for word meanings reflect architectural properties of the language faculty as opposed to learned generalizations? Second, given that children are not born knowing the syntax of their language, how do they acquire enough syntax to allow these kinds of bootstrapping effects to manifest? We take up these questions briefly in turn. Let us first consider the possibility that children learn some words in a distributional category via some non-syntactic method and then treat other words in that category as being semantically similar to those. On this view, studies showing that children use syntax as a cue to meaning reveal statistical generalizations that are acquired only after the meanings and syntactic properties of some small subset of words in that category are acquired. There are several reasons to be skeptical toward such an account. First, Lidz, Gleitman, and Gleitman (2003) examined learners of Kannada in a task where known verbs were put into syntactic environments that do not occur in speech to children. They found that learners extended the meanings of such verbs based on features related to grammatical architecture rather than to more reliable statistical cues in the language. This suggests that the role that syntax plays in cueing meaning does not simply derive from statistical features of the environment, but rather from properties internal to the child (see also Trueswell, Kaufman, Hafri, and Lidz, 2012; Pozzan and Trueswell, 2016). Second, recall the effect of syntactic structure in cueing thematic relations in Lidz, White, and Baier (2017). Sixteen-month-olds who heard a novel NP in the direct object position interpreted that NP as referring to the patient of the event, whereas those who heard the novel NP as a prepositional object interpreted it as referring to the instrument of the event. These authors found that the effect of syntactic position is negatively related to children's knowledge of specific verbs. As verb vocabulary goes up, children's ability to use the syntactic position as evidence of the thematic relation goes down. Lidz et al. argue that this effect results from the lexical statistics of the verb making an independent contribution to parsing and understanding. Because the verbs used in this study occurred in speech to children with direct objects roughly 80% of the time, this statistical information overpowered the bottom-up information derived from the syntax. In the current context, what this means is that the effect of syntax cannot be derived from the lexical statistics, since the children who know the lexical statistics use this information at the expense of syntactic information in guiding their interpretations. Finally, a theory in which syntactic cueing of meaning gets off the ground only after some meanings are first acquired via some method other than syntactic bootstrapping presupposes that those initial meanings can be acquired directly, presumably by perceiving the meaning in the situation. However, as reviewed in the beginning of this chapter, sentence meanings are exceedingly difficult to observe directly. This difficulty derives from the fact that sentences are not mere descriptions of the events around us and the fact that perspectives on events are often not shared among people, especially caregivers and children, in a given situation.
Support for this difficulty comes from the human simulation studies, which find that both adults and children are extraordinarily
Children’s Use of Syntax In Word Learning 375 bad at guessing what someone is likely to say based only on extralinguistic information (Gillette et al., 1999; Piccin and Waxman, 2007). The alternative view is that some syntactic bootstrapping effects reflect architectural features of the language faculty. Bootstrapping effects are about learners taking advantage of relations between representations, in this case relations between syntactic distribution and meaning. In the domain of action verbs, the relevant features are about the relations between syntactic positions and thematic relations. A child who can use the syntactic position of arguments to make inferences about their likely thematic relations can then use the thematic relations of those arguments to identify the event described by a sentence. And knowing the event that the sentence describes goes a long way toward identifying the meaning of the verb in that sentence. In the domain of attitude verbs, learners would need the following expectations: (a) that some verbs embed clauses and that when they do, they report a relation between the subject and some state of affairs described by the complement, (b) that certain clause types are strongly associated with certain speech acts (e.g., declarative clauses with assertions), (c) that speech acts can be indirect, and (d) that the type of complement an attitude takes is predictive of its meaning in matching the clause type of the canonical speech act that the verb lends itself to. Armed with these expectations, a learner would need to identify the surface hallmarks of various clause types in the particular language being acquired. Having done so, this information, in combination with the expectations (a–d) would allow them to classify novel attitude verbs as being either representational or preferential. It is worth noting, however, that while these prior expectations about the mapping between syntax and semantics allow learners to make rough initial classifications, there is nonetheless a broad range of verb meanings that will not be acquirable on the basis of this information. Finer subcategories of both action verbs (Levin, 1993) and attitude verbs (White and Rawlins, 2016) show systematic relations between form and meaning, but at least some of these relations appear to be idiosyncratic and language particular. For such cases, it is likely that learners’ initial expectations play less of a role in guiding acquisition. Instead, we can envision a learning model more along the lines of how gender classes are acquired (Gagliardi and Lidz, 2014), where semantic features are only probabilistically associated with syntactic features. On such a view, the semantic features would have to be acquirable based on the initial classification of the verb in concert with the rich understanding of linguistic and extralinguistic context that comes with being an experienced language user. In other words, the fact that there are biases concerning the syntax-semantics mapping that provide a foothold for children’s first steps into meaning does not preclude the possibility that later acquisitions can be driven in part by previously acquired knowledge. The second issue concerning the origins of syntactic bootstrapping effects concerns the initial acquisition of syntax. If children need syntactic information to guide the acquisition of word meaning, we are forced to the question of how children learn the syntax to begin with. One possibility that has been considered widely is that meaning
376 Jeffrey Lidz can provide a foothold into the acquisition of syntax (Pinker, 1984, 1989). But here we run into potential problems of circularity—if you need semantics to acquire syntax and syntax to acquire semantics, how can you ever get started? There are two classes of proposals for dealing with the initial steps into syntax, both of which help to avoid the potential circularity, and which are not mutually exclusive. One class of proposals holds that certain words can be acquired without the help of syntax and that these then function as anchors for subsequent category building and for the discovery of syntax (Gleitman and Trueswell, 2020; Fisher, Jin, and Scott, 2020). A second class of proposals suggests that prosodic cues to syntactic structure, in concert with the identification of functional vocabulary at the edges of prosodic constituents, allow learners to build an initial syntactic skeleton containing the information needed for identifying core syntactic categories and the subject-predicate divide (Christophe, Millotte, Limissuri and Marguler, 2008; de Carvalho, He, Lidz, and Christophe, 2019). This information would provide sufficient information to drive further acquisition of syntax and to guide early syntactic bootstrapping.
17.9 Conclusions We began this chapter noting that the world of perception and the world of language are not always in alignment. This misalignment has many sources, ranging from the diverse perspectives we can take on events, through to the wide range of conversational goals that we use language to achieve. In the worst case, utterances may achieve the same conversational goals despite having very different forms. A parent sending his children to bed might just as well say any of the following sentences: (23)
a. Go to bed.
b. It's time for bed.
c. I want you to go to bed.
d. I think it's bedtime.
Moreover, these same sentences might be used to achieve different conversational goals. If, for example, my friends invite me out for a drink after a long day at work, I might use (23d) to decline the invitation. One might think that the gulf between what we say, what we mean, and what is happening around us might make language learning next to impossible. But, as we’ve seen in this chapter, there are systematic relations between a word’s syntactic distribution and its meaning that can help bridge the gap. These systematic relations are principled, allowing them to serve as an inductive base to guide learning. They are reliably expressed in speech to children, allowing them to be detected. And children appear to be quite capable of making the inferences from syntactic form to lexical meaning. These inferences play a foundational role in word learning across the lexicon and guide children’s earliest steps into word learning.
Acknowledgments
Many thanks to Alexander Williams, Valentine Hacquard, and Laurel Perkins, all of whom have helped to clarify my thinking on these issues over the past 5–10 years. This work was supported in part by NSF grant BCS-1551629.
Chapter 18
Easy words: Reference resolution in a malevolent referent world
Lila R. Gleitman and John C. Trueswell 1
18.1 Introduction
In 2005, our research group outlined an updated theory of how language learners acquire what we called "hard words" (Gleitman, Cassidy, Nappa, Papafragou, and Trueswell, 2005). "Hard" and "easy" here pertain to the facts about acquisition. "Acquisition" in the regards we will discuss means gaining the knowledge that the concept 'dog' is expressed by the sound segment /dɒg/ in English.2 As this learning process begins, it seems reasonable to suppose that the child's only recourse for solving this problem is to examine the relation between some recurrent sound segment /dɒg/ and what is happening out in the world (in the best case, dog sightings). Hard words, roughly, are the ones whose meanings do not come readily to mind as a consequence of this word-to-world pairing procedure. On this definition the word think would be classified as "hard" because observers do not usually guess "They're thinking" upon observing thinkers thinking. In contrast, the word jump would be classified as "easy" because observers, quite consensually, guess "They're jumping" upon observing jumpers jumping. And indeed language learners acquire words like dog and jump before words like idea and think. We proposed that the acquisition of the latter requires the use of linguistic context, which must be organized as a structured syntactic representation; it is the easy words that allow the learner to construct those representations. Putting this theory in shorthand terms, the postulate was that learners could align their sampled
1 Reprinted with permission from Gleitman, L. R., and Trueswell, J. C. 2020. "Easy words: Reference resolution in a malevolent referent world." Topics in Cognitive Science 12(1): 22–47. Copyright 2020, John Wiley and Sons, Inc.
2 Notationally, we use italics for mention of a phrase or word, "double quotes" for its utterance or sound, and 'single quotes' for a concept.
Easy words 379 structured representations (syntactically analyzed sentences of, say, English) with the counterparts of these in a structured representation of the observed situation—a procedure called “syntactic bootstrapping” (Gleitman, 1990; Landau and Gleitman, 1985). This approach found support and was materially fleshed out in a series of empirical demonstrations that followed over the last few decades (e.g., Nappa, Wessell, McEldoon, Gleitman, and Trueswell, 2009; Papafragou, Massey, and Gleitman, 2006; Snedeker and Gleitman, 2003; Yuan, Fisher, and Snedeker, 2012) and, indeed, had been documented in many of its elements by several demonstrations and theoretical commentaries that led up to it (e.g., Fisher, Gleitman, and Gleitman, 1991; Fisher 1996; Gleitman, 1990; Gleitman and Landau, 1994; Lidz, Gleitman, and Gleitman, 2003; Naigles 1990, 1996; Naigles, Gleitman, and Gleitman, 1986; for reviews, Gleitman and Fisher, 2005; Fisher and Gleitman, 2002). In retrospect, progress on the hard-word front might have been expected because the explanation drew upon, and was couched in, the kind of theoretical apparatus and terminology that Cognitive Science was designed for and proved good at (cf. Turing, 1950; Chomsky, 1959; Fodor, 1975; Pinker, 1984; Wexler and Culicover, 1980). So, for instance, Fisher, Gleitman, and Gleitman (1991) could predict with relative confidence why laugh (which describes a self-caused action) occurred in one-argument (intransitive) sentences, and by so doing recover the argument-taking properties of this predicate in a formally satisfying way, with obvious pointers to a learning schema. But these same authors acknowledged that this approach casts little light on the “other part” of the meaning of laugh, namely, the “ha-ha” part. Contemporary critics, and the authors of such theories themselves, looked hopefully to “the world of observation” (how words line up with their extralinguistic contingencies) to solve this “other part” of the lexical acquisition problem, sometimes being unkind enough to remark that the “ha-ha” part was what, after all, everybody else had taken to be the meaning of laugh in the first place (cf. Pinker, 1994; Fodor, personal communication) and that we are light-years from a theory of this part. Perhaps paradoxically, the easy words and the easy parts of the hard words are the more resistant to current theorizing and explanation. In the present essay, we review some of the contemporary thinking and findings on this problem: how the “easy words” (and the “easy” parts of hard words) could be identified by a suitably endowed organism—by hypothesis, the human infant who is attempting to map words to their referents. But first, the problem and why it is so hard.
18.1.1 The hard problem of learning easy words from observation We have just asserted (as if it were a truism) that in order for language learners to acquire the meanings of their first words, they must rely exclusively on what they can observe from the immediate situational context. At this initial stage, by definition, they don’t have internal linguistic information to help with the task. For instance, from an
380 Lila R. Gleitman and John C. Trueswell utterance like “The chef baked the cake” it would be easier to learn the meaning of the word bake if you had previously learned the meaning of chef and cake. And it would be easier still if you knew that “chef ” was the Subject of the sentence, in which case you might conjecture the meaning ‘bake,’ whereas if you knew “cake” was the Subject you might conjecture ‘poisoned.’ Instead, learners without such knowledge must see (hear, feel, smell) things that exist out in the world, hear a speech segment, and hopefully make the correct connection. Speaking practically, how hard is this? Figure 18.1, which shows the vocabulary growth of the child, provides some hints to the answer to this question. Infant word learning appears to be slow and laborious until the dawn of syntax. For the first 100 words or so, which are likely acquired from observation alone (Lenneberg, 1967; Gleitman et al., 2005), the rate of vocabulary accrual is unimpressive at best. Even after 18 months, the child’s vocabulary size is still only about 100 words. But a dramatic inflection point is observed immediately thereafter, when the first rudiments of connected speech (and by implication syntax) are observed. Very similar patterns are observed in estimates of child receptive vocabulary (see Caselli, Bates, Casadio, Fenson, Fenson, Sanderl, and Weir, 1995). And indeed, as many have noted before (e.g., Chomsky, 1959; Quine, 1960), learning one’s first words is not an easy problem to solve even if we make reasonable, evidentially supported assumptions. The primary critical assumption is that even early word learning is a mapping problem between two kinds of categories (linguistic categories on the one hand, and conceptual categories on the other), and not, for example, an unprepared multi-modal sensory associative process. The child who is prepared to learn words must, for example, already have candidate word forms to work from—i.e., linguistic categories
Figure 18.1 Cumulative number of word types produced (0–1,200) as a function of age of child (10–36 months), shown separately for 13 higher-SES children (professional families), 23 middle/lower-SES children (working-class families), and 6 welfare children. Adapted from Hart and Risley (1995).
Figure 18.2 Example of a referential context. Reproduced from Medina et al. (2011).
comprised of syllables and phonemes that permit correct generalization across speakers and permit easy detection within the continuously varying speech stream. This preparation presumably takes about 6–8 months of exposure to the language to start a candidate lexicon of word forms (e.g., Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003; Saffran, Aslin, and Newport, 1996) and morphosyntax (Marcus, Vijayan, Bandi Rao, and Vishton, 1999). On the world side of the word-to-world equation, we also cannot assume that the learner's mental representation is just unprepared sensory stimulation: just as speech sounds differ across speakers and occasions, a dog sighting stimulates the mind in different ways every time it is experienced—different angle, different lighting, even a different dog. At the time of word learning, we have assumed that these different circumstances all evoke the category 'dog.'3 Here, though, is where the problem of early word learning gets hard. Very hard. Consider the situation illustrated in Figure 18.2, in which a boy is hearing the word
3 There is great disagreement as to what could be meant by the concept 'concept' in the first place. We cannot of course engage this enormous and ill-formulated problem here. Suffice it to say, some have thought that concepts are mental images (Hume). Others have thought that a concept is a definition (a set of necessary and sufficient conditions for an object to fall under the concept; Katz and Fodor, 1963). And still others have thought it is the typical or prototypical usage conditions (Rosch, 1975; Rosch and Mervis, 1975). For present purposes, we can remain blissfully neutral in these controversies, while granting, on the principle of charity, that the child's success depends on reaching consensus with the adult world on such representations. Some commentators hold that child representations differ somewhat or perhaps even radically from those of adults and change as a function of concept growth and acquisition (Carey, 1978; Gentner, 1982; Smiley and Huttenlocher, 1995), whereas others hold that information availability rather than concept change accounts for the changing character of early vocabulary (Gillette, Gleitman, Gleitman, and Lederer, 1999).
382 Lila R. Gleitman and John C. Trueswell shoe uttered in a typical context. If this confrontation with English is to be of use to this word learner, he must confront the problem of reference resolution, the main burden of the present chapter. As Figure 18.2 illustrates, shoes do not usually appear on white backgrounds in the form of a still photograph; there are many other objects, events, and qualia present with any particular shoe sighting. If the learner duly records every entity, relation, or implied motion in this scene, the task of learning looks hopeless. But a first clue is that the child is looking toward the shoes. Perhaps we can understand the problem of reference resolution by concentrating on such social-attentive cues. Several researchers have provided positive evidence in this regard. Bruner (1974/1975) usefully pointed out that speaker and child are in cahoots, concentrating their attention jointly on only certain elements in the scene. This suggestion has been backed up by a wide range of experimental findings demonstrating that the presence of these social- attentional cues to reference facilitate infant and child word learning (e.g., Baldwin, 1991, 1993, 1995; Bloom, 2000; Jaswal, 2010; Nappa, Wessell, McEldoon, Gleitman, and Trueswell, 2009; Tomasello and Farrar, 1986; Southgate, Chevallier, and Csibra, 2010; Woodward, 2003; for a recent review, see Papafragou, Friedberg, and Cohen, 2017). In her classic study, Baldwin (1991) found that 16–19-month-old infants are sensitive to a parent’s attentional stance (cued via eye gaze, head-and body-posture) when identifying the referent and learning the meaning of a novel word. Infants connected the parent’s utterance of the word to an attended object if and only if the parent was also attending to that object. An optimistic picture that could emerge from this body of evidence is that early word learning, so filtered, is not so hard after all. If social-attentional alignment between the parent and child is commonplace—the rule rather than the exception—it is possible that a large proportion of word occurrences are informative learning moments. Learners who are aggregating across these instances would be able to identify reference and meaning rapidly as they get convergence (Yu and Smith, 2007). And yet, until quite recently (e.g., Clerkin, Hart, Rehg, Yu, and Smith, 2017; Roy, Frank, and Roy, 2009), surprisingly little research has asked what the situational context is like in the home for children learning their first words. This is where our inquiries began (see Section 18.2: What is the situational context like in the home?). It is likely that knowing what the informativity patterns are in the home will allow us to offer better theories of the cognitive machinery that supports word learning (a topic we will take up in Section 18.3: What kind of learner operates on this input?).
18.2 What is the situational context like in the home?

For if we will observe how children learn languages, we shall find that, to make them understand what the names of simple ideas or substances stand for, people ordinarily show them the thing whereof they would have them have the idea; and then repeat to them the name that stands for it; as white, sweet, milk, sugar, cat, dog. (John Locke, 1690, Book 3, IX.9)
The laboratory work on reference resolution mentioned in the previous section establishes that young language learners can use several social-attentional cues to determine the referential intent of a speaker and thus acquire an unfamiliar word’s meaning. Moreover, observational research examining parent-child interactions has established that parents who spontaneously exhibit these behaviors during object play, for example, of labeling what a child is attending, have children with larger vocabularies (e.g., Harris, Jones, Brookes, and Grant, 1986; Tomasello and Farrar, 1986; Tomasello, Mannle, and Kruger, 1986; Tomasello and Todd, 1983). However, neither the laboratory nor the observational studies indicate what the rate of informative learning instances is in the home under more common circumstances. The observational work has been almost exclusively limited to object play, typically with a small set of predetermined objects. Moreover, with few exceptions, both the laboratory work and the observational work has been limited to situations when parents utter nouns with concrete meanings. This certainly overestimates the informativity of word learning situations. A child learning her first words does not know which ones have concrete meanings and which do not, nor are they likely to know which label objects and which do not. If we expand our assessment of referential informativity to content words in general (i.e., recurring word forms that receive some kind of prosodic stress above and beyond function words), and we expand our assessment to more than just object play, it is entirely possible that these informative situations are exceedingly rare and atypical.
18.2.1 Rarity of referential gems and their contribution to vocabulary growth Our own work suggests that moments of referential clarity are very rare in the home. Much of the evidence comes from a procedure we have called the Human Simulation Paradigm (HSP) first introduced in Gillette, Gleitman, Gleitman, and Lederer (1999). The authors suggested that different sources of evidence as to an unfamiliar word’s meaning might be available from various aspects of the input stream. In principle, as we have discussed so far, the situational context (“observation”) would supply some evidence; later, the co-occurrence of the unfamiliar word with familiar ones (“distribution”) could supply further evidence; and later still, the structural position of the unfamiliar word in the sentence (“syntax”) could provide yet even more evidence. An experimental procedure was developed to estimate the relative contribution of these several sources of evidence taken alone or jointly. Specifically, a video corpus was built of examples of parents uttering common nouns and verbs within natural sentences to their offspring, partitioned into 40-second “vignettes” in which the word of interest
384 Lila R. Gleitman and John C. Trueswell (henceforth, the “mystery word”) had been uttered by the adult 30 seconds into each vignette. For each mystery word, six vignettes were chosen randomly from the corpus. One group of participants observed the vignettes with the sound off with the mystery word indicated by an audible beep at its exact occurrence, a proxy for the extralinguistic situation taken alone (“observation”). A second group saw just “distributional” evidence for the mystery word, that is, an alphabetical list of the nouns as they occurred in each of these six sentences (e.g., “BLEPPED: cake, chef ” for the sentence “The chef BAKED the cake.”). A third group was provided with “jabberwocky” versions of the parent’s utterances, that is, with content words replaced with nonsense words (“syntax”: “The florp BLEPPED the dax”). Further participant groups received two or all three of these kinds of information together (see especially Snedeker and Gleitman, 2003). In all of these conditions, participants were given six examples of any one word in a row and, in Gillette et al., participants were told in advance whether the mystery word was a noun or a verb. The central finding of this work was that the different participant groups learned different aspects of vocabulary based on these different input representations. Those who received only the observational evidence learned mainly concrete nouns and not verbs, thus reproducing in adults the first stages of vocabulary acquisition by infants (e.g., as documented in Bates, Dale, and Thal, 1995). Adults who were provided with the “distribution” and especially “syntax” information successfully identified more abstract words including the verbs and particularly those with abstract meaning, thus reproducing the aspects of later child vocabulary. The bottom line conclusion was that the course of early vocabulary learning is legislated by the order in which the child has access to the different sources of information and not perhaps by some hypothesized conceptual change in the nature of the learner—information change rather than conceptual change. The child hears sentences in situational contexts from the beginning but appreciates the distributional and syntactic cues only at later stages. Perhaps the most relevant aspect of the Gillette et al. (1999) findings to the present discussion is that the initial learning procedure (“observation”) was somewhat unimpressive even in its own terms. Adults in the “observation” condition, who were given a video recording of the local, seemingly quite relevant, situational context, were only able to guess the correct meaning of 40% of the most common nouns and 15% of the most common verbs. Even this is an overestimate of the informativity of the observational database because: (1) all six vignettes of, for example, ‘dog’-mentionings, were presented in sequence, after which learners offered a final guess (this massed-trial procedure is surely not a feature of everyday conversation); and (2) participants were actually told whether the mystery word was a noun or a verb (again surely not information explicitly provided to real learners). This raises the question of how informative observational evidence actually is for first words. To this end, our group set out to understand better the true patterning of parental word use in the home that might be underpinning the acquisition of first words. Medina, Snedeker, Trueswell and Gleitman (2011: Exp. 
1) used the “observational” condition of the HSP to answer this question. Medina et al. improved the procedure of Gillette et al. (1999). They used a new vignette corpus that sampled more widely across common
Easy words 385 daily activities, such as bath, meal, and playtime. As in Gillette et al., the resulting video corpus consisted of 288 vignettes in which parents uttered common content nouns or common content verbs (144 vignettes each, 24 word types each, 6 vignettes for each word). Videos were again muted and a beep occurred just as the parent had uttered the mystery word. Unlike as in Gillette et al., participants were not told which words were nouns and which were verbs, nor were the six examples of a word strung together in a row; vignettes were intermixed and appeared in a random order, eliminating the benefit of comparing information from successive instances of a single word. Under these more natural sampling conditions, nouns were guessed correctly only 17% of the time and verbs a paltry 6% of the time. Medina et al. also noted that not all word exposures (i.e., vignettes) are created equal. Only a small percentage (7%) were informative above the 50% accuracy rate. We can think of these as referential “gems,” in which the situational context, including the social interaction between parent and child offered enough information to allow for the majority of naïve observers to guess what the parent was saying. Strikingly, all gems were nouns, and none verbs. Moreover, most other vignettes might be characterized as referential “junk,” yielding a less than 33% accuracy rate. Worse, observers typically offer a scatter of different false conjectures rather than one or two (as also observed in Gillette et al.). The picture that emerges, then, is a learning environment in which most instances of a given word’s use are uninformative but punctuated by rare occasions of referential clarity. But before taking these findings and conclusions at face value, a pressing question that must be answered is whether adult responses in the HSP are in fact an adequate proxy for the referential experiences of the children in these videos. After all, the videos are taken from a third-person angle and judgments were from adults rather than children. At least three findings in the literature alleviate these concerns: (1) Children perform similarly to adults on the HSP task. It is possible that responses from adults in the HSP overestimate the complexity of referential contexts, since adults may possess more concepts and more ways of interpreting scenes. Children may consider far less information and thus may be better at identifying referents if they preferentially select common word meanings —a form of the “less is more” argument (Elman, 1993; Newport, 1990). However, results do not support this possibility. Adult HSP viewers tend to guess meanings corresponding to children’s first words (Gillette et al., 1999). Moreover, child-friendly HSP studies have been conducted, in which children (4 years of age and older) produce results similar to adults (Medina et al., 2011; Piccin and Waxman, 2007). Rather than doing better than adults, children perform worse overall (Medina et al.) but nevertheless present similar patterns in their data. In particular, similar to adults, they are more accurate on nouns than verbs (Piccin and Waxman) and find the same contexts highly informative (Medina et al.). (2) Third-person angle is informative. The child’s own view in these HSP vignettes is quite different from the one offered to HSP observers who viewed the situation from a third-person camera angle. If HSP observers were not focusing on where the child
386 Lila R. Gleitman and John C. Trueswell was looking in these videos, it is possible that HSP observers consider very different information than the children in these videos. After all, the third person videos step back and show a rather broad view of the whole scene, while the child’s focused vision is much more restrictive (Smith, Colunga, and Yoshida, 2010; Smith, Yu, and Pereira, 2011). The question of course is whether that distinction, dramatic as it seems, makes a difference. Yet, when compared, HSP responses from videos from a first-person head-mounted child camera yield very similar accuracy results to responses from videos of the same parent-child interactions viewed from a third person fixed camera angle (Yurovsky, Smith, and Yu, 2013). In particular, HSP accuracy was exactly 58% correct from both angles and yielded very similar distributions of accuracy across items. (Higher accuracy was observed here than all other HSP studies presumably because only object play was considered and only utterances that referred to co- present objects were selected as stimuli.) First person advantages were very small and were found only in a second study using the least informative learning instances. The lack of substantial differences between first and third-person camera angles is less surprising when one considers that adult observers are able to judge what another person is looking at under live-action conditions in which head-turn and gaze information are present (Doherty, Anderson, and Howeieson, 2009). (3) Effects of input conditions on vocabulary size and growth. Now we turn to the truly newsworthy outcomes of such studies—that is, evidence of real-world applicability of these laboratory HSP findings (Cartmill, Armstrong, Gleitman, Goldin- Meadow, Medina, and Trueswell, 2013). Cartmill et al. compiled video vignettes of parents uttering common concrete nouns to their 14-to 18-month-olds sampled from 56 different families ranging in Socio-Economic Status (SES) (as part of a larger longitudinal study, see e.g., Goldin-Meadow, Levine, Hedges, Huttenlocher, Raudenbush, and Small, 2014). Cartmill et al. were able to show that differential informativity (that is, the ratio of gems to junk) varies greatly across families and predicts measures of vocabulary size when assessed three years later at school entry. Specifically, for the most referentially transparent families, HSP participants guessed the parent’s intended meaning 45% of the time, and for the most referentially opaque families only 5% of the time! This difference predicted Peabody Vocabulary test scores measured at kindergarten-entry, with this relationship holding even after controlling for the amount of talk in the home (see Figure 18.3). Cartmill et al. found that both quantity of talk and its quality (i.e., its referential transparency) predict vocabulary growth. Notably, although the amount of talk was found to be positively related to family income (as observed in many studies previously, e.g., Hart and Risley, 1995), the difference in the proportion of gems to junk is a familial and not an SES variable. Although more advantaged families talk more to their children, they do not provide as a group a larger proportion of informative learning instances. What this means commonsensically is that it matters whether you are talking with your children rather than at them. That is, the difference comes down to commenting on the visibly passing scene more than on beliefs, desires, and commands pertaining to the world at large.
Figure 18.3 The quality of referential environment in the home predicts child vocabulary size three years later. Each data point represents a parent-child dyad. The x-axis is the average accuracy with which naïve adult observers could guess what the parent in this dyad was saying to their 14–18-month-old child, based on a set of muted video vignettes of everyday parent-child interactions in the home at age 14–18 months (Human-Simulation-Paradigm, HSP, Accuracy). The y-axis is that same child’s vocabulary size three years later at school entry (Peabody Picture Vocabulary Test, PPVT, at age 54 months). Panel A is the direct relationship, and Panel B is the relationship after controlling for the quantity of early input (the average number of words per minute uttered by the parent to their child at 14–18 months). Reproduced from Cartmill et al. (2013).
18.2.2 Characterizing referential gems Another pressing question that all of this work raises is what exactly makes a referential act a gem? If laboratory evidence is a good guide, one might expect that social-attentive behaviors present in these videos determine moments of referential clarity. Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow, and Gleitman (2016) reported a large- scale video coding project of the Cartmill et al. (2013) HSP vignettes and confirmed that social-attentive behaviors are indeed associated with referential clarity, but they also identified a crucial amendment to this conclusion: it is the precise timing of these cues to reference, not their mere presence, that determines referential clarity. For each 40 second HSP vignette from the Cartmill et al. (2013) study, trained coders annotated the videos on a moment-by-moment basis for the presence of social-attentive cues to reference, including referent presence, parent attention to the referent, child attention to the referent, and parent gesture/manipulation of the referent. The timing of this information was found to predict the HSP accuracy score for each vignette. For instance, although low informative vignettes (junk) were overall less likely to have the referent object present during the interaction as compared to high informative gems, Figure 18.4 shows that it is really the sudden appearance of the referent just prior to its mention that is an informative cue. The importance of the timing was also observed for social-attentive cues to reference, such that: (1) gems, not junk, are more likely to contain a sudden shift in attention by
Figure 18.4 Object appearance, not continuous presence, characterizes high informative contexts. Proportion of Human-Simulation-Paradigm (HSP) vignettes in which the object being referred to by the parent was coded as present, on a second-by-second basis. High Informative Vignettes were ones where HSP observers guessed the parent's utterance over 50% of the time. These vignettes were rare and characterized by an increased probability of the referent being present just before word onset, peaking at essentially 100% presence one second after. Low Informative Vignettes were those where HSP observers guessed correctly less than 10% of the time. These vignettes were common, had the referent present in about 50% of vignettes, with no changes over time. Medium Vignettes were the rest of all vignettes and fell in between. Shaded areas reflect the time periods for which the social-attentive behavior was reliably predicted by HSP accuracy scores.

Chapter 19
Early Logic and Language
Stephen Crain

From both a distributional and a meaning perspective, then, some and any appear to be polar opposites at first look. However, both some and any are sometimes permitted in the same linguistic environments and are assigned the same interpretation in these environments.2 For example, both some and any can appear in the subject phrase of sentences with the universal quantifier, every. This is illustrated in (5). The sentences in (6) show that both expressions can also appear in the predicate phrase of sentences with the focus adverb, only. Notice that there is no detectable difference in meaning between the (a) and (b) versions of these sentences. (5)
a. Every passenger who ate some of the fruit became ill. b. Every passenger who ate any of the fruit became ill.
(6)
a. Only the passenger who became ill had eaten some of the fruit. b. Only the passenger who became ill had eaten any of the fruit.
Although some and any are interchangeable in (5) and (6), a change in structural position in these sentences resurrects the polar opposition between them. As illustrated in (7a), some but not any can appear in the predicate phrase of sentences with the universal 1 English
any is in a class of expressions called Negative Polarity Items. One linguistic environment that accepts Negative Polarity Items is called Downward Entailing. For seminal work on this topic, see Ladusaw (1979), Klima (1964), Baker (1970). 2 The subject phrase of the universal quantifier every and the predicate phrase of the focus adverb only are in the class of Downward Entailing expressions (see footnote 1).
Early Logic and Language 403 quantifier, every, and (7b) shows that some but not any can appear in the subject phrase of sentences with the focus adverb, only. (7) a. Every passenger who became ill ate some/*any of the fruit. b. Only the passenger who ate some/*any fruit became ill. To master the distributional properties of English any, then, language learners must distinguish between the subject phrase (SUB) and the predicate phrase (PRED) of sentences, in order to keep track of the structural asymmetries in the acceptability of any. These asymmetries are represented in (8) and (9). (8) a. Every SUB[passenger who ate any of the fruit] PRED[became ill] b. Every SUB[passenger who became ill] PRED[ate *any of the fruit] (9) a. Only SUB[the passenger who ate *any of the fruit] PRED[became ill] b. Only SUB[the passenger who became ill] PRED[ate any of the fruit] A further challenge for child language learners is illustrated in example (10), which adds negation to example (8b). Without negation, any is not tolerated in the predicate phrase of sentences with the universal quantifier, every. However, adding negation makes any acceptable. (10) a. Not every passenger who became ill ate any of the fruit. b. Not 8b[Every SUB[passenger who became ill] PRED[ate any of the fruit]] This section has emphasized the distributional and interpretive challenges facing child language learners in attaining adult-like competence with the indefinite expressions some and any. For the most part, children have been found to be up to these challenges, as we now discuss.
19.3 Any and Some in child language 19.3.1 Any The indefinite expression any emerges early in English-speaking children’s spontaneous speech. The word any is produced by children as young as 2;0 (Tieu and Lidz, 2016: 315). A survey of the transcripts of 40 children, by Tieu (2010, 2013), reported that 26 of these children produced at least 15 utterances with any, and less than 3% of them lacked a licensing expression, such as negation (Tieu, 2013, 48). For example, one child (Abe) produced 228 utterances with any between the ages of 2;4 and 5;0, and negation was present 95% of the time in these utterances (Tieu, 2013, 54).
404 Stephen Crain The first experimental investigation of children’s knowledge of the licensing conditions on English any was conducted by O’Leary and Crain (1994). The study used a Truth Value Judgment task with an elicitation component. One experimenter acted out stories in front of each child participant and a puppet, Kermit the Frog, who was played by a second experimenter. At the end of each story, Kermit produced a sentence, and the child’s task was to judge whether Kermit’s sentence accurately described what had happened in the story. On the critical trials, Kermit’s sentences were false. Whenever a child rejected one of Kermit’s false sentences, the child was asked to explain to Kermit “what really happened in the story.” The justifications children gave for rejecting Kermit’s false statements provided critical information about their knowledge of the licensing conditions on any. Kermit produced two types of false sentences with any. These are illustrated in (11) and (12). In response to these sentences, child participants with adult-like linguistic knowledge were expected to produce affirmative sentences where any is not licensed. These children were therefore expected to produce sentences with some instead of any. (11)
Type 1 Kermit: None of the Ninja Turtles got any presents from Santa. Child: No, this one found something from Santa.
(12)
Type 2 Kermit: Only one of the reindeer found any food to eat. Child: No, every reindeer found something to eat.
Eleven child participants took part in the study. They ranged in age between 4;4 and 5;4. In responding to Kermit’s false Type 1 and Type 2 sentences, the child participants almost never reproduced any. Instead, children’s sentences often contained some, as illustrated in (11) and (12). Consider the child’s justification in (11). This is an affirmative sentence and, as expected, it contains some, rather than repeating Kermit’s preceding use of any. Another child’s justification, in (12), contains the universal quantifier, every. For adults, any is not tolerated in the predicate phrase of sentences with the universal quantifier (see 8b). The child participants in the study also avoided using any in this structural position, inserting some instead. The fact that child participants reverted to some in both of these linguistic contexts was taken as evidence that they were aware of the restrictions on the linguistic environments that admit any. A study by Thornton (1995) further assessed children’s understanding of any in two types of Yes/No questions with negation, as illustrated in (13) and (14). (13)
Did any of the turtles not buy an apple?
Answer: These turtles didn’t.
(14)
Didn’t any of the turtles buy an apple?
Answer: These turtles did.
In (13), any takes scope over negation, so the question asks if there are turtles that did not buy apples. In (14), negation takes scope over any, so the question asks whether
Early Logic and Language 405 there are turtles that bought apples. The Thornton study tested ten 3-and 4-year-old children. In response to (13), the child participants answered “Yes” 93% of the time and pointed to the turtles that had not purchased apples. In response to (14), the child participants also answered “Yes” 85% of the time, but this time they pointed to the turtles that had purchased apples. By age 4, then, English-speaking children have achieved adult-like mastery of the different ways that any interacts with negation in Yes/No questions.
19.3.2 Some As noted earlier, negation takes scope over any in negative sentences such as (15). By contrast, when some replaces any, as in (16), some is interpreted as taking scope over negation. For adult English speakers, (16) can be paraphrased using the cleft sentence in (17), where some takes scope over negation in the surface syntax. We refer to this as the polarity sensitivity of some. (15)
One of the dinosaurs did not eat any food.
(16)
One of the dinosaurs did not eat some food.
(17)
There is some food that one of the dinosaurs did not eat.
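To see the scope contrast explicitly, the two readings can be given a rough logical paraphrase. The formulas below are offered only as an informal illustration (the predicate names are shorthand, not notation used in this chapter): sentence (15), with negation taking scope over any, and the adult reading of (16)–(17), with some taking scope over negation, come out roughly as follows.

NOT > ANY, as in (15):  ∃x [dinosaur(x) ∧ ¬∃y (food(y) ∧ ate(x, y))]  'there is a dinosaur that ate no food at all'
SOME > NOT, as in (16)–(17):  ∃x ∃y [dinosaur(x) ∧ food(y) ∧ ¬ate(x, y)]  'there is a dinosaur and some food that it did not eat'

On the natural assumption that there is at least one piece of food under discussion, the first reading entails the second, but not vice versa.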
In addition to the Type 1 and Type 2 sentences with any, discussed earlier, the O’Leary and Crain (1994) study included two additional conditions that were specifically designed to investigate children’s knowledge that some is polarity sensitive. In one condition, the word some was combined with the universal quantifier every (Type 3). In another condition, some was combined with the focus adverb only (Type 4). (18)
Type 3 Puppet: Every dinosaur found some of the food. Child: No, this one didn’t find any of the food.
(19)
Type 4 Puppet: Only one of the friends gave some presents to Gonzo. Child: No, none of the friends gave any presents to Gonzo.
Like the Type 1 and Type 2 sentences, the Type 3 and Type 4 sentences that Kermit the Frog produced did not accurately describe the events that had taken place in the stories. Therefore, the child participants were expected to reject these sentences. To justify their rejections, child participants who knew that some is polarity sensitive were expected to reformulate Kermit’s sentences, using any instead of some, because the experiment was designed to elicit sentences with negative expressions such as “not” and “none.”
406 Stephen Crain To illustrate the experimental design, we will describe the story context for the Type 3 sentence (18). The story was about three dinosaurs who were searching for food. Two of the dinosaurs succeeded in finding some food, but one dinosaur failed to find any. Following the story, Kermit produced sentence (18), Every dinosaur found some of the food. To explain what really happened in the story, the child was required to use a negative sentence with any: No, this one didn’t find any of the food. Suppose children produced negative sentences with some, for example, No, this one didn’t find some of the food. This sentence does not convey the intended meaning for adult speakers of English, since some is polarity sensitive for adults. Assuming that children’s negative sentences with some did, in fact, convey the intended meaning, which adults express using any, these sentences by the child participants were taken as evidence that some is not polarity sensitive in the grammars of these children. The child participants in the O’Leary and Crain study frequently produced non-adult negative sentences with some. Examples are in (20). Presumably, in the grammars of these children, some and any are in free variation in negative sentences, just as they are for adult speakers in the subject phrase of sentences with the universal quantifier every (see (5)), and in the predicate phrase of sentences with the focus adverb only (see (6)). (20)
a. He didn't get something to eat. (C.E-K. 4;6)
b. Well, they didn't get some food. (E.E. 4;7)
c. None people had some presents. (E.P. 4;9)
d. So he didn't get some money. (E.G. 4;10)
A comprehension study by Musolino (1998) reached a similar conclusion—that young English-speaking children do not analyze the indefinite expression some as being polarity sensitive. The Musolino study used a Truth Value Judgment task. The test sentences were all similar to (21), with negation taking scope over some in the surface syntax. (21)
The detective didn’t find some guys.
The indefinite expression some is polarity sensitive for adult English speakers, so adults accept (21) as a description of a story in which a detective tried to find three guys, but only succeeded in finding one. We saw earlier, however, that negation can take scope over some for young English-speaking children, such that some is interchangeable with any in children’s grammars. We would therefore expect some child participants to reject (21) as a description of the story in which the detective found one guy. There were 30 child participants in the Musolino study. The child participants were divided into two groups by age. The younger group consisted of 15 children ranging in age from 3;10 to 5;2. The remaining 15 children ranged in age from 5;2 to 6;6. The proportion of children’s adult-like “Yes” responses to the puppet’s statements increased with age. The younger children only accepted the test sentences, such as (21), 35% of the time.
Early Logic and Language 407 By contrast, the older group of children accepted the test sentences 65% of the time, and a control group of adults accepted them 100% of the time. Statistically, the pattern differed significantly across groups. As before, children’s justifications for rejecting the test sentences were revealing. On the trial with the detective, younger children indicated that the test sentence was wrong because the detective did find one guy. This finding reinforces the conclusion that English-speaking children do not initially analyze some as polarity sensitive. Instead, they interpret some in its surface position, where it has the same meaning as any in negative sentences. For older children and adults, some is polarity sensitive, taking scope over local negation. To summarize, both the transcripts of children’s spontaneous speech and experimental investigations of children’s productions and comprehension invite the conclusion that, by age 4, children assign an adult-like analysis to the indefinite expression any (cf. van der Wal, 1996). By contrast, the research findings invite a different conclusion about some. Whereas some is polarity sensitive for adults and older children, some is interpreted in situ in the grammars of young children.
19.4 Disjunction in logic and in language In classical logic, disjunction is represented by the symbol ‘∨.’ Classical logic defines disjunction as inclusive-or, so a formula of the form A ∨ B is true if either one of the disjuncts is true and also if both disjuncts are true. In English, a disjunction phrase, e.g., an apple or a banana, also has an inclusive-or interpretation in certain linguistic contexts. For example, the inclusive-or interpretation is accessed when a disjunction phrase appears in the antecedent clause of a conditional statement, as illustrated by (22). This is why (22) and (23) can both be true. (22)
If Yi-ching gets an apple or a banana, she will be happy.
(23)
Yi-ching got both an apple and a banana. She was thrilled.
The negation of a disjunction is represented in classical logic as ¬ (A ∨ B). A negated disjunction excludes the possibility that A and it excludes the possibility that B. This makes a negated disjunction equivalent to a conjunction of the two exclusions, as represented in (24), where ‘∧’ is the logical symbol for conjunction. (24)
¬ (A ∨ B) entails (¬ A ∧ ¬ B)
As in classical logic, there is an equivalence in English between a negative statement with disjunction, as in (25), and a conjunction of two negative statements (26).
408 Stephen Crain (25)
Yi-ching didn’t get an apple or a banana.
(26)
Yi-ching didn’t get an apple and Yi-ching didn’t get a banana.
Adult speakers of some languages do not judge negated disjunctions to be true only if both disjuncts are excluded. When (25) is translated into these languages, adult speakers accept the analogs to sentence (25) in circumstances where at least one of the disjuncts is excluded. In English, there are two ways to paraphrase the truth conditions assigned to negated disjunctions by adult speakers of these languages. One way is to use disjunction, as in (27). The other way is to reverse the order of disjunction and negation, as in the cleft sentence in (28). In both cases, the disjunction word is inclusive-or, so both (27) and (28) are true in circumstances where Yi-ching didn’t get either an apple or a banana. In English, by contrast, this is the only circumstance in which sentence (25) is true. (27)
Yi-ching didn’t get an apple or Yi-ching didn’t get a banana.
(28)
It was an apple or a banana that Yi-ching didn’t get.
It is plausible to suppose that the word for disjunction has the same basic meaning in all languages, namely inclusive-or. However, languages vary in how disjunction interacts with negation, i.e., which of the two expressions takes wider scope (see section 19.6). The interpretation of disjunction can vary even within a language, as we discuss in the next two sections.
19.4.1 Or and some: the “either. . . or . . . ” interpretation In adult English, the disjunction word or expresses different truth conditions in different linguistic environments. These different interpretations of or closely track the distributional facts about the indefinite expressions some and any. One interpretation of the disjunction word or is assigned in structural positions that tolerate some, but where any cannot appear. Sentences (29)–(31) illustrate three such contexts. The first is the predicate phrase of affirmative sentences, as in (29). The second is the predicate phrase of sentences with the universal quantifier, every, as in (30). The third is the subject phrase of sentences with the focus operator only, as in (31). (29)
a. The passenger ordered some /*any dinner. b. The passenger ordered the chicken or the fish.
(30)
a. Every passenger ordered some /*any dinner. b. Every passenger ordered the chicken or the fish.
(31)
a. Only some /*any passengers ordered dinner. b. Only Aijun or Yurong ordered dinner.
Early Logic and Language 409 In the (b) examples of all three sentences where some can appear, we have replaced some by a disjunction phrase. In these sentences, adult speakers assign an ‘either. . . or . . . ’ interpretation to the disjunction phrase. That is, for adults, disjunction typically invites the inference that not both of the disjuncts are true. For example, adults typically infer from (31b) that only Aijun or only Yurong ordered dinner.3 This “exclusivity” inference is seen to be derived from the fact that, had the speaker been confident that both Aijun and Yurong had ordered dinner, then the speaker would have produced an alternative sentence with and, instead of using or (Grice, 1975). The fact that the speaker did not produce the alternative sentence with and invites the hearer to infer the speaker was not confident that both Aijun and Yurong had ordered dinner. The hearer draws the inference that not both of them ordered dinner. The exclusivity inference is credited to conversational norms that implore speakers to use the most informative statements that they are in a position to make. Interestingly, children younger than about 6 years of age are far less sensitive than adults are to the exclusivity inference associated with disjunction (e.g., Chierchia, Crain, Guasti, Gualmini, and Meroni, 2001; Chierchia, Guasti, Gualmini, Meroni, and Crain, 2004; Foppolo, Guasti, and Chierchia, 2012).
19.4.2 Or and any: the conjunctive interpretation In English, disjunction is assigned a conjunctive interpretation in linguistic environments in which any can appear (see Chierchia, 2013). The conjunctive interpretation corresponds to the interpretation assigned to negated disjunctions in classical logic, where ¬ (A ∨ B) entails (¬ A ∧ ¬ B).4 Accordingly, the English sentence (32b) entails (32c). (32)
a. The passenger did not order any dinner. b. The passenger did not order the chicken or the fish. c. The passenger did not order the chicken and did not order the fish.
We saw earlier that the English indefinite expression any is permitted in the subject phrase of the universal quantifier, every, but not in the predicate phrase. This difference in acceptability is shown in (33). (33) a. b.
Every SUB[passenger who ordered any dinner] PRED[became ill]. Every SUB[passenger who became ill] PRED[ordered *any dinner].
3 There are other pragmatic inferences associated with disjunctive statements. For example, (31b) invites the inference that the speaker is unsure which of the two people, Aijun or Yurong, ordered dinner. The acquisition of pragmatic inferences is beyond the scope of the present article (see Grigoroglou and Papafragou, this volume). 4 Both Jackendoff (1972) and Jayaseelan (2001) have proposed that the meaning of any is disjunctive (this or this or this or . . . ).
The parallel between the acceptability of any and the conjunctive interpretation of or leads to the prediction that the disjunction phrase will be assigned a conjunctive interpretation in (34a), but not in (34b), where the disjunction phrase is expected to have an 'either . . . or . . .' interpretation.

(34) a. Every SUB[passenger who ordered the chicken or the fish] PRED[became ill].
     b. Every SUB[passenger who became ill] PRED[ordered the chicken or the fish].

We also noted earlier (see example (10)) that, when negation is added to sentences with the universal quantifier every, the indefinite expression any can appear in the predicate phrase, where it was previously not accepted. The acceptability of any in this position is illustrated in (35). This leads us to predict that a disjunction phrase will be assigned a conjunctive interpretation in this position. The prediction is verified in (36a), which has the truth conditions stated in (36b). We saw in (34b) that, without negation, disjunction is not assigned a conjunctive interpretation in the same position.
(35) Not every SUB[passenger who became ill] PRED[ordered any dinner].

(36) a. Not every SUB[passenger who became ill] PRED[ordered the chicken or the fish].
     b. At least one passenger who became ill did not order the chicken and did not order the fish.
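The prediction in (34) can also be checked mechanically. Because the subject (restrictor) phrase of every is a downward-entailing position, a disjunction there licenses a conjunctive entailment, whereas the predicate phrase does not. The sketch below is our own illustration under stated assumptions: passengers are encoded as triples of Boolean properties, and the helper names are hypothetical.

```python
from itertools import product

# Brute-force model check of the prediction in (34): each passenger is encoded as a
# triple of Booleans (ordered_chicken, ordered_fish, became_ill).

def models(n_passengers):
    """All ways of assigning the three properties to n passengers."""
    person_states = list(product([True, False], repeat=3))
    return product(person_states, repeat=n_passengers)

def every(passengers, restrictor, predicate):
    """True iff every passenger satisfying the restrictor also satisfies the predicate."""
    return all(predicate(p) for p in passengers if restrictor(p))

def entails(premise, conclusion, max_size=3):
    """No model of up to max_size passengers makes the premise true and the conclusion false."""
    for n in range(1, max_size + 1):
        for m in models(n):
            if premise(m) and not conclusion(m):
                return False
    return True

chicken, fish, ill = (lambda p: p[0]), (lambda p: p[1]), (lambda p: p[2])

# (34a): disjunction in the subject (restrictor) phrase of 'every'
subj_disjunction = lambda m: every(m, lambda p: chicken(p) or fish(p), ill)
subj_conjunctive = lambda m: every(m, chicken, ill) and every(m, fish, ill)

# (34b): disjunction in the predicate phrase of 'every'
pred_disjunction = lambda m: every(m, lambda p: True, lambda p: chicken(p) or fish(p))
pred_conjunctive = lambda m: every(m, lambda p: True, chicken) and every(m, lambda p: True, fish)

print(entails(subj_disjunction, subj_conjunctive))  # True: conjunctive entailment in the subject phrase
print(entails(pred_disjunction, pred_conjunctive))  # False: no conjunctive entailment in the predicate phrase
```

The second call returns False because a model containing, for example, a single passenger who ordered only the chicken verifies (34b) but not its conjunctive paraphrase.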
We now discuss when child language learners acquire these subtle facts about how disjunction combines with other logical expressions.
19.5 Or in child language

A study by Gualmini, Meroni, and Crain (2003) used a Truth Value Judgment task to investigate children's interpretation of disjunction in the subject phrase versus the predicate phrase of sentences with the universal quantifier, every. The child participants were asked to judge sentences like (37) and (38).

(37) Every woman bought eggs or bananas.

(38) Every woman who bought eggs or bananas got a basket.
On a typical trial, sentence (37) was presented to children in a context in which every woman bought eggs, but none of them bought bananas. The child subjects consistently accepted (37) in this condition, showing that they assigned a disjunctive interpretation to or in the predicate phrase of the universal quantifier, every. On another trial, sentence
(38) was presented in a different context, in which the women who bought eggs were given a basket, but the women who bought bananas were not given a basket. The child subjects consistently rejected the test sentences in this condition. This finding is evidence that children generated a conjunctive interpretation for disjunction in the subject phrase of every. A variant of the Gualmini et al. (2003) study was conducted in Mandarin Chinese by Su and Crain (2013). The findings were essentially the same. It appears, then, that children acquiring even typologically distinct languages reach the same conclusions about the asymmetrical interpretation of disjunction. Another asymmetry was investigated in the Su and Crain study. They investigated children's interpretation of the Mandarin disjunction word huozhe 'or' in sentences like (39) and (40). In (39), disjunction appears in the predicate phrase of the negative quantifier meiyou 'not.' In (40), disjunction appears in the predicate phrase of the quantifier mei-zhi . . . dou 'every . . . all.' In adult Mandarin, disjunction is assigned a conjunctive interpretation in (39) and an 'either . . . or . . .' interpretation in (40).

(39) Meiyou xiaogou zhai-guo baoshizihua huozhe baoshihonghua.
     not dog pick-ASP purple jewel or red jewel
     'None of the dogs picked a purple jewel or a red jewel.'

(40) Mei-zhi xiaogou dou zhai-le baoshilvhua huozhe baoshihonghua.
     every-CL dog all pick-ASP green jewel or red jewel
     'Every dog picked a green jewel or a red jewel.'
There were 24 child participants in the study. They ranged in age from 4;01 to 5;03. Sentences (39) and (40) were presented to different child participants, following the same story. In the story, Minnie Mouse was giving gifts to four dogs. Minnie Mouse put out lots of things for the dogs to choose from, including red jewels, green jewels, and purple jewels. Several dogs thought about taking a purple jewel but, in the end, two dogs picked red jewels and two dogs picked green jewels. Here are the main findings. Children rejected sentences with meiyou 'not' 90% of the time. They justified their rejections of (39) by pointing out that two dogs chose red jewels. For the child participants, then, (39) entailed that the dogs did not pick purple jewels and did not pick red jewels. So, this is evidence that children computed the conjunctive interpretation of disjunction for sentence (39). By contrast, children accepted sentences with mei-zhi . . . dou 'every . . . all,' as in (40), 94% of the time. We saw that adding negation to sentences with the universal quantifier every results in a change in the interpretation of disjunction in the predicate phrase. Without negation, disjunction is assigned an 'either . . . or . . .' interpretation, as in (41). When negation is added, disjunction is assigned a conjunctive interpretation, as in (42).

(41) Every princess took a shell or a star.
     ∴ There is no princess who did not take a shell or a star.

(42) Not every princess took a shell or a star.
     ∴ At least one princess did not take a shell and did not take a star.

A study by Notley, Zhou, and Crain (2012) used a Truth Value Judgment task to investigate children's knowledge of this reversal in the interpretation of disjunction. The story associated with sentence (42) was designed so that children should reject (42) if they assigned a conjunctive interpretation to the disjunction phrase, a shell or a star. In the story, all of the princesses refused to take shells, but all of them took stars. The fact that all the princesses took stars makes sentence (42) false on the conjunctive interpretation of disjunction. The participants were 22 English-speaking children, mainly 4-year-olds. Despite the complexity of the test sentences and the story contexts, the pattern of linguistic behavior by the child participants coincided with adult-like behavior. The child participants consistently accepted test sentences like (41), where the princesses had all taken stars. By contrast, the child participants rejected sentences like (42) 82% of the time, and justified their rejections of (42) by pointing out that the princesses had all taken stars. Taken together, the findings of the studies reviewed in this section provide compelling evidence that, by age 4, children know a great deal about how different quantifiers interact with disjunction.
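The paraphrase given for (42) reflects a general equivalence: negating a universal statement whose predicate contains a disjunction yields an existential claim about an individual for whom both disjuncts fail. The short check below is our own illustrative sketch (the tuple encoding of the princesses is an assumption we introduce); it also shows why the Notley, Zhou, and Crain (2012) story, in which every princess took a star, makes (42) false on the conjunctive reading.

```python
from itertools import product

# "Not every princess took a shell or a star" is true exactly when at least one
# princess took neither object. Princesses are encoded as (took_shell, took_star) pairs.
for princesses in product(product([True, False], repeat=2), repeat=3):
    not_every = not all(shell or star for shell, star in princesses)
    some_neither = any((not shell) and (not star) for shell, star in princesses)
    assert not_every == some_neither

# The story: no princess took a shell, every princess took a star.
story = ((False, True), (False, True), (False, True))
print(not all(shell or star for shell, star in story))  # False: (42) is false, so rejection is the adult response
```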
19.6 The interpretation of disjunction across languages

In English, if negation takes scope over disjunction in the surface syntax, then negation also takes scope in the semantic interpretation. For example, the English sentence (43) excludes the possibility that Ali speaks Spanish and it excludes the possibility that Ali speaks French. The same exclusions hold in classical logic (see e.g., Crain, 2012; Szabolcsi, 2002; Verbuk, 2006).

(43) Ali doesn't speak Spanish or French.
This is not true in other languages. Example (44) is the Mandarin Chinese translation of the English sentence (43), where the Mandarin disjunction word is huòzhě 'or' and the negation word is bu 'not.' Adult speakers of Mandarin judge (44) to express a meaning that can be paraphrased using the English cleft sentence (45), where disjunction takes scope over negation in the surface syntax.

(44) Ali bu shuo xibanyayu huozhe fayu.
     Ali not speak Spanish or French
     'Ali does not speak Spanish or Ali does not speak French.'

(45) It is Spanish or French that Ali does not speak.
Notice that Mandarin and English have the same word order. In contrast to English, however, the surface word order in Mandarin does not dictate the semantic interpretation. Although negation takes scope over disjunction in the surface syntax in (44), disjunction takes scope over negation in the semantic interpretation.

Inspired by a suggestion by Szabolcsi (2002), Goro (2004a, b) proposed that disjunction words across languages are governed by a lexical parameter, called the Disjunction Parameter (Goro, 2004a, b; Crain, Goro, and Minai, 2007; Crain, Goro, and Thornton, 2006; Zhou and Crain, 2009). The parameter has two values. These values yield different scope assignments in negative sentences: (NEG > OR) versus (OR > NEG). For adult Mandarin speakers, disjunction is interpreted as taking scope over local negation, as witnessed in (44). This is the positive value of the Disjunction Parameter, [+P]. By contrast, disjunction words have the negative value [-P] in many other languages, including English. In these languages, disjunction phrases are interpreted in their surface position (in situ) in negative sentences.

The different values of the Disjunction Parameter stand in a subset/superset relation. On the [-P] value of the parameter, negative sentences with disjunction are true in just one circumstance, where both disjuncts are excluded: NOT A and NOT B. The [+P] value makes such sentences true in a broader range of circumstances. In [+P] languages, negative sentences with disjunction are true when both disjuncts are false, but they are also true when only one disjunct is false. Based on this subset/superset relationship, Goro (2004a, b) reasoned that children acquiring all languages would initially adopt the [-P] value of the Disjunction Parameter. A child who initially selected the [+P] value would confront a learnability problem acquiring languages like English. Unlike adults, this child would accept negative sentences with disjunction when only one disjunct was false. In the absence of negative evidence, it is difficult to see how such a child could home in on just the one circumstance that makes negated disjunctions true for adult English speakers: NOT A and NOT B. This problem is avoided if children acquiring all languages initially select the more restrictive, subset value of the parameter, [-P]. As an empirical consequence, Mandarin-speaking children were predicted to interpret negated disjunctions in the same way as English-speaking children and adults. That is, Mandarin-speaking children were expected to initially take negation to have scope over disjunction (NEG > OR), just as English-speaking children and adults do. The empirical predictions tied to the Disjunction Parameter have been pursued in over ten languages to date, and a variant of the Truth Value Judgment task was developed explicitly for the purpose of investigating children's interpretation of disjunction in negative sentences. The prediction that all children initially adopt the [-P] value of the Disjunction Parameter has been confirmed, at least so far (e.g., Crain, 2012; Geçkin, Thornton, and Crain, 2018; Pagliarini, Crain, and Guasti, 2018; Zhou and Crain, 2009).
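Goro's learnability argument turns on the claim that the [-P] truth conditions are a proper subset of the [+P] truth conditions. A minimal sketch of that comparison, assuming a two-proposition toy case and hypothetical function names (our illustration, not Goro's formal system), is given below.

```python
from itertools import product

# Truth conditions of "Ali doesn't speak Spanish or French" under the two parameter values.
# A = Ali speaks Spanish, B = Ali speaks French.

def minus_p(a, b):
    """[-P] (English-type): negation scopes over disjunction, NEG > OR."""
    return not (a or b)

def plus_p(a, b):
    """[+P] (Mandarin-type): disjunction scopes over negation, OR > NEG."""
    return (not a) or (not b)

circumstances = list(product([True, False], repeat=2))
minus_true = {c for c in circumstances if minus_p(*c)}
plus_true = {c for c in circumstances if plus_p(*c)}

print(minus_true)              # only (False, False): NOT A and NOT B
print(plus_true)               # every circumstance in which at least one disjunct is false
print(minus_true < plus_true)  # True: the [-P] truth conditions are a proper subset of the [+P] ones
```

Because every circumstance that verifies the [-P] reading also verifies the [+P] reading, positive evidence can move a learner from [-P] to [+P], but no positive evidence could retract an overly permissive initial [+P] setting, which is the reasoning summarized above.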
19.7 Negation

There are two types of negation in natural languages. One type is adverbial negation (Haegeman, 1995; Zanuttini, 2001). Adverbial negation does not require a negation phrase in the hierarchical structure; negative adverbs such as never are just adverbs, like English 'always' and 'usually.' The second kind of negation (e.g., English not) requires the addition of a special phrasal projection called the Negation Phrase (NegP). English has both types of negation. In affirmative sentences such as (46), the main verb carries tense. In this example, tense is contributed by the third person 's' morpheme, which agrees in number with the subject noun phrase, Susie. Example (47) illustrates adverbial negation. Notice that the form of the main verb does not change in sentences with a negative adverb (e.g., never).

(46) Susie eats broccoli.

(47) Susie never eats broccoli.
There are several syntactic features of negative sentences with the head form of negation. The main features are illustrated in examples (48)–(50). In contrast to adverbial negation, an inflected main verb with the third person 's' (e.g., eats) is not acceptable in negative sentences with not as the head. This is illustrated in (48). Negative sentences with not can be rescued, however, by inserting the auxiliary verb do, as in (49). The auxiliary verb do carries the third person 's' morpheme, so the main verb is not inflected. The main verb also remains uninflected when the contracted (clitic) form of negation, n't, is the head of the negation phrase. The auxiliary verb do not only carries the third person 's' morpheme in (50), it also hosts the contracted negation, n't. Example (50) also shows that the contracted negative head, n't, can be an affix supported by the inflected auxiliary verb does. This means that the negative auxiliary verb doesn't is decomposed into three parts: do + s + n't.

(48) *Susie not eats broccoli. (cf. Susie never eats broccoli)

(49) Susie does not eat broccoli.

(50) Susie doesn't eat broccoli.
Zeijlstra (2004) proposed that adverbial negation is simpler than the head form of negation, because the negative head requires an additional phrasal projection, NegP, beyond the phrasal structure needed for adverbial negation. Therefore, child language learners who initially hypothesized a NegP for adverbial negation would be building structure that is not motivated by their linguistic input. Assuming that children do not build “extra” structure, they are expected to initially analyze all negative expressions as
adverbs, rather than as heads. In addition to its inherent structural complexity, the head form of negation also requires children to deal with an idiosyncratic feature of auxiliary verbs in English, namely do-support. The next section reviews the acquisition of these features of the grammar of negation by children acquiring English.
19.7.1 Negation in child language

The first detailed study of the acquisition of negation by children acquiring standard English was reported in Bellugi (1967). The Bellugi study examined the transcripts of the spontaneous speech of three children, who have come to be known as the Harvard children (Brown, 1973). Based on the transcripts of these children's spontaneous speech, Bellugi (1967) distinguished three stages of negation. At the first stage, negation was said to be primitive. Negation was instantiated by the use of not (and, to a lesser extent, no). The negation marker appeared at either the beginning or the end of what Bellugi termed the nucleus of the utterance, which could be a word, a phrase, or possibly even a sentence. At Bellugi's Stage 2, negation appeared sentence-internally. As in Stage 1, children at Stage 2 continued to use the negation marker not (and no). Bellugi also reported that Stage 2 children produced two negative auxiliary verbs, don't and can't. Because children at this stage lacked productive use of the corresponding affirmative auxiliaries do and can (and other auxiliary verbs), Bellugi argued that the negative auxiliaries don't and can't were analyzed by children as fixed forms, like the negative adverb never (but see Schütze (2010) for an alternative view). Finally, at Stage 3, children gained productive use of both affirmative and negative auxiliary verbs. That is, Stage 3 children exhibited productive adult-like sentential negation.

There has been confirming evidence that not and the negative expressions can't and don't are initially analyzed as negative adverbs in children's grammars. This evidence was reported in a series of studies by Thornton and her colleagues. One was a longitudinal study of four 2-year-old children. The findings are reported in Thornton and Tesan (2007, 2013). The longitudinal study incorporated an elicited production task that successfully elicited negative sentences with the third person 's' morpheme. The 2-year-old child participants produced 497 negative sentences that contained both a third person subject noun phrase and a main verb. The main verb was inflected in 99 of children's 497 negative sentences (20%). The majority of children's negative sentences contained the negative marker not (e.g., Minnie Mouse not fits), but some contained don't (e.g., Minnie Mouse don't fits). If negation is adverbial in young children's grammars, then children would be expected to produce negative sentences such as It not fits and It don't fits. Notice that these non-adult negative sentences fit the same mold as adult-like negative sentences with the adverb never; for example, It never fits. Because children at Stage 2 sometimes omit tense, they are also expected to produce sentences that combine adverbial negation with an uninflected verb, such as It not fit and It don't fit. In order to converge on the adult grammar of standard English (Stage 3), children need to discover that n't is a head form of negation. Thornton and her colleagues hypothesized that the
negative auxiliary verb doesn't triggers children's transition from Stage 2 to Stage 3, revealing the fact that n't is a head form of negation. If doesn't serves to trigger children's transition to Stage 3, children's non-adult forms of negation should disappear once they start producing negative sentences with doesn't. To evaluate this prediction, Thornton and Rombough (2015) elicited negative sentences from 25 2- and 3-year-old children. Then, the child participants were divided into groups, depending on whether or not they produced adult-like negative sentences with the auxiliary verb doesn't. The children (n = 12) who produced at least five instances of doesn't were called the Advanced group, and the remaining children (n = 13) were called the Less Advanced group. The between-group findings were exactly as predicted. The children in the Less Advanced group produced a total of four utterances with doesn't. These children produced 89 non-adult negative sentences with an inflected main verb (e.g., It not/don't fits). By contrast, the children in the Advanced group produced 228 adult-like negative sentences with doesn't, and they produced only five non-adult negative sentences with an inflected main verb. Taken together, the findings from both the cross-sectional study and from the longitudinal study offer compelling evidence that the negative auxiliary doesn't is critical for English-speaking children's convergence on the adult grammar of negation.
19.7.2 Negative concord and double negation in child language

Across languages, adult speakers assign different interpretations to negative sentences with two markers of negation, such as (51).

(51) John didn't eat nothing before running the marathon.
     a. John didn't eat anything before running the marathon. (NC)
     b. John ate something before running the marathon. (DN)

Speakers of some dialects of English assign a negative concord (NC) interpretation to sentences like (51). In these dialects, two negation markers yield a single semantic negation, so (51) can be paraphrased using the negative polarity item anything, as in (51a).5 Speakers of other dialects assign a double negation (DN) interpretation to sentences like (51). In these dialects, the two negation markers cancel each other out, yielding an affirmative meaning, as indicated in (51b). Speakers of a DN dialect can access the NC interpretation of sentences like (51), although they do not produce these sentences.

5 These alternative interpretations arise when sentential negation is combined with a second negation marker that has been drawn from a particular set of negative expressions, called n-words (Laka, 1990; Giannakidou, 2005). English n-words include nobody, nothing, and nowhere.
This could be due, in part, to the fact that they have abundant exposure to speakers with NC dialects in the media (e.g., I can't get no satisfaction). An alternative possibility is that English is inherently an NC language (e.g., Blanchette, 2013; Tubau, 2008; Zeijlstra, 2004).6 Researchers in child language have searched for evidence, one way or the other, showing whether children acquiring English initially take it to be an NC language. On the one hand, Bellugi (1967) reports that she did not find a single negative concord sentence in the transcripts of the three children she studied. However, the child participants in a study by Coles-White (2004) showed a marked preference for the NC interpretation of sentences that were potentially ambiguous between an NC interpretation and a DN interpretation.7 A study by Thornton, Notley, Moscato, and Crain (2016) investigated the possibility that preschool English-speaking children assign NC interpretations to potentially ambiguous sentences. The test sentences contained two negative markers, sentential negation followed by nothing. An example is (52).
The girl who skipped didn’t buy nothing. a. The girl who skipped bought something. b. The girl who skipped bought nothing.
(DN Interpretation) (NC Interpretation)
The experimental contexts lent contextual support to both interpretations. If children acquiring English initially adopt a grammar that generates NC interpretations, they should assign the interpretation in (52b). However, if children initially posit grammars that do not permit NC, they should assign DN interpretations, as in (52a). The study also included control sentences with two negation markers, such as (53).

(53) The girl who didn't skip bought nothing.

The controls were designed to see whether or not the child participants experience difficulty in processing sentences with two negations, but ones that do not license either NC or DN interpretations. In the control sentences, one negative marker appeared in the main clause, and the other appeared inside the relative clause, as in (53). In the test sentences, both negative markers appeared in the main clause, as in (52). Twenty-four English-speaking children participated in the experiment. The children ranged in age from 3;6 to 5;8. A control group of adult participants produced DN responses to the test sentences 82% of the time, whereas the child participants produced DN responses only 25% of the time. Both children and adults showed a similar pattern of responses to the control sentences; children responded correctly to the control sentences 84% of the time, and the adult control group responded correctly 94% of the time. The findings support the proposal that English is inherently an NC language.

6 Another possibility is that speakers of double negation dialects avoid producing negative concord sentences because they attach some kind of social stigma to such sentences (cf. Nevalainen, 2006; Horn, 2010).
7 It should be noted that comprehension precedes production in most aspects of child language, just as recognition is typically superior to recall for adults in the vast majority of psychological tasks.
19.8 Conclusion

This chapter reviewed the findings of research investigating the meanings that young children initially assign to logical expressions (see Hacquard, this volume). We compared the meanings initially assigned by children with those assigned by adults, and with the meanings assigned to the corresponding vocabulary of classical logic. Although the acquisition of the distributional properties and the meanings of logical expressions is a formidable task, the findings of recent research indicate that children converge on a grammar that is equivalent to that of adults in their preschool years. Moreover, research findings suggest that there is a strong overlap between the interpretations that children assign to logical words and the meanings that are assigned to the corresponding expressions in classical logic.
Part IIC

INTERFACES AND BOUNDARIES
Chapter 20
Contributions of pragmatics to word learning and interpretation

Myrto Grigoroglou and Anna Papafragou
20.1 Introduction

A key feature of human language lies in the fact that the overall meaning of an utterance typically goes beyond the literal meaning of the individual words and the way the words combine within that utterance. Even a simple sentence such as "It's too high" contains referential ambiguity ("it"), underspecification ("too high" for what?), and may convey different things depending on context (e.g., it can be an informational statement or an indirect request for help). Within linguistic theory, this property of natural language is captured by the distinction between semantic, literal meaning and pragmatic, contextually derived meaning (see also Hacquard, this volume; Schwarz and Zehr, this volume). The underlying assumption behind this distinction is that some aspects of meaning are encoded in our mental lexicon or the compositional rules that allow the contents of the lexicon to be combined (i.e., semantics), while other aspects of meaning are derived contextually through inferential reasoning processes (i.e., pragmatics). Many theoretical accounts of the semantics-pragmatics interface treat pragmatic interpretation as a species of intention recognition, since hearers try to reconstruct what the speaker meant by uttering a sentence. For instance, on Grice's theory of pragmatics (Grice, 1957, 1975), hearers expect speakers to be cooperative actors who follow rational conversational rules. Cooperative speakers are expected to make conversational contributions that are truthful (Maxim of Quality), relevant (Maxim of Relation), clear (Maxim of Manner), and as informative as required by the purpose of the conversational exchange (Maxim of Quantity). Deviations from these communicative principles
give hearers reasons to think that the speaker intended to convey meaning beyond the conventional meaning of the words in the utterance, thus giving rise to pragmatic inference (e.g., in the example "It's too high," the Maxim of Relation would yield different interpretations of the utterance depending on whether the speaker and hearer had been discussing the price or the location of a toy). Subsequent theoretical accounts reinterpreted and modified aspects of the Gricean framework (Carston, 1995; Chierchia, Fox, and Spector, 2009; Gazdar, 1979; Horn, 1972; Levinson, 2000; Noveck and Sperber, 2007; Sauerland, 2004; Sperber and Wilson, 1986/1995; Van Rooij and Schulz, 2004) but preserved the foundational idea that interpreting utterances (and other non-linguistic communicative acts) relies on the ability to "read the mind" of others and understand their intentions and beliefs.

From a psychological perspective, these accounts suggest that pragmatic reasoning relies on a form of theory of mind, the capacity to recognize that others have belief states that differ from one's own (Baron-Cohen, Leslie, and Frith, 1985; Wimmer and Perner, 1983). Furthermore, tracking the belief states of others during communication requires people to identify what information is shared (or not) with their conversational partners, a psychological construct referred to as common ground (Clark and Marshall, 1981; Stalnaker, 1970). Experimental evidence to date confirms the idea that adult comprehenders flexibly integrate information about the speaker's beliefs in interpreting language in context (Bergen and Grodner, 2012; Breheny, Ferguson, and Katsos, 2013; Brown-Schmidt, Gunlogson, and Tanenhaus, 2008; Fairchild and Papafragou, 2018; Heller, Grodner, and Tanenhaus, 2008; Nadig and Sedivy, 2002; Tanenhaus, Spivey, Eberhard, and Sedivy, 1995).

For the child learner, as for more mature communicators, pragmatic mechanisms along the broad lines described by Grice can be important for bridging the gap between what words and sentences mean and what the speaker intended to communicate by uttering them in a specific context. In addition, since the meanings of words and sentences themselves are initially inaccessible to the young child, pragmatic mechanisms of intention recognition could be used to discover word meaning. For example, by assuming that the speaker is being cooperative, the child can conclude that a novel label uttered by a speaker is relevant to the present exchange, informative, unambiguous, and truthful. Within the developmental literature, however, the nature and extent of children's ability to engage in rich pragmatic reasoning when interpreting the meaning of (new) words and sentences have traditionally been topics of considerable debate. In the case of word learning, general (associationist) learning accounts propose that children learn words by associating sounds (words) and perceptual stimuli (objects in a scene) without necessarily or constantly engaging in social-pragmatic considerations (e.g., Locke, 1690/1964; Piaget, 1952; Smith, 2000; Vygotsky, 1978; Werker, Cohen, Lloyd, Casasola, and Stager, 1998). A different research tradition takes the view that children learn new words through pragmatic inference by actively consulting the speaker's mind and trying to figure out what was meant (e.g., Baldwin, 1991, 1993; Bloom, 2000; E. Clark, 2007; Diesendruck and Markson, 2001).
Similarly, in the case of language interpretation, some researchers propose that children have limited ability to use pragmatic
computations to derive intended but "unsaid" aspects of meaning (and therefore appear egocentric or literal; Epley, Morewedge, and Keysar, 2004; Piaget, 1952), while others attribute much greater pragmatic sophistication to young learners (Clark and Amaral, 2010; Tomasello, 2000; cf. Grigoroglou and Papafragou, 2017). In this chapter, we review the available empirical evidence to evaluate the role of pragmatics in how children acquire and contextually interpret the words in their language. To advance the state of the art, we present and synthesize currently disparate sets of experimental findings across a variety of pragmatic tasks and phenomena. We organize the chapter in terms of two major themes. In the first half of the chapter, we assess the extent to which young children use pragmatic mechanisms to build a mental lexicon (i.e., to learn new words). In the second half, we discuss the extent to which children use pragmatic inference to employ their mental lexicon (i.e., to interpret known words). Although these two areas of research have developed largely independently from one another and have produced findings that often appear contradictory, we take an integrative approach that highlights the commonalities of the mechanisms underlying children's pragmatic reasoning across these two domains despite inherent differences across phenomena of varying cognitive and linguistic complexity. To preview our conclusions, we find evidence for rich and massive effects of pragmatic reasoning in both domains. This evidence suggests continuity in pragmatic reasoning, whereby foundational aspects of the rich pragmatic system at work in adults are already in place in early stages of language development.
20.2 Building a lexicon: pragmatics in word learning

A fundamental question in the study of language is how young children acquire the meaning of words. Since children are not born knowing the meaning of individual words in their language, vocabulary acquisition is, at least in part, environment driven. However, the exact properties of the environment that allow children to form mappings between words and their referents have puzzled thinkers for centuries (see Bloom, 2000; Gleitman and Trueswell, this volume, for discussion).1 As alluded to already, two prominent views have been developed to explain word learning. On one view, word learning is considered an associative process between perceptual stimuli (Locke, 1690/1964; Piaget, 1952; Smith, 2000; Sloutsky, Yim, Yao, and Dennis, 2017; Vygotsky, 1978; Werker et al., 1998). For example, children growing up in an English-speaking environment learn the word "table" by associating the sound segment [ˈteɪbəl] with the piece of furniture immediately available to them in the environment when the label is uttered. This view has been attractive because it is parsimonious: if word learning depends on the contingency between an auditory stimulus (i.e., a label) and the visual stimulus that happens to be in focus of the child's attention, it is no different than other types of (non-linguistic) learning that rely on simple cognitive mechanisms of attention and memory. On an alternative, pragmatic view, children learn new words through pragmatic inference by actively trying to figure out the speaker's referential intention (e.g., Baldwin, 1991, 1993; Bloom, 2000; E. Clark, 2007; Diesendruck and Markson, 2001). For instance, children learn the word "table" because they understand that the speaker, by using that particular label, intended to refer to a specific piece of furniture and no other. Although both the associationist and the pragmatic views assume that word learning happens in cases of referential clarity, where uttering a label coincides with the child's attention on a particular object, they differ in whether this coincidence is incidental or the result of interpreting the speaker's intention. Specifically, on the pragmatic view, early word learning seems to be guided by children's ability to pick up on several observable social cues in the environment that make speakers' intentions recoverable. These include information provided by the speaker herself (e.g., body orientation, direction of gaze, touching, pointing, etc.), as well as information provided by the communicative context, as experienced by the child in conjunction with the speaker (e.g., information shared in physical compresence or in prior engagement with the speaker). In the next sections, we review multiple strands of empirical evidence to adjudicate between these two perspectives.

1 Here, we focus on the "easy" case of learning words for concrete objects. The problem becomes even more complex once we consider words that refer to entities or events that are not directly observable in the physical world (e.g., abstract nouns, mental state verbs, logical connectives, etc.; cf. Crain, this volume; Gleitman and Trueswell, this volume).
20.2.1 Pragmatic sensitivity in early word learning

Several pieces of evidence support the view that, even at its early stages, word learning is not merely a process of associating a sequence of sounds (a word) to a perceptual stimulus (an entity in a scene) but rather a process of reconstructing a speaker's communicative intention. A first piece of evidence comes from work demonstrating that, from a very young age, children treat signals that are presumed to have communicative value as privileged for learning compared to other, non-communicative signals. For instance, 6-month-old infants are more likely to treat language-like sounds (i.e., novel words) as conveying information about a speaker's intentions compared to non-communicative sounds (e.g., a cough; Vouloumanos, Martin, and Onishi, 2014; see also Martin, Onishi, and Vouloumanos, 2012; Vouloumanos, Onishi, and Pogue, 2012). A second piece of evidence comes from the fact that infants can assign reference to absent entities: already at 12 months children resolve ambiguous requests (e.g., "Can you give it to me?") by assuming that the speaker refers to an object that was witnessed in a previous interaction with them but is absent at the time of the request (Ganea, 2005; Ganea and Saylor, 2007; Osina, Saylor, and Ganea, 2017; see also Bohn, Zimmermann, Call, and Tomasello,
2018). Thus, even at very early stages of development, children do not make superficial associations between the sound of a word and a perceptual stimulus in their environment but actively consider a speaker's communicative intention when trying to interpret what a speaker says. Perhaps the most direct piece of evidence against associationist accounts comes from the observation that children monitor the speaker's direction of gaze and actively use gaze information in word learning situations (e.g., Baldwin, 1991, 1993; Baldwin, Markman, Bill, Desjardins, Irwin, and Tidball, 1996; Bloom, 2000; Koenig and Echols, 2003). This ability arises at around 12 months of age and continues to develop during the second year of life (Pruden, Hirsh-Pasek, Michnick Golinkoff, and Hennon, 2006; Vaish, Demir, and Baldwin, 2011; Yurovsky and Frank, 2017). In a classic demonstration (Baldwin, 1991), 18- to 19-month-old children were given a novel toy to play with while another toy was placed inside a bucket in front of the experimenter. The experimenter then provided a label while looking at either the toy that the child was manipulating or the toy inside the bucket. If word learning relied on simple associations between a string of sounds and an object that happened to be in the child's focus of attention, children should have associated the experimenter's label with the toy that they happened to be playing with at the moment of naming, regardless of where the experimenter was looking. Children, however, made the label-to-toy mapping only when the experimenter was looking at their own toy but not when she was looking at the toy in the bucket. Finally, longitudinal studies of child-caregiver dyads underscore the importance of tracking the speaker's direction of gaze and engaging in joint attention for vocabulary development: individual differences in children's ability to engage in joint attention before their first birthday predict their vocabulary size later in development (Brooks and Meltzoff, 2005; Carpenter, Nagell, Tomasello, Butterworth, and Moore, 1998; Tomasello and Farrar, 1986; see also Bottema-Beutel, 2016, for a meta-analysis). Importantly, however, the mere presence of joint attention does not guarantee successful word learning. Analyses of naturalistic infant-parent interactions demonstrate that it is the (very fine-grained) timing of the presentation of all the relevant information that crucially restricts learners' hypotheses about potential meanings of a given label (Trueswell, Lin, Armstrong, Cartmill, Goldin-Meadow, and Gleitman, 2016; see also Gleitman and Trueswell, this volume, for discussion). Overall, the present evidence suggests that social-pragmatic cues such as joint attention are powerful learning mechanisms for early word learning (on the availability of such cues in moments of actual word learning "in the wild", see Trueswell et al., 2016).
20.2.2 Using common ground in word learning

In addition to the mechanisms outlined above, children around their second year begin using more sophisticated sources of information about a speaker's naming intentions. These sources include information shared in common ground with a speaker. Common
ground with an interlocutor can be established with various types of experiences and it can include perceptual information shared by interlocutors who are physically co-present in the same environment, linguistic information shared by interlocutors engaged in the same conversation, or general information shared by members of the same community (Clark and Marshall, 1981). Establishing common ground with a communicative partner is essential for word learning. For instance, in one study, 15- to 20-month-old children did not learn new words when the labeling came from a disembodied voice outside the testing room: in the absence of a visible, physically co-present speaker, children did not consider this as an act of naming (Baldwin et al., 1996). Other evidence shows that, during word learning, children consult information shared with a speaker in prior discourse. In one demonstration, Akhtar, Carpenter, and Tomasello (1996) showed that 24-month-old children used discourse novelty to infer the meaning of a novel word. In this study, children and adults played with three novel objects (which remained unnamed during the playing session). These objects were later placed inside a box, along with a fourth, new (novel) object. The adults displayed excitement about the contents of the box and provided a novel label. Children assigned the label to the new object (rather than the older ones), thus assuming that the new label referred to the newly introduced object. Interestingly, children formed this mapping even when the object was new only to the speaker but not to themselves (see also Horowitz and Frank, 2015; Sullivan, Boucher, Kiefer, Williams, and Barner, 2019). Later work challenged the role of social-pragmatic reasoning in this context: 18- to 28-month-olds drew the same conclusions even when novelty did not arise from common ground with the speaker (i.e., from remembering that an object was new for the speaker) but from a contrast in the perceptual context (i.e., from noticing that the new object was introduced in a perceptually more salient or distinct setting compared to the older objects; Samuelson and Smith, 1998). In response to this challenge, further studies found that the effect of perceptual context was subject to social cues: children did not learn labels for objects in perceptually distinct settings unless they considered the setting to be intentional rather than accidental (Diesendruck, Markson, Akhtar, and Reudor, 2004). Thus social-intentional considerations built over interactions with a partner affect word learning (on the development of this ability, see also MacPherson and Moore, 2010; Tomasello and Haberl, 2003). Finally, children use contrastive inferences (or mutual exclusivity) to narrow down word meanings in ways that might invoke common ground. Specifically, children as young as 12 months tend to assume that a novel word refers to an object for which they do not have a name (Golinkoff, Hirsh-Pasek, Bailey, and Wenger, 1992; Graham, Poulin-Dubois, and Baker, 1998; Halberda, 2006; Markman, Wasow, and Hansen, 2003; Merriman and Bowman, 1989). For instance, if a child is presented with two objects, one familiar (e.g., a book) and one unfamiliar (e.g., a kaleidoscope), and a speaker uses a novel label to refer to one of the two objects (e.g., "Do you want to check out the kaleidoscope?"), the child infers that the speaker is referring to the unfamiliar object (i.e., the kaleidoscope).
A prominent account of mutual exclusivity links it to a pragmatic mechanism guided by Gricean considerations of what the speaker said compared to what the
speaker could have said but did not (see Clark, 1990). Early support for this pragmatic view came from evidence that preschoolers suspend mutual exclusivity inferences when labels are offered by non-native or unreliable speakers, presumably because they tie such inferences to the profile of the speaker (Diesendruck, 2005; Diesendruck, Carmel, and Markson, 2010). However, this conclusion was later challenged by the observation that mutual exclusivity is not affected by the speaker's context-specific knowledge: across different studies, children (and adults) drew mutual exclusivity inferences irrespective of whether the speaker was present or not during the introduction of novel labels (Diesendruck and Markson, 2001; Srinivasan, Foushee, Bartnof, and Barner, 2019; see also de Marchena, Eigsti, Worek, Ono, and Snedeker, 2011). Although these two findings appear to contradict each other, they can be reconciled under the assumption that mutual exclusivity for words may be a pragmatic process, albeit one that does not require highly specific considerations of other people's mental states (as in classic common ground adaptations) but simply relies on generic considerations of what is conventional within a linguistic community (a different, broader type of common ground; cf. also de Marchena et al., 2011). This conclusion is in accordance with the position that pragmatic computations of common ground operate at different levels of specificity and echoes findings from other domains showing that children adapt differently to varieties of common ground in production (see Grigoroglou and Papafragou, 2019; Moll and Kadipasaoglu, 2013).
20.2.3 Word learning and speaker belief

A particularly strong test of social-pragmatic accounts of word learning comes from studies asking whether children learn words by considering the speaker's knowledge when such knowledge contradicts their own. In this line of research, children are asked to learn new words from speakers with a false belief in tasks similar to the ones used in the developmental literature to study children's theory of mind in non-communicative situations (e.g., see Baron-Cohen, Leslie, and Frith, 1985; Buttelmann, Carpenter, and Tomasello, 2009; Hamlin, Ullman, Tenenbaum, Goodman, and Baker, 2013; Király, Oláh, Csibra, and Kovács, 2018; Onishi and Baillargeon, 2005; Wimmer and Perner, 1983). In one study (Southgate, Chevallier, and Csibra, 2010), 17-month-olds saw an experimenter place two novel objects in two boxes and leave the scene. While the first experimenter was away, a second experimenter entered the scene and switched the location of the objects. Then the first experimenter returned, pointed to one of the boxes and offered a label for its content (e.g., "There's a sefo in the box"). The experimenter then opened both boxes without looking inside, and asked children to retrieve the named object (e.g., "Can you get the sefo for me?"). The majority of children searched inside the box that the experimenter had not pointed at, thus demonstrating an understanding of the speaker's referential intention despite her false belief. Despite these early successes, other studies demonstrate failures on versions of this basic paradigm with children younger than 5 years old (Carpenter, Call, and Tomasello,
2002; Houston-Price, Goddard, Séclier et al., 2011). For instance, in one study (Papafragou, Fairchild, Cohen, and Friedberg, 2017), 3-, 4-, and 5-year-old children watched one character place a novel object inside a box and then leave the scene. While the first character was away, a second character entered the scene and replaced the object in the box with a different novel object. Then the first character returned and named the content of the box using a novel label (e.g., "blicket"). Children were then presented with the two objects and were asked to identify the object that the label applied to ("Which one is the blicket?"). Only 5-year-olds reliably passed the task. Furthermore, children's performance closely matched performance in an equivalent false belief task that did not involve word learning (cf. Carpenter et al., 2002; pace Happé and Loth, 2002). At present, there is considerable discussion about the interpretation of these and similar results (e.g., see Kulke, Johannsen, and Rakoczy, 2019; Poulin-Dubois and Yott, 2017; Powell, Hobbs, Bardis, Carey, and Saxe, 2018; Rubio-Fernandez and Geurts, 2013; Southgate, 2019, for discussions about infants' and preschoolers' performance in non-verbal theory of mind tasks). Notice that the kinds of studies reviewed above, even though similar, are not methodologically identical: for instance, infants in Southgate et al. (2010) were asked to actively participate in the task and provide an action-based response to the experimenter's use of a new label (i.e., by helping the speaker find the object she had in mind as part of the task), but in Papafragou et al. (2017) children were asked to observe interactions between different characters and provide an explicit judgment of which object the label picked out (i.e., by pointing to the object they thought the speaker had in mind in a subsequent experimental phase). It is possible, therefore, that the discrepancy in the study findings connects to a broader pattern in the literature whereby infants have been shown to have an "implicit" awareness of others' false beliefs (e.g., Buttelmann et al., 2009; Knudsen and Liszkowski, 2012; Hamlin, Ullman, Tenenbaum, Goodman, and Baker, 2013; Onishi and Baillargeon, 2005; Southgate, Senju, and Csibra, 2007; Surian, Caldi, and Sperber, 2007; Träuble, Marinović, and Pauen, 2010), but preschoolers before the age of 5 often fail in false belief tasks where an experimenter asks for "explicit" (verbal) responses about others' mental states (e.g., Bartsch and Wellman, 1995; Perner, Lang, and Kloo, 2002; Wimmer and Perner, 1983; Wellman, Cross, and Watson, 2001). Alternatively, given task differences in infant and preschooler studies, it is possible that infants' successes do not reflect a rich understanding of the content of others' belief states but a (cognitively simpler) awareness of what events were co-experienced (or not) with other people, which could suffice for a wide range of infant theory of mind tasks (see Powell et al., 2018; Southgate, 2019, for discussion). For instance, it seems that theory of mind successes with infants replicate in tasks where an agent maintains contact with an object or event but fail to replicate in tasks where this contact is interrupted (e.g., Surian et al., 2007; Sodian and Thoermer, 2008; Yott and Poulin-Dubois, 2016; see Powell et al., 2018, for discussion).
Even though this topic remains unsettled, a prominent explanation for the pattern of results in the literature attributes the older children’s difficulty in false belief tasks to the implementation of the ability to track mental states within specific tasks rather than
to an inability to consult false beliefs tout court (e.g., Scott and Baillargeon, 2017; Rubio-Fernandez and Geurts, 2013, among others). Within the context of word learning, two additional lines of research support this conclusion. First, in a study investigating the acquisition of mental state verbs such as think and believe, children younger than 5 tracked others' false belief states and used them to constrain the meaning of unknown verbs (Papafragou, Cassidy, and Gleitman, 2007). Specifically, 4-year-olds (and adults) were more likely to correctly guess that a novel verb in a sentence (e.g., "Matt gorps that his grandmother is under the covers") referred to an agent's mental state when the agent held a false, as opposed to a true, belief (e.g., when someone other than Matt's grandmother was hiding under the covers). Papafragou et al. (2007) concluded that false beliefs are more noteworthy or pragmatically salient compared to true beliefs and therefore more likely to promote mental verb conjectures (cf. also Hacquard and Lidz, 2018). For present purposes, the very sensitivity to false beliefs in the context of word learning is an important finding. Second, independent of false belief contexts, young children are known to consult a broad set of speaker properties in deciding what a novel word means: 3- to 4-year-olds suspend word learning when interacting with speakers who express uncertainty about the referent of a novel label, overtly display ignorance of familiar labels or are generally unreliable (e.g., Birch, Vauthier, and Bloom, 2008; Jaswal and Neely, 2006; Sabbagh and Baldwin, 2001; Koenig and Harris, 2005; Koenig and Woodward, 2010; Sabbagh and Shafman, 2009; Scofield and Behrend, 2008; Sobel and Corriveau, 2010). In these tasks—unlike classic false belief contexts—it is the conversational history of the speaker that offers cues to their mental state. These results naturally fit within a social-pragmatic framework of word learning but are harder to reconcile with an associationist account of how word forms are linked to word meanings.
20.2.4 Pragmatic and discourse principles in word learning

Finally, there is evidence for the role of general pragmatic and discourse mechanisms during word learning. One powerful such mechanism is the assumption that speakers are following Gricean maxims—for instance, they want to be informative in context. In one study (Frank and Goodman, 2014), 3- and 4-year-old children and adults were presented with a task where they had to identify the referent of a novel label (e.g., "a dinosaur with a dax") by relying only on context. Context in this task involved two identical characters (e.g., dinosaurs): both characters had one feature in common (e.g., a bandana), but one of them had an additional feature (e.g., a headpiece). By assuming that the speaker was informative, both young learners and adults inferred that the novel label referred to the unique feature (i.e., the headpiece). Later work demonstrated that such general expectations of informativeness can be combined with other, social sources
of information (i.e., common ground shared with a speaker) to guide inferences about the meaning of words (Bohn, Tessler, and Frank, 2019). Inferences about the meaning of novel words can also be drawn from general assumptions about discourse coherence. Just as the meaning of words can be inferred from their syntactic environment (e.g., Brown, 1957; Gillette, Gleitman, Gleitman, and Lederer, 1998; Landau and Gleitman, 1985), discourse structure can also be informative for word learning. For instance, in a recent study (Sullivan et al., 2019), 2- to 6-year-olds and adults were asked to guess the meaning of novel words when these were presented within temporal and causal discourse structures (e.g., "One animal handed the baby to the other animal [and/because] the baby started crying in the talfa's arms"). It was found that children (4-year-olds and older) and adults successfully identified the target referent (giver/receiver) for the novel labels (see also Horowitz and Frank, 2015; Sullivan and Barner, 2016, for similar findings). Taken together, the studies reviewed so far suggest that young children draw inferences about the possible meaning of a word in context based on a variety of social-pragmatic—as opposed to simply associationist—cues (e.g., a speaker's direction of gaze, presence or absence during an event, prior conversational exchanges, and history). Furthermore, even in the absence of such cues, children can use general pragmatic principles and discourse properties to delimit the set of potential meanings for a newly encountered word.
20.3 Using the lexicon: pragmatics in early language comprehension

Even after children acquire the semantic meaning of a word in their native language, they need to confront the fact that the same word can be understood differently depending on the context. Thus, to become mature communicators, children have to be able to bridge the gap between word meaning and the meaning that a speaker intended to communicate by using a word in a specific conversation (cf. Grice, 1975; Sperber and Wilson, 1986/1995). As in the case of word learning, a crucial question is how children develop this ability and whether they recruit similar mechanisms as adults in understanding what others say. Recall that some researchers propose that children have limited ability to use pragmatic computations to go beyond what the speaker has said (and therefore appear mostly egocentric or otherwise literal; Piaget, 1957; Epley et al., 2004), while other commentators grant children more sophisticated abilities (e.g., Tomasello, 2000; Grigoroglou and Papafragou, 2017, for overviews). Here we assess these proposals, focusing on two pragmatic phenomena that have attracted wide attention in the developmental literature: reference and (quantity) implicature.
20.3.1 Reference comprehension

A foundational aspect of communication involves understanding how different types of expressions are used to pick out objects and other entities in the world (e.g., "the red pen", "the pen", "it"). On classic theoretical models of reference, the interpretation of referential expressions largely depends on expectations of informativeness (Grice's Maxim of Quantity), constrained by assumptions about what information is shared or not with a conversational partner in common ground (Clark and Marshall, 1981). For instance, the utterance "Give me the pen" is easy to interpret if there is a single pen in the scene but confusing if there are multiple pens (unless the hearer can identify based on visual cues, prior discourse, or other common ground knowledge which pen the speaker has in mind). Psycholinguistic evidence confirms that adult comprehenders take into account the information shared in common ground with a speaker when resolving such referential ambiguities (e.g., Heller, Grodner, and Tanenhaus, 2008; Tanenhaus et al., 1995; Nadig and Sedivy, 2002). Of interest is whether children's interpretation of ambiguous referential expressions is guided by speaker-oriented assumptions. Experimental paradigms designed to investigate reference resolution in children typically create a knowledge mismatch between a speaker and the child comprehender by manipulating the objects that a speaker can or cannot see. For example, participants in the task are presented with a box with different compartments, containing different objects. Crucially, the contents of one (or more) of the compartments is visible to only one of the participants. In critical trials, the speaker produces an utterance which is ambiguous from the child's perspective (e.g., "Pick up the duck", when the child can see two ducks) but unambiguous from the speaker's perspective (i.e., the speaker can only see one duck). In such trials, if children take into account the speaker's perspective in interpreting the request, they should reach for the object that is visible to both themselves and the speaker. In eye-tracking experiments, 5- to 6-year-olds (and to some extent 3-year-olds) interpret such ambiguous requests from the perspective of the speaker by quickly looking at and reaching for the mutually visible object (Nadig and Sedivy, 2002; Nilsen and Graham, 2009; see also Morisseau, Davies, and Matthews, 2013). In other circumstances, children's (and adults') ability to rapidly integrate information about the speaker's perspective in reference interpretation is hindered. For instance, given a somewhat more complex visual array of objects, 4- to 12-year-old children and adults initially ignored the perspective of the speaker and looked at the object visible only to themselves (Epley et al., 2004; see also Wang, Ali, Frisson, and Apperly, 2016). Importantly, in this study, although adults eventually recovered from their initial egocentric bias, children largely did not (Epley et al., 2004). In fact, even in an earlier eye-tracking study where children successfully considered the speaker's visual perspective, they were slower to integrate this perspective compared to adults (Nadig and Sedivy, 2002). Other studies demonstrate that children's perspective-taking in reference resolution continues to develop well into the preschool years and adolescence and
only gradually becomes adult-like (Dumontheil, Apperly, and Blakemore, 2010; Wang et al., 2016). Taken together, these findings suggest that one needs to draw a distinction between having an appreciation of others’ mental states and using this ability in communicative situations that require integration of multiple sources of information, including an interlocutor’s perspective, visual context, linguistic complexity, and so forth (Grigoroglou and Papafragou, 2017; Nilsen and Fecica, 2011; Lin, Keysar, and Epley, 2010). On this view, differences in children’s performance across referential communication tasks (and across developmental time) do not necessarily reflect children’s ability to represent others’ perspectives but rather the capacity to integrate perspective information with other types of information (e.g., the child’s own perspective, visual information from context, linguistic complexity of utterances, etc.). In this light, seemingly small differences among tasks may produce significant discrepancies. For instance, children’s overall better performance in Nadig and Sedivy (2002) compared to Epley et al. (2004) can be attributed to the fact that Epley et al. used a more complicated visual array and, within this array, the underspecified descriptions applied better to the referent that was visible only to the child, such that suppressing one’s own perspective was harder than in the Nadig and Sedivy study. Furthermore, in Nadig and Sedivy’s task, children were repeatedly reminded throughout the experiment that the speaker could not see what they did, thereby scaffolding the maintenance of visual perspective information in memory. Echoing earlier discussion in the context of word learning, we conclude that children appreciate the referential perspective of their communicative partner very early, but the implementation of this ability depends on the demands of different tasks. Two additional pieces of evidence support this conclusion. First, in individual children the ability to align with the speaker’s visual perspective when resolving reference correlates with inhibition (i.e., the ability to suppress one stimulus in favor of another; Nilsen and Graham, 2009). Second, in reference resolution tasks where maintaining speaker-specific information does not involve contrasting one’s own visual perspective with the speaker’s, children are highly likely to consult the speaker’s profile. For instance, 3- and 4-year-olds are sensitive to a speaker’s action constraints (e.g., whether she has her hands empty or full) when they interpret referentially ambiguous requests (Collins, Graham, and Chambers, 2012). Relatedly, 4- and 5-year-olds (but not 3-year-olds) use the speaker’s emotional perspective (e.g., sad vs. happy tone of voice) to guide reference resolution (Berman, Chambers, and Graham, 2010; San Juan, Chambers, Berman, Humphry, and Graham, 2017).
20.3.2 Implicature comprehension
Successful communicators routinely calculate components of speaker meaning that are conveyed without being explicitly stated. For example, when hearing the utterance “Some of the butterflies flew away”, one can easily infer that not all of the
butterflies flew away, although this inference was not part of the literal meaning of the utterance. Similarly, the utterance “John’s mother was very sick” invites the inference that John’s father was not. These examples are instantiations of quantity (or scalar) implicatures (Grice, 1975; cf. also Sperber and Wilson, 1986/1995), a type of conversational inference that arises when the speaker violates informativeness by using an informationally weaker expression (some vs. all; mother vs. mother and father) and expects the hearer to understand that the speaker is not in an epistemic position to offer the stronger term (see Horn, 1972, 1984; Hirschberg, 1985; Levinson, 1983, on different definitions of informational strength; see also Schwarz and Zehr, this volume). As in the case of reference, a key requirement for calculating quantity inferences is the ability to access and reason about relevant lexical alternatives (i.e., expressions that the speaker could have uttered to be more informative but did not). A second key requirement is to evaluate how much the speaker knows (more knowledgeable speakers are more likely to use a weaker term to implicate that the stronger term does not hold). In accordance with broad Gricean accounts, adult comprehenders take into account speaker-specific information, including speaker knowledge, alongside informativeness expectations when processing quantity implicatures (Bergen and Grodner, 2012; Breheny, Ferguson, and Katsos, 2013; Fairchild, Mathis, and Papafragou, 2020; Fairchild and Papafragou, 2018). A relatively small set of studies to date has asked whether children integrate speaker knowledge and informativeness in implicature computation (Hochstein, Bale, Fox, and Barner, 2014; Barner, Hochstein, Rubenson, and Bale, 2018; Papafragou, Friedberg, and Cohen, 2018; Kampa and Papafragou, 2020). In one such study (Papafragou et al., 2018), 4- and 5-year-old children watched two videos, where two almost identical agents (“twins”) performed the same action (e.g., color a star). In one video, an observer witnessed the whole event; in the other video, the observer fell asleep halfway through the action and only watched part of the event. At the end, children heard either a strong or a weak statement (e.g., “The girl colored all/some of the star”) and had to attribute it either to the fully knowledgeable or to the partially knowledgeable observer. Results showed that 5-year-olds were able to attribute informationally strong statements to knowledgeable observers and informationally weak statements to partially informed observers, but 4-year-olds could not reliably link the observer’s (i.e., speaker’s) epistemic state to the informational strength of different statements. The 4-year-olds’ performance improved only when the epistemic component was removed (cf. also Barner et al., 2018; Hochstein et al., 2014). A further study using a simple paradigm inspired by the literature on referential communication demonstrated that even 4-year-olds can integrate speaker knowledge in calculating implicatures (Kampa and Papafragou, 2020). In this study, 4-year-olds, 5-year-olds, and adults saw pairs of pictures showing the same person sitting across a table behind a two-compartment box containing the same objects (e.g., a spoon and a bowl).
In one picture, the person in the picture could see the contents of both compartments in her box (e.g., both the spoon and the bowl) but, in the other, she could only see the content of one compartment (e.g., only the spoon). Participants heard either a
strong statement (e.g., “I can see a spoon and a bowl”) or a weak statement (e.g., “I can see a spoon”) and had to choose the box that the speaker was talking about. At age 4, children were highly successful in matching weak statements to the pictures where the person had limited access to the contents of the box (e.g., she could only see the spoon), and, by age 5, they were entirely adult-like. Extensions of this method showed that children could use similar reasoning when the communicator used a drawing of either a single object (a spoon) or two objects (a spoon and a bowl) instead of an utterance to identify the intended box (Kampa and Papafragou, subm.). Thus, children apply expectations of informativeness in combination with speaker knowledge to instances of both linguistic and non-linguistic communication, as predicted by pragmatic accounts (Grice, 1975; Sperber and Wilson, 1986/1995; cf. also Gweon, Pelton, Konopka, and Schulz, 2014). Despite these successes, as decades of earlier developmental research show, children face persistent limitations in deriving quantity inferences in more open-ended contexts. For instance, when presented with logically true but under-informative descriptions of events (e.g., “Some of the horses jumped over the fence,” when all of the horses had jumped over the fence), 5-year-olds are massively more likely than adults to accept such descriptions (Noveck, 2001; Chierchia, Crain, Guasti, Gualmini, and Meroni, 2001; Papafragou and Musolino, 2003; Guasti, Chierchia, Crain et al., 2005; Pouscoulous, Noveck, Politzer, and Bastide, 2007; Huang and Snedeker, 2009; Barner, Brooks, and Bale, 2011; Foppolo, Guasti, and Chierchia, 2012). Children’s difficulties persist even in online processing tasks that do not require explicit felicity judgments (Huang and Snedeker, 2009). Children’s performance improves in judgment tasks that provide training in pragmatic infelicity (Foppolo et al., 2012; Papafragou and Musolino, 2003) or offer more nuanced response options (Katsos and Bishop, 2011). Notice that, unlike the studies on quantity inference and speaker knowledge reviewed earlier (and the referential communication tasks that inspired them), most of the felicity judgment tasks testing comprehension of quantity implicature do not involve conversational exchanges with an interactive addressee who intends to convey a quantity implicature (often the speaker is a “silly puppet” who only unintentionally produces an infelicitous utterance). A further difference is that, unlike the studies reviewed at the beginning of this section (and most referential communication paradigms), children are asked to evaluate an utterance in the absence of specific information about what is relevant in the task. In sum, children’s failures in a long line of felicity judgment tasks in which they have to reject what (from an adult’s perspective) is an under-informative utterance are plausibly due to a failure to reason about the goals of the task (i.e., what is a relevant alternative to what the speaker has said), especially when there is no clear indication of what these goals are and no genuine interlocutor.
In direct support of the role of relevance, 5-year-olds compute quantity implicatures when more informative lexical alternatives are highly accessible, but only if these alternatives are relevant to the probable goal of
the task (Skordos and Papafragou, 2016). Furthermore, in simple tasks where quantity inferences are used for a clear conversational goal (e.g., to identify a referent) and relevant lexical alternatives are highly salient, even children younger than 4 compute quantity implicatures (Stiller, Goodman, and Frank, 2015; Kampa, Richards, and Papafragou, 2019).
20.4 Conclusion
In this chapter, we surveyed a rich set of experimental evidence demonstrating the role of social-pragmatic mechanisms in very young children’s lexical development. This evidence suggests that infants with limited linguistic knowledge use a variety of social-pragmatic mechanisms to identify possible meanings for newly encountered words. Similarly, children use pragmatic reasoning to constrain potential contextual interpretations of known words (e.g., when resolving referential indeterminacy or drawing conversational implicatures). Thus, from the perspective of the young learner, pragmatic mechanisms have a role both in constraining hypotheses about what newly encountered words mean and in contextually enriching the linguistic-semantic meaning of known words during a conversation. Perhaps most importantly, children’s pragmatic computations seem to rely on a pragmatic architecture that flexibly integrates information about properties of the speaker with a developed set of expectations about how rational communicators should talk (Grice, 1975; Sperber and Wilson, 1986/1995), thereby suggesting continuity with major components of the adult system. At the same time, this chapter has highlighted several limitations in children’s reliance on pragmatics to acquire and use their lexicon. For instance, even though preschool-aged children frequently consider the perspective of the speaker across several phenomena, the ability to do so consistently and to adult levels of sophistication undergoes significant development. As a result, learners can be inconsistent in their ability to consider the speaker’s knowledge (e.g., Epley et al., 2004; Hochstein et al., 2014; Papafragou et al., 2017, 2018). They fail to compute some pragmatic inferences that adults routinely derive (e.g., Noveck, 2001; Huang and Snedeker, 2009), and might show strikingly different responses in what appear to be close variants of the same task (e.g., Nadig and Sedivy, 2002; Epley et al., 2004). We have suggested that these limitations emanate from difficulties implementing fundamental pragmatic mechanisms across different contexts, and not from pragmatic insensitivity. This idea is further supported by the fact that general cognitive limitations in children correlate with pragmatic performance (Matthews, Biney, and Abbot-Smith, 2018), and that even adults’ pragmatic performance deteriorates under cognitive load (e.g., see Horton and Gerrig, 2005; Wardlow, Lane, and Ferreira, 2008). Learners overcome these limitations as they become capable of inferring common ground or referential intent from less overt cues across different phenomena such
as word learning (e.g., Papafragou et al., 2017), reference resolution (e.g., Nilsen and Graham, 2009), and scalar implicature (e.g., Stiller, Goodman, and Frank, 2015; Kampa and Papafragou, 2020). To be sure, the picture of pragmatic development presented here is highly selective and needs to be broadened in several respects. First, for simplicity’s sake, we have treated the acquisition and the contextual interpretation of words as distinct processes, but naturally both operate in tandem in language acquisition: in most word learning studies we described, children are figuring out the meaning of a novel noun as they solve a referential question (how to choose an object among many possible referents for a new name). The true task for learners is thus how to extract the semantics of a word from its pragmatically enriched interpretations as these interpretations shift more or less dramatically across contexts of use. Second, in the present chapter, we have focused exclusively on lexical comprehension, but it is clear that similar issues arise in lexical production. Indeed, recent work has suggested that pragmatic pressures such as informativeness shape children’s use of the lexicon in production, often in strikingly similar ways cross-linguistically (Bannard, Rosner, and Matthews, 2017; Grigoroglou, Johanson, and Papafragou, 2019). Third, we have sampled only a few of the many pragmatic phenomena actively being investigated in the literature, and mostly included examples from studies on typically developing English speakers (see Grigoroglou and Papafragou, 2017; Matthews, Biney, and Abbot-Smith, 2018, for additional examples). A fuller picture of pragmatic development needs to broaden the empirical basis under discussion and include learners from a wider variety of communities and backgrounds. As this chapter showed, one of the biggest challenges in studying the contribution of pragmatics to children’s acquisition and use of the lexicon is inferring the state of the learner from variable, often contradictory patterns of pragmatic performance. Moving forward, we suggest two main directions that could be pursued to address this challenge. First, the field needs more specific linking assumptions between formal models of semantic-pragmatic representations and the specific demands of individual psycholinguistic paradigms used to study pragmatic abilities across children of different ages and adults. Such an integrated approach can help elucidate the nature and growth of the pragmatic system, as well as bridge the mostly separate research traditions now studying early word learning and pragmatic interpretation in preschoolers and older children (e.g., for notable examples see de Marchena et al., 2011; Srinivasan et al., 2019, on mutual exclusivity; Kampa and Papafragou, 2020, on implicature). Furthermore, computational models paired with behavioral methods can tease apart the variables, assumptions, and computations that comprise word learning and interpretation at the semantics/pragmatics interface (e.g., see Bohn et al., 2019; Frank and Goodman, 2014). Second, and relatedly, it is important for the field to move toward an investigation of how the same pragmatic principles apply across diverse linguistic phenomena. For instance, the principle of informativeness is an underlying assumption guiding inferences
across a variety of phenomena in both word learning and interpretation but has been examined mostly within limited, individual phenomena such as sensitivity to discourse novelty in word learning (e.g., Akhtar et al., 1996), referential communication (e.g., Nadig and Sedivy, 2002), or scalar implicature (e.g., Noveck, 2001). It would be interesting to see how this assumption (or other similarly broad pragmatic principles) is implemented across different linguistic phenomena and even in non-linguistic communication (see Kampa et al., 2019).
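To make the kind of computational modeling mentioned above more concrete, the sketch below (in Python) illustrates how a pragmatic listener in the general rational-speech-act family discussed by Frank and Goodman (2014) derives a “some, therefore not all” inference from informativeness expectations alone. This is a minimal illustration under simplified assumptions (a uniform prior, two scalar alternatives, a fully knowledgeable speaker); the state names, function names, and parameter values are ours and do not come from any of the studies reviewed here.

    # Illustrative rational-speech-act-style sketch (not the authors' model).
    STATES = ["none", "some_not_all", "all"]   # how much of the star was colored
    UTTERANCES = ["some", "all"]               # the scalar alternatives

    def literal_true(utterance, state):
        # Literal semantics: "some" is true unless nothing happened; "all" only if all.
        if utterance == "some":
            return state in ("some_not_all", "all")
        if utterance == "all":
            return state == "all"
        return False

    def normalize(dist):
        total = sum(dist.values())
        return {k: v / total for k, v in dist.items()} if total else dist

    def literal_listener(utterance):
        # L0: uniform prior over states, filtered by literal truth.
        return normalize({s: 1.0 if literal_true(utterance, s) else 0.0 for s in STATES})

    def speaker(state, alpha=4.0):
        # S1: prefers utterances that lead the literal listener to the true state.
        scores = {u: literal_listener(u).get(state, 0.0) ** alpha
                  if literal_true(u, state) else 0.0 for u in UTTERANCES}
        return normalize(scores)

    def pragmatic_listener(utterance):
        # L1: Bayesian inference over states, with the speaker model as likelihood.
        return normalize({s: speaker(s).get(utterance, 0.0) for s in STATES})

    print(pragmatic_listener("some"))  # most probability on 'some_not_all'
    print(pragmatic_listener("all"))   # all probability on 'all'

Speaker knowledge of the kind manipulated by Papafragou et al. (2018) and Kampa and Papafragou (2020) could in principle be added by letting the modeled speaker observe only part of the state, which weakens or blocks the inference for partially informed speakers.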
Acknowledgments
Preparation of this chapter has been supported in part by NSF grant #1632849.
Chapter 21
Differences in vocabulary growth across groups and individuals
Christine E. Potter and Casey Lew-Williams
21.1 Introduction Young children vary in how they learn words, and developmental scientists vary in how they emphasize individual differences in research on word learning. While the evaluation of variability is built into any empirical study—in the form of statistical tests and conventional reporting of standard deviations and outliers—only a subset of studies examines individual differences as a research question or as a way to interpret data. A common way to motivate developmental science is to focus on average differences between samples of children or experimental conditions, but an equally rich approach is to focus on differences within a sample of children and to explore relations among different skills. The examination of individual differences is an important device in two primary ways: for understanding the mechanisms of learning, and for making predictions about children’s outcomes. The purpose of this chapter is to examine both causes and consequences of individual differences in children’s early vocabulary growth and word learning. We suggest that by testing associations and dissociations between different types of abilities, experiences, and knowledge, it becomes possible to draw key inferences about word learning and child development. In turn, these findings can be used to develop more careful, inclusive, and complete theories of learning across a range of populations.
21.1.1 Mechanisms of individual differences Examining relations between children’s abilities and outcomes often provides useful insights into developmental mechanisms. For example, a long-standing question in language research is whether or not a particular skill is specific to language. Vocabulary knowledge has been linked to diverse cognitive skills, including attention and memory (e.g., Salley, Panneton, and Colombo, 2013; Swanson, 1996) and even the development of motor skills (Iverson and Wozniak, 2007), suggesting influences of domain-general processes. At the same time, correlations between word knowledge and verbal measures are often more reliable than relations with non-verbal tasks (e.g., Newman, Ratner, Jusczyk, Jusczyk, and Dow, 2006; Rajan, Konishi, Ridge et al., 2019), suggesting there may also be some specialization. Correlational studies cannot provide proof of causal relations between different abilities, but examinations of associations between children’s vocabulary knowledge and other skills have challenged notions of modularity and yielded many observations that have helped build comprehensive theories of developmental change (Bates, Dale, and Thal, 1995). The study of individual differences has also been used to inform discussions about children’s use of different mechanisms for learning new words over the course of development. Children are often able to demonstrate comprehension of words before they produce them, and studies that track children’s learning over time have highlighted not only the fact that children do not add words to their lexicons at a consistent rate but also that the trajectory of children’s learning varies widely (e.g., Fenson, Dale, Reznick et al., 1994). For instance, some children demonstrate sharp growth in their vocabulary after they produce approximately 50 words, a phenomenon termed the “vocabulary spurt” (Nelson, 1973). However, some children’s growth rate is steadier, and it has been argued that what appears to be a sudden shift in behavior may actually reflect continuous change (Ganger and Bent, 2004; McMurray, 2007). This incrementality is not always apparent from studies that focus on a group’s mean performance, which can make it appear as though a group of children fully possesses or entirely lacks a particular strategy (e.g., McMurray, Samelson, Lee, and Tomblin, 2012; Perry and Kucker, 2019). In other words, while it may appear that older children make use of information that younger children do not, it is likely that differences are a matter of degree (e.g., McMurray, 2007). This viewpoint is also consistent with a growing belief found across a variety of fields that language abilities are best understood as existing on a continuum (e.g., Rescorla, 2009; Tomblin, Zhang, Weiss, Catts, and Weismer, 2004) and underscores the continuous nature of development (Elman, Bates, Johnson et al., 1996; McMurray et al., 2012; Samuelson and McMurray, 2017).
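The claim that an apparent spurt can emerge from fully continuous change (Ganger and Bent, 2004; McMurray, 2007) is easy to illustrate with a toy simulation. The Python sketch below is a hypothetical illustration in the spirit of McMurray’s argument, not a reimplementation of any published model, and its parameter values and variable names are arbitrary: every word accrues evidence at the same steady, noisy rate and differs only in how much evidence it requires, yet the count of “acquired” words rises slowly, then sharply, then levels off.

    # Toy simulation: a vocabulary "spurt" from continuous, parallel learning.
    import random

    random.seed(1)
    N_WORDS = 500
    # Each word needs a different amount of evidence (roughly normal, floored at 1).
    thresholds = [max(1.0, random.gauss(60.0, 15.0)) for _ in range(N_WORDS)]

    evidence = [0.0] * N_WORDS
    acquired_by_week = []
    for week in range(200):
        for i in range(N_WORDS):
            evidence[i] += random.random()     # steady, noisy accumulation
        acquired_by_week.append(sum(e >= t for e, t in zip(evidence, thresholds)))

    # Coarse growth curve: slow start, rapid middle, leveling off.
    for week in range(0, 200, 20):
        print(f"week {week:3d}: {acquired_by_week[week]:3d} words")

Plotting the full curve would show an S-shape whose steep middle section looks like a discrete spurt even though the underlying learning process never changes.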
21.1.2 Predicting long-term outcomes
In addition to offering useful theoretical tools, it is valuable to study individual differences in children’s early vocabulary because these differences tend to persist and
440 Christine E. Potter and Casey Lew-Williams are often reliable predictors of later language ability and even success in school. Even before they begin producing a single word, infants’ demonstrations of preliminary word knowledge predict the size of their vocabulary in subsequent years. For example, infants who show better recognition of familiar word forms in behavioral and neuroimaging tasks have larger vocabularies at two years (Junge, Kooijman, Hagoort, and Cutler, 2012; Newman et al., 2006; Singh, Reznick, and Xuehua, 2012). Likewise, infants who are reported to know more words at one year of age tend to score better on standardized measures of vocabulary knowledge at three years (Lyytinen, Laakso, Paikkeus, and Rita, 2001; Rose, Feldman, and Jankowski, 2009). These relations demonstrate the continuity of infants’ learning and knowledge and highlight the value of tracking individual differences from an early age in order to understand later differences. To understand the emergence of individual differences, longitudinal studies have examined the growth of children’s vocabulary knowledge over time and have revealed that the relative size of children’s vocabulary typically remains somewhat stable (e.g., Fenson et al., 1994). While there is considerable variability in individual children’s trajectories, those children who are classified as “early talkers” at young ages are likely to have advanced vocabularies up to two years later, while those who are identified as “late talkers” disproportionately continue to know fewer words than their peers (Fernald and Marchman, 2012; Thal, Bates, Goodman, and Jahn-Samilo, 1997). There is also burgeoning evidence that these patterns continue into childhood. Recent studies have found that the number of words that children produce at 16 to 24 months can predict the size of their vocabularies up to eight years later (Duff, Reen, Plunkett, and Nation, 2015; Rajan et al., 2019). Similarly, the rate at which children’s vocabulary grows between the ages of one and four years is related to their later vocabulary (Rowe, Raudenbush, and Goldin-Meadow, 2012). These differences are particularly noteworthy because children’s vocabulary growth has been linked to their school readiness and even the structure of their brain (Asaridou, Demir-Lira, Goldin-Meadow, and Small, 2017; Rowe et al., 2012). In particular, lexical knowledge is likely to be foundational for children’s early reading skills (see Seidenberg, 2017 for discussion of how reading builds on children’s prior knowledge of words and the relations among them). Measures of vocabulary in infancy have been shown to be related to reading ability in school-aged children (Duff et al., 2015), and the size of children’s vocabulary when they begin school continues to predict their reading skills throughout the elementary school years (Quinn, Wagner, Petscher, and Lopez, 2015). Correlations between young children’s vocabulary and their later reading skills have been reported across different cultures, languages, and school systems, including the US, China, and the Netherlands (e.g., Quinn et al., 2015; Song, Su, Kang et al., 2015; Verhoeven, van Leeuwe, and Vermeer, 2011). These long- term associations underscore the importance of exploring the origins of differences in children’s knowledge and suggest that the roots of children’s academic success may begin as early as infancy. 
While there is converging evidence to support the view that later differences in children’s knowledge are related to early variability, it is also important to acknowledge
that considerable amounts of variance have yet to be explained. For example, although Thal and colleagues (1997) documented reliable group-level differences in the rate at which children acquired words, they were unable to predict trajectories for individual children. Similarly, a number of studies have found that measures of infants’ vocabulary cannot reliably be used to determine which children will have diagnosable language delays (e.g., Fernald and Marchman, 2012; Justice, Bowles, Turnbull, and Skibbe, 2009). Infants’ early knowledge is far from a perfect indicator of how quickly and successfully children will add new words to their vocabularies, and more research is needed to better identify those children at risk for poor outcomes. However, the fact that there are cross-age correlations suggests that a better understanding of how differences in children’s vocabularies first emerge can ultimately inform both theories and interventions aimed at improving language and academic outcomes.
21.1.3 Overview In the current chapter, we do not provide a historical account of how differences have been construed (see Fernald and Weisleder, 2011 for such a review), and we do not fully describe all of the impressive efforts to systematically document the size and composition of children’s vocabulary across ages and social groups. Instead, we focus our attention on contemporary perspectives on individual differences and then attempt to unite ideas from studies of typical language development, language disorders, and bilingualism. In the first section, we discuss the origins and early emergence of individual differences in children’s vocabulary knowledge. Current research emphasizes the importance of language input in children’s learning and has highlighted striking variability in the language experience of different children, which is particularly apparent when comparing children from different backgrounds. We describe specific features of the input that have been proposed to support children’s learning and suggest areas for future research. In the second section, we note that children play a powerful— and at times overlooked—role in determining their own learning. Children build their knowledge gradually over time, and while the evidence that children’s vocabulary knowledge is directly related to the input they receive is compelling, differences in input alone cannot fully explain why some children learn more quickly than others. Thus, rather than conceiving of children as passive recipients of language input, science points toward viewing children as active agents whose knowledge, past experiences, and abilities shape the input they receive and the information they extract from it. In the third section, we focus on differences across different populations of children. We begin by reviewing how vocabulary develops for children learning different languages. We then consider vocabulary development for children with developmental delays and disorders. Finally, we examine the vocabulary knowledge of children
learning two languages and use bilingual environments as a case study to understand how experience can influence children’s abilities to learn new words.
21.2 Origins of individual differences Individual differences in children’s vocabulary knowledge have been proposed to arise from a variety of sources, including biological and genetic influences (e.g., Oliver and Plomin, 2007), basic cognitive and perceptual abilities (e.g., Tsao, Liu, and Kuhl, 2004), and environmental factors including childhood nutrition, stress, physical environment, and family well-being (e.g., Shonkoff, Garner, Siegel et al., 2012). In recent years, researchers have tended to emphasize that much of the variability in children’s knowledge may be related to differences in their language environments, and that children who receive richer input tend to have larger vocabularies (e.g., Hart and Risley, 1995; Hoff, 2003). Recognizing these associations has been important in helping to reconceptualize well-documented differences between the abilities of children from different family backgrounds (Huttenlocher, Waterfall, Vasilyeva, Vevea, and Hedges, 2010; Rowe, 2008) and has yielded crucial insights into learning processes. Other research has revealed that even at very young ages, some children learn words more easily than others, suggesting that differences may emerge quite early (e.g., Samuelson and Smith, 1999). For example, as young as six months of age, some infants show evidence of recognizing some words, while others do not, and those infants who show more recognition of familiar words at young ages know more words at two years (Bergelson and Swingley, 2012; Singh et al., 2012). Thus, early individual differences likely become exacerbated over time, as children’s vocabulary continually builds on their past experience and knowledge.
21.2.1 Associations between caregivers’ language input and children’s vocabulary At a superficial level, it is obvious that children can only learn the words that they hear, so their vocabularies must be in some way related to their language and communicative experience. It has become increasingly clear that children receive different amounts and types of experience with language and that these differences, to paraphrase the seminal work by Hart and Risley (1995), are meaningfully related to their language knowledge and development. Furthermore, when children encounter words in different contexts, they have more opportunities to determine what the words mean (Trueswell, Medina, Hafri, and Gleitman, 2013; Smith and Yu, 2008), as well as to enrich their understanding of those words (Sloutsky, Yim, Yao, and Dennis, 2017; Wojcik and Saffran, 2015), and to use the words they know to continue learning
Differences in vocabulary growth across groups and individuals 443 additional words (Samuelson, 2002; Smith, Jones, Landau, Gershkoff-Stowe, and Samuelson, 2002). Thus, it may be unsurprising that children whose parents talk to them more during the first years of life are likely to have larger vocabularies across ages (Hart and Risley, 1995; Hoff, 2003; Huttenlocher, Haight, Bryk, Seltzer, and Lyons, 1991; Newman, Rowe, and Ratner, 2016; Rowe, 2008; Weisleder and Fernald, 2013). In fact, one way to understand these effects is to view children’s input as the “corpus” of data that they can use to learn words; when children have access to larger, richer datasets, learning is likely to be more successful. Critically, and unfortunately, links between language input and language knowledge have been consistently demonstrated by comparing children from backgrounds of high- and low-socioeconomic status (SES). Typically, children from higher-SES families hear more speech, and the speech that they hear contains more unique words, offering them a rich corpus of data from which to learn (Hart and Risley, 1995; Huttenlocher et al., 2010; Rowe, 2008). Because children in lower-SES homes tend to be exposed to less language, differences in children’s early language environments have been proposed to explain SES-related discrepancies in early vocabulary growth (Hoff, 2003; Huttenlocher et al., 2010; Rowe, 2008; Weisleder and Marchman, 2018). Importantly, although children from low-SES families hear less language on average than their high-SES peers do, there is also considerable variability in language use by families of approximately the same SES. The key message from this research is that across the SES spectrum, children’s language development relates to their parents’ speech (Rowe, 2008; Schwab, Rowe, Cabrera, and Lew-Williams, 2018; Weisleder and Fernald, 2013). Therefore, both researchers and policy makers have begun to emphasize the importance of encouraging parents, especially in lower-SES communities, to engage in rich dialogue with their children in the hopes of reducing disparities (e.g., Ridge, Weisberg, Ilgaz, Hirsh-Pasek, and Golinkoff, 2015; Suskind, Leffel, Graf et al., 2016).
21.2.2 Important features of language input While the sheer number of words that children hear may be useful as a global measure of their language experience, newer research has sought to tease apart more nuanced relations between properties of the input and children’s learning. For example, Newman and colleagues (2016) reported that infants whose parents often repeat words tend to have larger vocabularies. This relation makes sense; children may find it easier to learn the meaning of a word after repeated exposure (Schwab and Lew-Williams, 2016). However, other research suggests that as children get older, their vocabulary growth is negatively correlated with parents’ use of repetition (Schwab et al., 2018), and instead, the diversity of caregivers’ speech predicts both receptive and productive vocabulary (Hoff and Naigles, 2002; Huttenlocher et al., 2010). While these findings may at first appear incongruous, they can be reconciled by the proposal that children benefit from exposure to input that is tailored to their present knowledge (Rowe, 2008; Rowe and Leech, 2018; Schwab et al., 2018). In other
444 Christine E. Potter and Casey Lew-Williams words, it is not uniformly the case that simplicity or variability supports children’s learning. When children know very few words, they may accumulate information gradually and need more experience with each individual item. However, as their vocabularies grow, they can draw on prior knowledge and experience, allowing them to benefit from more complexity in the input. Therefore, caregivers who provide input that reflects their children’s current abilities are most likely to be able to continue to support their children’s subsequent learning. Moreover, it is not only language input that can be tailored for a child. As in speech, parents adjust their gestures and actions when interacting with young children and provide input that is simpler, more enthusiastic, and more repetitive, offering infants potentially useful cues for learning (Brand, Baldwin, and Ashburn, 2002; Fernald and Simon, 1984), and non-verbal information has been shown to help children learn new words. Differences in parents’ early use of gesture relate to children’s later vocabulary (Rowe and Goldin-Meadow, 2009), and similarly, parents who are more engaged with their children may be able to direct their children’s attention more effectively, which in turn could help children to learn the names of objects more easily (Yu, Suanda, and Smith, 2018). In fact, it has been shown that when parents provide clear cues to a word’s referent in their spontaneous interactions with their infants, the same children have larger vocabularies even years later (Cartmill, Armstrong, Gleitman et al., 2013). In addition, children whose language experience is more interactive (as measured by conversational turn-taking) not only show enhancements in vocabulary knowledge but also show different patterns of neural activity, even after controlling for SES and the overall quantity of input that they receive (Romeo, Leonard, Robinson et al., 2018). Combined, these studies suggest that children’s learning can be systematically influenced by interactions with their parents and support the view that early experience has long-lasting consequences for the words that children know. As a result of recent research that has quantified and delved into different aspects of children’s experience, individual differences in children’s knowledge are now widely interpreted to reflect children’s experience, rather than other factors, such as genetics. This perspective has led to a significant focus on input as a target for intervention in low-SES communities. However, the exact nature of the association between input and outcomes is still poorly understood. Existing research cannot satisfactorily explain why some parents talk to their children more than others, and there has been little research examining the input that children receive from individuals other than their parents, such as teachers, siblings, or peers. Likewise, it is likely that different types of input may support different aspects of learning, but direct associations between children’s experience and their learning of different types of words, for example, have yet to be explored. To fully understand how children’s experience can (or perhaps cannot) determine the trajectory of their vocabulary growth, future research will need to test specific relations between different aspects of children’s language environment and the words that they know.
21.3 How children construct their knowledge of words
As learners gain experience with language, they become sensitive to different cues in the environment, and even when presented with identical input, individual learners do not necessarily extract the same information (Potter, Wang, and Saffran, 2017). Moreover, children’s knowledge, interests, and attention all affect the input they receive, and individual differences in abilities such as memory, executive function, and temperament have been shown to relate to language knowledge (e.g., Slomkowski, Nelson, Dunn, and Plomin, 1992; Swanson, 1996; Zelazo, Anderson, Richler et al., 2013), suggesting that children’s cognitive skills may also affect the rate and manner in which their vocabularies grow. Therefore, it is important to consider that children play an active and integral role in constructing their knowledge of words.
21.3.1 Cumulative effects of learning over time Children’s early experience with language sets the stage for future learning because early differences are likely to become exaggerated over time. Children who hear more and more rich language early in life not only have more opportunities to learn the meanings of words but also—critically—to develop learning strategies that make use of their prior knowledge. Children’s growing knowledge of their native language changes the way they interpret the input around them, and they use words that they already know to inform their learning of new words, as well as to process information more efficiently. Thus, early experience not only provides a corpus of data from which to learn words but also influences the way that children approach new input, which in turn shapes their subsequent learning. The cumulative influence of children’s learning is clearly illustrated in the domain of speech perception; as infants’ vocabulary size increases, the sound structure of their native language exerts greater influence over their learning (Graf Estes, Edwards, and Saffran, 2011). Infants who know more words have a better understanding of the common sound patterns in their native language, and they pay more attention to distinctions that are important in their language and less attention to those that are not meaningful (e.g., Kuhl, Stevens, Hayashi et al., 2006, see Creel, this volume). For instance, infants who know more words are better able to understand speakers with different types of accents, suggesting that they can use their knowledge to overcome the challenge of unfamiliar pronunciations (Mulak, Best, Tyler, Kitamura, and Irwin, 2013). Consequently, as children’s vocabulary size increases, they are more likely to infer that words that differ in only a single sound refer to distinct objects, supporting their ability to learn challenging new words (Law and Edwards, 2015). When children know more words, their representations of the sounds and words in their language become more
446 Christine E. Potter and Casey Lew-Williams refined, allowing them to exploit that knowledge to make sense of new information (see Swingley, this volume). Differences in the number of words that children already know are also related to their ability to accumulate additional word knowledge. Children with larger vocabularies and more robust word knowledge show more successful learning of new words across a variety of paradigms (e.g., Bion, Borovsky, and Fernald, 2013; Ferguson, Graf, and Waxman, 2018; Jones, 2003; Lany, 2018). Why might this be? One possibility is that children who know more words process language more efficiently, thereby supporting their ability to encode and remember new words (Fernald and Marchman, 2012; Lany, 2018; Rajan et al., 2019). In one study by Fernald and colleagues (2008), children who responded more quickly to words in the middle of a sentence such as There’s a blue cup on the deebo showed better learning of the novel words that occurred later in the sentence. Other evidence also suggests that children who recognize words more easily at young ages show more rapid growth in their vocabularies (Fernald and Marchman, 2012; Lany, Giglio, and Oswald, 2018; Weisleder and Fernald, 2013), demonstrating how efficient processing of familiar words can support subsequent learning. A second possibility is that children with larger vocabularies can leverage their greater knowledge to determine the meanings of new words. Several studies have shown that children can use their understanding of familiar verbs to infer the meaning of new nouns. For instance, children can deduce that an unknown word must refer to an edible referent when they hear sentences such as Do you want to eat the artichoke?, and children who know more verbs are better able to learn new words in these types of contexts (Ferguson, Graf, and Waxman, 2014; Ferguson et al., 2018; Goodman, McDonough, and Brown, 1998). Likewise, children can use other parts of speech, such as adverbs, to inform their learning of verbs under otherwise-challenging circumstances (Syrett, Arunachalam, and Waxman, 2014). Children may also make use of their existing knowledge via mutual exclusivity, the tendency to map a new label onto a novel object when viewing one novel object and one or more familiar objects (Markman and Wachtel, 1988). Over time, this approach can only be realistic and useful if children already know the names of a sufficient number of objects; indeed, children who know more words show more evidence of using mutual exclusivity to learn new labels (Bion et al., 2013; Law and Edwards, 2015). When viewing multiple objects at once, with no clear referential cues, children may also be able to leverage their word knowledge to gradually rule out incorrect mappings (Smith and Yu, 2008; Stevens, Gleitman, Trueswell, and Yang, 2017). For instance, a child might see several toys on the floor, and she will more easily be able to figure out which one is the xylophone and which one is the slinky if she already knows the word for puzzle. Because children’s ability to learn new words often relies on their existing word knowledge, small differences in early experience and vocabulary size have the potential to compound over time. Just as both the quantity and quality of children’s language experience shape language outcomes, the size and composition of children’s vocabulary can influence later learning. 
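The elimination logic behind mutual exclusivity and the gradual ruling out of mappings described above can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical illustration rather than the models of Markman and Wachtel (1988), Smith and Yu (2008), or Stevens et al. (2017); the helper names (candidates_for_novel_label, cross_situational_guess) and the toy scenes are invented for the example. The point is simply that every object name the child already knows removes a competitor, so a larger vocabulary leaves fewer candidate referents for a novel label.

    # Referent selection by elimination: known names rule out candidate referents.
    known_words = {"puzzle": "puzzle", "ball": "ball"}   # label -> known object

    def candidates_for_novel_label(objects_in_view, known):
        # Objects not already covered by a name the child knows.
        named_objects = set(known.values())
        return [obj for obj in objects_in_view if obj not in named_objects]

    def cross_situational_guess(scenes, known):
        # Intersect remaining candidates across the scenes where the label was heard.
        remaining = None
        for objects_in_view in scenes:
            cands = set(candidates_for_novel_label(objects_in_view, known))
            remaining = cands if remaining is None else remaining & cands
        return remaining

    # The child hears "xylophone" in two cluttered scenes.
    scenes = [["puzzle", "xylophone", "slinky"],
              ["ball", "xylophone", "blocks"]]
    print(cross_situational_guess(scenes, known_words))  # {'xylophone'}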
Both behavioral studies and computational models have shown that children are more likely to learn words that are related to words that they already know
Differences in vocabulary growth across groups and individuals 447 (Borovsky, Evans, Ellis, and Elman, 2016; Hills, Maouene, Maouene, Sheya, and Smith, 2009). As many parents can attest, a child who can identify a velociraptor and a triceratops may quickly learn to recognize a brachiosaurus, while another child who knows the words bulldozer and backhoe may have an easier time learning excavator. Moreover, the structure of children’s vocabulary also influences more general strategies in word learning. When children know words from categories that are organized by shape, they are more likely to generalize a novel label to another object of the same shape instead of an object made of the same material (e.g., Samuelson and Smith, 1999). This is a useful insight, as shape is often an indicator of category membership. Importantly, it has been shown that children’s vocabulary develops more quickly once they acquire this bias to make generalizations based on shape (Samuelson, 2002; Smith et al., 2002), and children with vocabulary delays show a reduced bias (Jones, 2003; Perry and Kucker, 2019). In addition, children’s knowledge of grammatical patterns grows along with their learning of individual words (De Carvalho, He, Lidz, and Christophe, 2019; Devescovi, Caselli, Marchione et al., 2005; Marchman and Bates, 1994). As children learn more about the structure of language, they are increasingly able to take advantage of relations among words (e.g., Gertner, Fisher, and Eisengart, 2006; Gleitman, 1990; Naigles, 1990; Wojcik and Saffran, 2015; Yuan, Fisher, and Snedecker, 2012, see Lidz, this volume). This research highlights that children do not learn words in isolation, and that vocabulary development is a process of building networks of interconnected knowledge that depend on the specific words and experiences of the individual child (Borovsky and Peters, 2018; Hills et al., 2009; Wojcik, 2017).
21.3.2 Children are not passive recipients of language input
To fully understand how children’s past experience and learning shape their later vocabularies, it is important to consider that children themselves play a role in determining the input that they receive. Caregiver speech, infant language skills, and later vocabulary have all been shown to be interrelated (Bornstein, Tamis-LeMonda, and Haynes, 1999; Newman et al., 2016), and transactional models of development suggest that it is impossible to isolate the influences of the child’s individual abilities from the social context, as different children elicit different input from the speakers around them (Sameroff and Chandler, 1975; Slomkowski et al., 1992). Parents have been shown to adjust their speech to make it easier for children to comprehend (Arunachalam, 2016), and they are likely to direct simpler speech to children who do not yet demonstrate understanding of many words and to direct more complex speech to children who can understand and respond appropriately (Huttenlocher et al., 2010; Schwab et al., 2018). Parents may also be more inclined to label and talk about objects that consistently attract their children’s attention (Suanda, Geisler, Smith, and Yu, 2014), which could further support children’s processing of familiar items and their
learning of new, related words (Borovsky et al., 2016). To illustrate, parents might be more likely to both read animal books and go to the zoo if they have an animal-loving child, which in turn would support children’s learning of less common animal words like sloth, mammal, and herbivore. Therefore, the input that children receive on a daily basis can be viewed as both an origin and outcome of individual differences in their current language knowledge.
21.3.3 Influences of cognitive abilities
Language is, of course, just one aspect of cognitive functions and processes—intertwined with many others—and studying language alone would be limited in scope. Improvements in infants’ understanding of language are related to their growing knowledge of the world around them, as language and cognitive abilities can be mutually reinforcing (see Perszyk and Waxman, 2018, for review). Being able to track regularities in the environment, attend to the most relevant cues, remember salient information, and integrate learning over time are all likely to facilitate the acquisition of new words (e.g., Konishi, Stahl, Golinkoff, and Hirsh-Pasek, 2016; Saffran, 2003; Smith and Yu, 2008; Swanson, 1996). Across many studies, children who perform better on tasks that assess abilities such as working memory, cognitive control, and attentional flexibility tend to have larger receptive and productive vocabularies (e.g., Reuter, Emberson, Romberg, and Lew-Williams, 2018; Swanson, 1996; Zelazo et al., 2013). In addition, measures of early infant cognition, such as visual attention and memory, can predict individual differences in vocabulary not only in infancy, but also later in life (Rose et al., 2009; Salley et al., 2013; Yu et al., 2018). These associations underscore the importance of recognizing that individual children process information differently and suggest that differences in domain-general cognitive skills may also help explain differences in children’s vocabulary, further demonstrating that children build, and do not simply absorb, their knowledge.
21.4 Vocabulary development across different populations
21.4.1 Cross-linguistic research
Research on children’s early language learning has focused excessively on children who are learning English, and patterns of vocabulary development may not be identical across languages. Languages, of course, vary in many structural dimensions. But of equal importance is that children growing up in different societies are exposed to varying amounts and types of speech, and different cultures vary in the importance they
Differences in vocabulary growth across groups and individuals 449 place on language (Casillas, Brown, and Levinson, 2020; Cristià, Dupoux, Gurven, and Stieglitz, 2019; Weber, Fernald, and Diop, 2017). Differences in how parents choose to engage with their children can reflect personal or community attitudes (Johnston and Wong, 2002; Schieffelin and Ochs, 1986; Simmons and Johnston, 2007; see Kuchirko and Tamis-LaMonda, 2019 for recent review of similarities and differences in how parents interact with children in different communities), and the quantity, acoustic properties, and contents of parents’ speech differ across languages and cultures (Au, Kit-Fong, Dapretto, and Song, 1994; Fernald and Morikawa, 1993; Schneidman and Goldin-Meadow, 2012; Tardif, Shatz, and Naigles, 1997). Nevertheless, studies that include samples of children learning different languages typically emphasize similarities in the size and growth of children’s early vocabulary (Italian: Caselli, Casadio, and Bates, 1999; Korean: Choi and Gopnik, 1995; Spanish: Jackson-Maldonado, Thal, Marchman, Bates, and Gutierrez- Clellen, 1993; Hebrew: Maital, Dromi, Sagi, and Bornstein, 2000; Icelandic: Thordardottir and Ellis Weismer, 1996). Likewise, relations between children’s early and later vocabulary have been reported in a number of languages, including in communities that place less emphasis on speaking directly to young children (Guiberson, Rodríguez, and Dale, 2011; Lyytinen, Laakso, Poikkeus, and Rita, 1999; Scheidman and Goldin-Meadow, 2012), suggesting commonalities in how children learn words across languages. However, there are reports that languages with particularly complex structure may be learned more slowly (Danish: Bleses, Vach, Slott et al., 2008; Greek: Papaeliou and Rescorla, 2011; Polish: Rescorla, Constants, Białecka-Pikul, Stępień-Nycz, and Ochał, 2017), indicating that there may be some cross-language variability. For example, it has been proposed that the sound system of Danish makes it especially hard for children to identify word boundaries in speech, which in turn impedes their ability to learn new words (Trecca, Bleses, Madsen, and Christiansen, 2018). Similarly, the grammar of Polish requires words to be marked with a variety of prefixes, suffixes, and interfixes, and these added challenges seem to lead Polish children to learn words more slowly (Rescorla et al., 2017; Smoczyńska, Krajewski, Łuniewska et al., 2015). Thus, the specific characteristics of different languages appear to influence the rate at which children acquire their vocabularies, highlighting the importance of studying different populations before drawing conclusions about universal patterns of development. Furthermore, counts of the total size of children’s vocabulary may obscure subtle differences in knowledge about types of words. In particular, there has been debate about whether the dominance of nouns in children’s early vocabulary is universal or reflects the unique statistics of English (Bornstein, Cote, Maital et al., 2004; Caselli, Bates, Casadio et al., 1995; Choi and Gopnik, 1995; Gentner and Boroditsky, 2001). Languages such as Korean and Mandarin contain more balanced use of nouns and verbs than English, and children learning these languages have been shown to learn verbs more quickly and nouns more slowly than children learning noun-heavy English (Choi and Gopnik, 1995; Tardif, 1996), revealing language-specific influences on word learning. 
Similarly, it has been reported that even when early vocabularies may be skewed toward nouns across languages, there are still different patterns in how quickly
children learn other word types, including verbs and adjectives (Bornstein et al., 2004; Tardif et al., 1997). Thus, it seems likely that differences in the structure and frequency of words in children’s native language affect the words they learn most easily, in addition to other linguistic and cultural factors.
21.4.2 Vocabulary development for children with developmental delays and disorders
Thus far, we have focused on the variability in children’s vocabulary that exists among children who are typically developing, but variability is perhaps more pronounced for children whose development is not typical. Vocabulary delays are commonly found in children with many different types of developmental delays, and in some cases, these delays may be diagnostic. Importantly, just because children with different types of developmental disorders display comparable deficits in their vocabulary growth does not mean that their difficulties stem from the same processes, and closer examination of the relative strengths and weaknesses of their knowledge can yield insight into underlying mechanisms, as well as potential targets for interventions.
21.4.2.1 Late talkers Language delays are often co-morbid with broader developmental disorders, but some children without another diagnosis still fall behind their peers in learning language. Because there is so much variability in children’s early vocabulary, it is difficult to accurately distinguish those children who will later have language delays and academic difficulties from those who simply wait longer to begin speaking. Current approaches to studying language disorders emphasize that children’s abilities exist along a continuum, and language delays are best understood as the lower portion of a normal distribution (McMurray, Samelson, Lee, and Tomblin, 2010; Rescorla, 2009; Tomblin et al., 2004). Consistent with this perspective, there is considerable heterogeneity among late talkers (Fernald and Marchman, 2012; Perry and Kucker, 2019; Rescorla, 2009). While some late-talking children are later diagnosed with language and learning difficulties, most eventually have language skills that fall within the normal range (Dale, Price, Bishop, and Plomin, 2003; Justice et al., 2009). Therefore, early delays in children’s expressive vocabulary are not necessarily indicative of long-term problems with language, and researchers continue to try to find ways to separate young children who are merely slow to begin talking from those who would benefit from early interventions and services (Fernald and Marchman, 2012; Justice et al., 2009). While most late talkers do not ultimately meet clinical criteria for language delays, it is important to note that they may still show some deficits in some areas of language later in life (Rescorla, 2009). Longitudinal work has shown that children who were classified as late talkers at two years but scored in the typical range as adolescents still had language skills that were lower than an SES-matched comparison sample, suggesting that
Differences in vocabulary growth across groups and individuals 451 slow growth in early vocabulary may have long-term consequences. Children who know fewer words also seem to have impoverished semantic representations of the words that they know (McGregor, Oleson, Bahnsen, and Duff, 2013), supporting the idea that children need a strong foundation of early vocabulary to support the incremental growth of a rich network of word knowledge. In addition, recent studies have found differences in the structure of late talkers’ vocabularies, as well as in their learning of new words. Compared to typically developing children who know a similar total number of words, late talkers’ vocabulary is not as well-matched to the typical patterns of children’s input (Beckage, Smith, and Hills, 2011). Late talkers’ early vocabularies contain proportionally fewer nouns and verbs, and the words that they know are less related to one another in meaning, revealing that their semantic networks may not have the same organization. Possibly due to this atypical organization, late talkers show less consistent use of word learning strategies that are built on their prior knowledge, such as making generalizations based on shape (Perry and Kucker, 2019). As reviewed earlier, the shape bias emerges as children learn more words from shape-based categories, which in turn boosts subsequent learning, so decreased use of this strategy may both reflect and contribute to existing differences. It may also be that children who talk less at young ages elicit different input than their typically developing peers (Beckage et al., 2011), offering another illustration of the cascading effects of early experience and abilities. While early measures of children’s vocabulary size cannot fully explain which children will go on to have language delays, the development of late talkers’ vocabulary still provides robust support for the view that early differences affect the long-term trajectory of children’s language knowledge.
21.4.2.2 Children with developmental disorders Difficulties with many aspects of language, including vocabulary, are common for both children and adults with autism spectrum disorders (ASD) and are often among the earliest symptoms observed (Luyster, Kadlec, Carter, and Tager-Flusberg, 2008; Mitchell, Brian, Zwaigenbaum et al., 2006). Language outcomes are highly variable for children with ASD, and as in other populations, the size of children’s vocabulary at young ages is related to their language skills in later childhood (Kjelgaard and Tager- Flusberg, 2001). In addition, vocabulary growth for children with ASD is negatively correlated with the severity of other autism symptoms, including difficulty with social interaction (Charman, Baron-Cohen, Swettenham et al., 2003; Ellis Weismer and Kover, 2015), highlighting the variety of influences on language development for children with developmental delays. Interestingly, compared to children with other types of developmental delays, toddlers and children with ASD show greater impairments in receptive vocabulary than in their expressive abilities (Ellis Weismer, Lord, and Esler, 2010; Kover, McDuffie, Hagerman, and Abbeduto, 2013). Like children with ASD, children with Down syndrome (DS) seem to have particular trouble with language, even compared to children with other types of developmental delays (Roberts, Martin, Moskowitz et al., 2007), and their early knowledge of words is related to later abilities (Chapman, Hesketh, and Kistler,
452 Christine E. Potter and Casey Lew-Williams 2002). It has long been suggested that parents may use simpler speech when interacting with children with DS; for example, they may give relatively more instructions and ask fewer questions (Buium, Rynders, and Turnure, 1974; Cardoso- Martins and Mervis, 1985), and as with typically developing children, the language skills of children with DS are also related to the input their parents provide (Crawley and Spiker, 1983; Sterling and Warren, 2014). Children with DS also display slower vocabulary growth than typically developing children, even when they are matched for non-verbal mental age (Hick, Botting, and Conti-R amsden, 2005). However, in contrast to children with ASD, children with DS have greater difficulty with expressive language, compared to comprehension (Laws and Bishop, 2003). The differences between these two populations suggest that there are multiple causes of children’s difficulties that are not yet well understood. Even in cases where children’s language abilities are considered to be relatively spared, children with developmental delays often differ from typically developing children (Karmiloff-Smith, Brown, Grice, and Paterson, 2003). Individuals with Williams syndrome (WS, a genetic disorder with distinct cognitive, physiological, and neuroanatomical traits) have language skills that are stronger than their non-verbal cognitive abilities might suggest (Bellugi, Bihrle, Jernigan, Trauner, and Doherty, 1990; Brock, Jarrold, Farran, Laws, and Riby, 2007). While children with WS tend to have larger expressive vocabularies than children with DS, they also show delays in infancy, and their vocabularies still do not equal those of typically developing children in either size or composition (Mervis and John, 2008; Mervis and Robinson, 2000). For example, children with WS have better knowledge of concrete words and less understanding of relational associations (Mervis and John, 2008). These differences in their vocabularies may be related to other skills, such as the ability to understand social cues and to appropriately engage in shared attention (John, Rowe, and Mervis, 2009; John, Dobson, Thomas, and Mervis, 2012), again suggesting that the process by which children with WS learn may differ from both typically-developing children and children with other types of delays (Nazzi and Karmiloff-Smith, 2002). The experiences and abilities of children with developmental delays vary widely, and deficits in language occur in conjunction with other conditions, making it hard to identify the origin of the difficulty. Nevertheless, comparisons of the skills of children with different types of delays do suggest some important underlying commonalities. Across populations, vocabulary knowledge seems to be related to non-verbal skills, and for children with delays, social abilities and attention may be especially important. In addition, as with typically developing children, early vocabulary knowledge appears to arise in part from input quality/quantity and predicts their later vocabulary outcomes. These relations leave open a key question: can the same interventions span multiple populations or should interventions be heavily tailored to a specific population? It is safe to conclude that early experiences are important for promoting the best possible outcomes, yet there is no one-size-fits-all description of vocabulary-related difficulties within or across populations.
21.4.3 Bilingual populations Despite the global prevalence of multilingual environments, research has overwhelmingly focused on monolingual children's learning. In the past decade, researchers have begun to rectify this lack of information, and there has been a dramatic increase in interest in children's learning of two languages simultaneously (Fennell and Lew-Williams, 2018). This research is important because it is not clear that standards based on monolingual children's knowledge are appropriate for evaluating bilingual children, and as a result, bilingual children may be both over- and under-diagnosed as having language delays (Bedore and Peña, 2008). In addition, the study of individual differences in bilingual children's knowledge affords the unique opportunity to separate language-specific experience from domain-general processes because the same child has different experience with each of two languages. For example, a Spanish-English bilingual toddler who loves Oreos might respond more quickly when offered a galleta than a cookie because of differences in familiarity with the words. Scrutinizing bilingual children's knowledge in each of their languages allows us to test how exposure to a particular language affects the words that children know, as well as to observe the impact of multilingual experience on children's learning strategies.
21.4.3.1 Learning two languages By definition, bilingual children must learn more words than their monolingual peers— a child learning both English and Spanish will ultimately need to know both dog and perro to be able to communicate effectively in each language. This added challenge leads many parents to express concern that the presence of two languages will impede their children’s language development, and it is true that when young bilinguals are compared to monolinguals, they typically know fewer words in the language of comparison (Byers-Heinlein and Lew-Williams, 2013; Pearson, Fernández, Lewedeg, and Oller, 1997). However, estimates that take into account their knowledge of the two languages combined tend to show that monolingual and bilingual children learn their first words at a similar rate (e.g., Core, Hoff, Rumiche, and Señor, 2013; Pearson, Fernández, and Oller, 1993). That is, young English-Spanish bilinguals may know fewer English words than their monolingual peers, but they know a similar total number of words when their English and Spanish vocabularies are counted together. Overall, both the trajectory and heterogeneity of bilingual children’s early vocabulary growth appear to be similar to those of monolinguals, and as with monolinguals, individual differences in bilingual children’s early vocabularies are correlated with the number of words that they know at later ages (Conboy and Thal, 2006; Hurtado, Grüter, Marchman, and Fernald, 2014), providing additional evidence for the persistence and continuity of differences in children’s learning. While global estimates of bilingual children’s vocabulary size are useful in highlighting parallels between monolinguals and bilinguals, it is also informative to examine the growth of bilingual children’s vocabulary in each language separately.
454 Christine E. Potter and Casey Lew-Williams Bilinguals’ knowledge is not evenly distributed across their two languages, and children’s vocabularies grow more quickly in the language to which they receive more exposure (e.g., Potter, Fourakis, Morin-Lessard, Byers-Heinlein, and Lew-Williams, 2019). In addition, while there are correlations between children’s early and later vocabulary within a language, the number of words that children know in one language is largely unrelated to their knowledge of the other language (e.g., Conboy and Thal, 2006; Hurtado et al., 2014). In other words, children’s later vocabulary in English is highly related to how many English words they knew in infancy, but not to the number of Spanish words that they know. This dissociation underscores the complexity of predicting children’s development and emphasizes that each child’s vocabulary reflects their own unique experiences.
21.4.3.2 Bilingual input Recent studies have also consistently shown that bilingual children’s vocabulary in a given language depends on the quantity and quality of input that they receive in that language (Hurtado et al., 2014; Marchman, Martínez, Hurtado, Grüter, and Fernald, 2017; Place and Hoff, 2011). This research is an important complement to studies showing links between parents’ language use and monolingual children’s vocabulary; examining bilingual children’s vocabulary separately in two languages can help to disentangle the direct effects of input in one particular language from other potential influences such as genetics, SES, or global parenting strategies. Children’s vocabulary in each language can be predicted by both early patterns of exposure (Hurtado et al., 2014) and the concurrent balance of language use in their environments (Marchman et al., 2017; Place and Hoff, 2011). Children’s processing is also more efficient in the language with which they have more experience (Hurtado et al., 2014), and they show different neural responses to words in the language they hear more frequently (Conboy and Mills, 2006). Bilingual children are also better able to recognize words in challenging contexts (such as sentences with language mixing; e.g., Do you like the perro?) if the key words appear in the language they hear more frequently (Potter et al., 2019). These findings provide additional evidence for the view that bilingual children’s vocabulary in each language depends on their specific experience with that language. Importantly, bilingual children not only tend to receive unequal amounts of exposure to each language, but their social experience with each language can also differ in ways that affect learning. For example, children may interact with more native speakers in just one of their languages, and more non-native speakers in the other, and experience with native speakers tends to better support vocabulary growth (Place and Hoff, 2011). In addition, there are often differences in the perceived status of each language in society, which likely has consequences for parents’ and children’s language use. This can mean that children’s skills in their community’s majority language improve at the expense of their skills of their other language. For instance, in the English-dominated United States, the development of children’s vocabulary in English has been shown to be negatively related to their vocabularies in Spanish (Duursma, Romero-Contreras, Szuber et al., 2007; Hoff, Rumiche, Burridge, Ribot, and Welsh, 2014). Likewise, Inuit children
Differences in vocabulary growth across groups and individuals 455 in Canada who receive more instruction in a majority language (French or English) show weaker knowledge of their heritage language (Wright, Taylor, and Macarthur, 2000). Conversely, increased exposure to the minority language does not appear to be associated with weaker knowledge of the majority language (Cha and Goldenberg, 2015). Complex social factors, combined with the prevalence of the language in children’s experience, contribute to the ease with which children learn language. These findings showing the independence of bilingual children’s knowledge across their two languages allow us to examine the influences of specific experiences and to isolate these effects from child-specific factors, such as general cognitive abilities, as we attempt to understand variation in children’s vocabulary.
21.4.3.3 Consequences of bilingual experience on learning Though monolinguals and bilinguals may acquire their vocabularies at a similar rate, the different nature of their experience means that there may still be differences in the processes by which they learn. For instance, bilingual infants who are exposed to similar-sounding languages may be sensitive to cues in both the auditory and visual input that help separate their two languages, such as speakers’ mouth movements (Birulés, Bosch, Brieke, Pons, and Lewkowicz, 2019; Bosch and Sebastián-Gallés, 2003). Bilingual infants can also learn that pitch contours of otherwise identical words can be used to distinguish possible referents, while monolingual infants learning a non-tonal language do not, suggesting that bilingual experience informs the cues to which infants attend (Graf Estes and Hay, 2015; Hay, Graf Estes, Wang, and Saffran, 2015). Other studies have shown differences in how monolinguals and bilinguals learn new words. Unlike monolinguals, bilinguals regularly encounter multiple words for the same object, and therefore do not appear to assume that each object should have a unique label. As a result, they are more likely to accept multiple labels for the same object and less likely to assume that a novel label must refer to a novel object (Byers-Heinlein, 2017; Byers-Heinlein and Werker, 2009; Kandhadai, Hall, and Werker, 2017). This research comparing bilinguals and monolinguals highlights yet another way that children’s experience with language changes how they learn words.
21.5 Conclusions In recent decades, research on language development has increasingly emphasized the importance of understanding how differences in children’s early experiences influence their long-term outcomes. In this chapter, we have reviewed evidence for a number of main conclusions. First, across different communities and cultures, children’s vocabulary knowledge is directly related to the quality of their language input. However, much of this evidence consists of somewhat coarse descriptions, such as the total number of words that children hear in particular contexts. Second, we have argued that language input cannot be treated as an independent force. Children play an active role in shaping
456 Christine E. Potter and Casey Lew-Williams their input; they vary in what information they extract from their input; and individual differences in vocabulary growth reflect children’s own abilities, interests, and prior experience. Third, learning effects are cumulative. Children use their prior knowledge to learn new words, and they are more likely to elicit rich, complex input when they are able to demonstrate understanding. Thus, by continuing to explore reciprocal influences of children’s knowledge and experience with language, we will have the opportunity to understand why and how early experiences may have such sizable effects on children’s development. Finally, we have attempted to illustrate both the theoretical and practical value of examining individual differences in children’s learning among different populations and contexts. As developmental science increasingly recognizes the need to examine development in diverse environments, we will have the opportunity to test claims about similarities and differences across languages and developmental contexts, as well as to generate predictions about a wider range of language-related phenomena. In particular, our final section suggests that the independence with which bilingual children learn each of their languages provides powerful support for the view that children’s experiences, rather than their innate capacities, can explain differences in the success with which they learn words. A deeper understanding of links between input and learning will be essential for developing complete theories of vocabulary development that capture the experiences of children from diverse populations, as well as for designing interventions that reduce input-related disparities and support the skills of children with developmental delays.
PART III
ACCESSING THE MENTAL LEXICON
Part IIIA
VIA FORM
Chapter 22
Spoken word recognition
James S. Magnuson and Anne Marie Crinnion
22.1 The problem: scope and division of labor Recognizing words seems easy. We just hear them. Naïve listeners report that word recognition is normally effortless, and at least as easy as reading words on a printed page. You might share the intuition of a prominent auditory neurophysiologist who said that he was surprised that speech scientists struggled so much to explain speech perception, since there are clear breaks between words in fluent speech. A quick look at some spectrograms disabused him of this opinion (there are no reliable acoustic cues to boundaries between words, let alone phonemes—consonants and vowels). In fact, acoustic breaks are more likely within words than between words (the one clear break in “where were you a year ago” occurs as part of the /g/in ago). It is easy to underestimate the challenge of explaining human speech recognition, where the buzzes, hisses, pops, and whistles emanating from a speaker’s vocal tract result in a listener typically recovering the message the speaker wishes to transmit. It is in fact so challenging that multiple subfields of cognitive science and neuroscience are devoted to small pieces of the puzzle. We have largely broken the problem into four pieces: speech perception (roughly, mapping from acoustics to phonological categories like phonemes), spoken word recognition (SWR; mapping from phonemes or something similar to sound forms of words), sentence processing (mapping series of word forms to syntactic structures, constrained by semantics), and pragmatics and discourse (situating sentence processing within a larger set of utterances, such as a narrative or conversation). You may notice some gaps. For example, why should spoken word recognition end with form recognition? There is indeed a rich literature on accessing semantic representations in spoken word recognition (see Rodd, this volume; Piñango, this volume; and Magnuson, 2017). Of course, one should also question what might be lost by not considering the lowest levels of language processing in the context of the
462 James S. Magnuson and Anne Marie Crinnion highest, and vice-versa, and we will return to this at the end of the chapter. For now, let’s consider why the subfield of SWR typically starts with abstract phonological inputs rather than the speech signal and why it typically ends with form and not meaning—a state of affairs we will call the Simplified Mapping Perspective (SMP). The SMP stems from very practical simplifying assumptions that emerged in the 1980s when theories of spoken word recognition came into their own. Although there were some attempts to develop models that could operate directly on speech (Elman and McClelland, 1986; Klatt, 1979), a set of longstanding unsolved problems in theories of speech perception (mainly unsolved to this day) and lack of computing power motivated this simplification. For at least 120 years (Bagley, 1900–1901), psychologists have recognized that SWR offers plenty of mysteries of its own, even if we step away from speech perception and constrain the problem to mapping from something like a string of phonemes to lexical forms. If we simplify the scope of our problem in this way (i.e., adopt the SMP), and leave the acoustic-phonetic challenges to specialists in speech perception, we can posit a fundamental process that seems to be necessary for recognizing spoken words: mapping an uninterrupted stream of phonemes onto a series of word forms stored in memory (indices to the so-called mental lexicon). This is spoken word recognition construed as form recognition. For now, we will not consider meanings of individual words, complications of morphological processes,1 lexical semantics of sets of words, nor potential constraints of syntax or discourse. And again, the SMP starts with something like phonemes as inputs, so for now we will not concern ourselves with lower-level questions that we relegate to the domain of speech perception.2 Defining the scope like this requires us to adopt at least three simplifying assumptions that we know are wrong: that we can isolate a process of spoken word recognition from the rest of language and cognition, that the input can be conceived of as something like phonemes, and that word forms are a plausible stopping point (i.e., that there are not rich interactions with processing of morphology, meaning, and syntax). But if we reject these simplifying assumptions, we must adopt others, or face an endless regress where at the “low” end we go deeper and deeper, from phonemes to phonetic cues to acoustic energy impinging on hair cells to the dynamics of networks of cells to individual cells, ad infinitum. We can do the same thing at the “high” end, progressing from word meaning to lexical semantics to sentence level constraints to discourse constraints to a listener’s personal experience with each word and phrase and any memories or emotions they trigger, to social interactions with an interlocutor, to cultural concerns, and so forth. It is not possible to consider all these levels at once. So in any domain, developing theories and understanding requires a reductionist approach to break 1
Typically, the SMP focus is restricted to lemma forms: “citation,” uninflected forms, such as BARK but not BARKS or BARKED. 2 The SMP also harks back to Marslen-Wilson’s (1987) proposal that SWR emerges from three distinct functions that operate in parallel: access (form activation), selection (discrimination among forms, essentially recognition via determining best fit), and integration (with higher levels of processing).
Spoken word recognition 463 problems into manageable pieces. A degree of fundamental understanding of the pieces is a prerequisite for developing integrative theories. As understanding of smaller pieces of the puzzle emerges, the scope of the problem can be expanded. Of course, developing micro-theories poses risks (cf. the parable of blind men examining an elephant, coming to radically different conclusions about its nature depending on whether they inspect the trunk, ear, leg, tusk, etc.). Later, we will consider the potential perils of segmenting the system incorrectly, or of losing sight of simplifying assumptions. For now, we will adopt the conventional phonemes-to-form definition of spoken word recognition, and focus on the great strides psycholinguists have made within this scope. We will divide the rest of this chapter into two major sections: core challenges that current SWR theories and models face within the SMP (a mainly historical overview of SWR theories and models and the data that motivate them), and future challenges, where we review phenomena that are clearly essential for true understanding of human SWR, but are mostly outside the scope of current theories and models. In a nutshell, by walling off SWR from the unsolved challenges of speech perception and sentence processing (not to mention neural-level findings, which we will touch upon briefly later), the SMP has allowed tremendous progress toward understanding crucial components of word-level processing and representation that is largely prerequisite to integrative theories spanning these levels. It is possible that solutions to currently unresolved challenges and debates may emerge as we develop models that allow lower and higher levels of language and cognition to constrain SWR.3
22.2 Core challenges for the SMP of SWR The primary challenge is mapping the stream of phonemes onto words. One possibility would be to buffer some amount of speech into short-term memory and analyze it in chunks. But what would the chunks be? This idea runs right into what is known as the segmentation problem: there are few cues to word boundaries. Because the input must be processed before word boundaries can be discovered, the chunks could not be words. Perhaps the chunks could be longer utterances, such as phrases or sentences. However, phrase and even sentence boundaries can also be uncertain—another segmentation problem.4 Also, in many contexts, speech proceeds quickly with few pauses (e.g., 3
One might argue that our description of the SMP is an oversimplification; as we shall see, several SWR studies since the 1980s have strayed beyond its boundaries. This is a fair point, and—spoiler alert— we will reject the SMP ourselves. However, we find it a useful approximation of the modal approach in SWR over the last few decades (most models of SWR conform to it, for example), as well as a foil that highlights exceptions to it and potential alternatives. 4 A third segmentation problem is found in the domain of speech perception, where the information specifying adjacent (or sometimes more distant) consonants and vowels overlaps in time, precluding clear phonemic segmentation.
464 James S. Magnuson and Anne Marie Crinnion interlocutors taking immediate or even overlapping turns), so this idea might pose impractical computational demands. Another intuitive approach might be to consider the lexicon as a series of branching paths, as in Figure 22.1. There would be one tree for every possible onset phoneme, which could potentially branch to every other possible phoneme, and so on (though in
Figure 22.1 Examples of lexical trees beginning with /b/and /k/, showing only a small subset of branches (some of which are truncated). Note that in some cases the path to a terminal node (which cannot be continued to form a word without considering morphological inflections) passes through multiple words that are onset-embedded in one or more longer words (e.g., the branch for BASKET passes through BASS and BASK). Reproduced from Magnuson (2020a).
Figure 22.2 Flow chart illustrating the basic algorithmic principles of the Cohort Model (Marslen-Wilson and Welsh, 1978). Reproduced from Magnuson (2020b).
fact, only a subset of possible branches occurs in English [for example], and word length is finite). Recognizing a word would simply be a matter of mapping the phoneme stream onto the correct branch. As we can see in Figure 22.1, a challenge crops up immediately: the embedding problem (McQueen, Cutler, Briscoe, and Norris, 1995). Many words are embedded within longer words. We need a mapping process that won’t prematurely “recognize” BASS or BASK when hearing BASKET. If we only consider words spoken in isolation, this is not much of an issue; the system can wait for silence to mark the end of the word. But how will the system manage embeddings in fluent speech, without clear word boundaries? In fact, the foundational modern theory of spoken word recognition—the Cohort Model (Marslen-Wilson and Welsh, 1978)—posited that a process operating over something like this tree structure would provide a basis for mapping phonemes to words, as well as an emergent solution to word segmentation. The basic idea is depicted in Figure 22.2. When the first phoneme in the stream is encountered, it is added to a buffer. Each subsequent phoneme is added to the string in the buffer if doing so results in a continuation of a word in the lexicon. If adding the phoneme does not yield a match to a word in the lexicon, this likely indicates a word boundary (e.g., /kæt/is a word, but if the next phoneme is /r/— as in “CAT RUNS”—it does not match a word, and so a boundary is posited after /kæt/and the process begins again with /r/as the first phoneme). However, if the phoneme cannot be added to the prior string and the prior string does not correspond to a word, this means an error has occurred, and reanalysis is needed. For example, if the input were /kʌlæps/ (COLLAPSE) but the /l/was not clearly articulated and the listener mapped it to /r/, the string /kʌræ-/could be mapped to CARAFE but at the /p/at the next position, neither the prior string (/kʌræ/) nor the new string (/kʌræp/) would match a lexical item, so reanalysis would be required. The reason this was called the Cohort Model is that at the beginning of a string, all words that match the first phoneme are “activated” and form the potential recognition cohort.5 Thus, a more typical way of describing the process in Figure 22.2, rather than as 5
Note, however, that cohort is often used in SWR to refer to an onset competitor, e.g., CAB is a cohort of CAT. Note also that the competition cohort, that is, the set of items predicted to compete strongly with
466 James S. Magnuson and Anne Marie Crinnion traversing a tree, is generating a list of all matching items at the first position, and then winnowing that list as each new phoneme comes in. On this view, word onsets hold a privileged place. Even if two words have great global similarity—e.g., BATTLE and CATTLE overlap in four of five phonemes—hearing one is not predicted to activate the other if they mismatch at onset. The /b/at BATTLE’s onset constitutes positive evidence for any word beginning with /b/and evidence against any word that does not begin with /b/. On the other hand, hearing one of two words with relatively little global overlap but with the same onset (BATTLE and BAG, which match in only two of five possible positions) would be predicted to strongly activate both. Although the primacy of onsets had been long recognized (Bagley, 1900–1901), Marslen-Wilson and his colleagues transformed this intuitive idea into a formal model that generated clear, testable predictions. Indeed, they conducted comprehensive tests of the model over several years. Their methods included gating (playing a listener progressively longer snippets of a word and asking them to guess what word it would turn out to be; Grosjean, 1980) and a variety of pairwise approaches (cf. Magnuson, 2017). In pairwise approaches, a specific word is presented and the impact of hearing (or seeing) that word on the activation of one specific word (i.e., a specific target and potential competitor pair) is assessed. For example, one can test whether hearing BATTLE primes cohorts (BAND, BAG, etc.) or rhymes (CATTLE). When Marslen-Wilson and his colleagues used the cross-modal semantic priming paradigm (Swinney, 1979) to assess spread of activation based on phonological similarity (e.g., Marslen-Wilson and Zwitserlood, 1989), an asymmetry emerged. In cross- modal semantic priming, the participant hears a series of words but is focused on a visual lexical decision task. The participant decides whether each written string they see on a screen is a word or not (and of course, the domains can be flipped such that the lexical decision task is focused on spoken words). Although the spoken words are superfluous, they impact the visual lexical decision via semantic relations. For example, participants are faster to identify LOG (associate of CABIN) as a word after hearing CATTLE than after a phonologically unrelated word, such as BRAIN. This suggests that CABIN was so strongly activated when CATTLE was heard that a detectable degree of activation spread to its semantic associates. Work by Marslen-Wilson and colleagues and others motivated by the Cohort Model led to multiple discoveries regarding the incremental process of spoken word recognition. For example, the simple view of the Cohort Model depicted in Figures 22.1 and 22.2 predicts that recognition should happen at the uniqueness point. Given a word like COCONUT, the /n/represents the uniqueness point; once the /n/is encountered, there is only one lexical possibility (or two, taking into account morphological marking for number, i.e., COCONUT vs. COCONUTS). The fact that the uniqueness point is at or after word offset for many words (e.g., CAT can continue as CATTLE, CATALOG, etc.) a give target word, is not typically defined based on overlap in only the first phoneme. Here, we adopt a common standard of defining the competition cohort as words overlapping in the first two phonemes (other variants include “overlap in the first 200 msecs” or “overlap through the first vowel”).
Spoken word recognition 467 also motivates the parsing strategy depicted in Figure 22.2. Given the string, /ðʌkætəlou/ (we can approximate this phonetic transcription roughly as tha-kat-a-low, with no pauses or breaks), a word boundary would be discovered between the second and third phonemes (THE and then something beginning with /k/). A potential boundary exists after the /t/(THE CAT . . . ), but the following schwa and /l/could be added to /kæt/to form CATTLE. The final vowel still leaves an ambiguity; a possible parse is THE CATTLE LOW (the final word being a low-frequency, archaic verb, meaning to moo), and if the utterance ended there, it would be the parse. However, if the utterance continued as /ðʌkætəloupt/, the system would have to discover the parse THE CAT ELOPED. The low-frequency example of the verb TO LOW leads us to two other critical considerations in the time course SWR that emerged in studies motivated by the Cohort Model: the isolation point and the recognition point (Grosjean, 1980). The recognition point is where we can establish definitively that a listener has decided upon the identity of a spoken word (not necessarily correctly). This could be the point when the listener presses the YES button in a lexical decision (deciding that the stimulus is a word), although there is some controversy about whether a correct YES response in lexical decision necessarily requires actual recognition (vs., e.g., high but not definitive activation, depending on the nature of nonwords used, among other factors; see Balota, 1990, on the slippery notion of a “magical moment” of recognition). Grosjean (1980) instead asked participants (in a gating task) to listen to increasingly longer portions of words (from word onset), to provide their best estimate of the identity of the word, and to rate their confidence about that estimate. He defined the isolation point as the position at which a participant settled upon the correct response for the target word (that is, they correctly identified the actual word and did not change their response on subsequent gates). The isolation point necessarily precedes (or coincides with) the recognition point, but both may occur before or after the uniqueness point. Grosjean found that for single words presented without context, the isolation point was relatively late. A relatively neutral context (constraining form class and possibly concreteness, e.g., my son asked for a...) led to an earlier isolation point for appropriate words (e.g., PARROT), and semantically constraining contexts shifted the isolation point earlier (because he loves pets, my son asked for a . . . ). This raises an interesting question that we will return to later: does top-down context directly affect recognition processes and therefore contribute causally to perception (e.g., via feedback), or does context bias word recognition after bottom-up information drives activation (e.g., in a decision-stage process)? For now, we will focus on “context-free” SWR (i.e., isolated words), and will shortly consider how the lexical context of isolated words may mediate or moderate sublexical processing. However, despite the terrific amount of data that had accumulated supporting the Cohort Model, Luce (1986) and Luce and Pisoni (1998) took a contrarian perspective with the Neighborhood Activation Model (NAM). They proposed that similarity defined in a more global manner might capture aspects of competition that were not apparent
in studies using gating or pairwise approaches, which strongly supported the Cohort Model's emphasis on temporally "left-to-right" sequential processing. In contrast, they took what we call a lexical dimensions approach to investigating phonological similarity (cf. Magnuson, 2017). On a lexical dimensions approach, predictors are selected for factorial and/or regression analyses examining performance measures for large numbers of items (rather than specific pairs). Luce and Pisoni focused on two lexical dimensions: frequency of occurrence, and neighborhoods. They adopted a specific and simple neighborhood definition drawing on prior work (e.g., Greenberg and Jenkins, 1964; Landauer and Streeter, 1973; and Sankoff and Kruskall, 1983): two words are neighbors if they differ by no more than a single phonemic deletion, addition, or substitution (the so-called DAS rule). For example, CAT has the deletion neighbor AT, addition neighbors SCAT and CAST, and many substitution neighbors (e.g., BAT, COT, CAN). Luce and Pisoni acknowledged that the DAS rule makes very strong assumptions, in that it ignores potentially gradient phonetic similarity (e.g., CAD is more phonetically similar to CAT than is CALF) or potentially differential effects of position of similarity (e.g., do onset substitutions such as BAT and CAT compete as strongly as offset substitution pairs such as CAB and CAT?). However, they argued that the computational simplicity of the DAS rule provides an excellent starting point for considering effects of phonological neighborhoods (and they also explored gradient measures of similarity, as we discuss shortly). Luce and Pisoni proposed that frequency and neighborhood could be integrated into a simple choice model for spoken word recognition, paraphrased in Equation 1. A word's frequency-weighted neighborhood probability (FWNP) is calculated as the ratio of a target word t's frequency (f) to the summed frequencies of all n of its neighbors (including itself). The simple, elegant idea here is that words differing by a single phoneme are similar enough to activate each other if one of the words is heard, that strength of activation will be proportional to word frequency (i.e., prior probability), and that activated words will compete for recognition. Thus, the larger the proportion of the frequency-weighted neighborhood that the target contributes, the more easily it should be recognized. So if two target words had the same frequency-weighted neighborhood, the target with higher frequency (i.e., the one with the larger numerator) would be (predicted to be) recognized more easily. If two target words had the same frequency, the one with the "sparser" neighborhood (i.e., the one with the smaller denominator), should be recognized more easily.
\mathrm{FWNP}_t \;=\; \frac{f_t}{\sum_{j=1}^{n} f_j} \qquad (1)
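To make the DAS rule and Equation 1 concrete, here is a minimal sketch in Python. It is not Luce and Pisoni's implementation; the phoneme strings, the toy lexicon, and the frequency counts are invented purely for illustration.

```python
# Minimal sketch of the DAS rule and the categorical FWNP (Equation 1).
# Words are represented as tuples of phoneme symbols; frequencies are arbitrary.

def is_das_neighbor(w1, w2):
    """True if w1 and w2 differ by one phonemic deletion, addition, or substitution."""
    if w1 == w2:
        return False
    len1, len2 = len(w1), len(w2)
    if len1 == len2:                        # substitution
        return sum(a != b for a, b in zip(w1, w2)) == 1
    if abs(len1 - len2) == 1:               # deletion / addition
        longer, shorter = (w1, w2) if len1 > len2 else (w2, w1)
        for skip in range(len(longer)):     # drop one phoneme from the longer word
            if longer[:skip] + longer[skip + 1:] == shorter:
                return True
    return False

def fwnp(target, lexicon):
    """Equation 1: target frequency / summed frequency of target plus its DAS neighbors."""
    f_t = lexicon[target]
    denom = f_t + sum(f for w, f in lexicon.items() if is_das_neighbor(target, w))
    return f_t / denom

# Hypothetical toy lexicon: phoneme string -> frequency count (invented values).
lexicon = {
    ("k", "ae", "t"): 500,       # CAT
    ("b", "ae", "t"): 120,       # BAT (substitution neighbor)
    ("k", "aa", "t"): 60,        # COT (substitution neighbor)
    ("ae", "t"): 900,            # AT (deletion neighbor)
    ("s", "k", "ae", "t"): 5,    # SCAT (addition neighbor)
    ("k", "ae", "s", "t"): 80,   # CAST (addition neighbor)
    ("b", "iy", "k", "er"): 10,  # BEAKER (not a neighbor of CAT)
}

print(fwnp(("k", "ae", "t"), lexicon))  # larger values predict easier recognition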
Luce and Pisoni (1998) conducted factorial studies using categorical definitions of high and low word (log) frequency, neighborhood density (count of neighbors), and neighborhood frequency (summed log frequencies of neighbors). In general, higher frequency predicted better lexical decision performance (higher accuracy and faster
reaction times), while high neighborhood density (count) and high neighborhood frequency predicted worse performance. They also used a graded form of the FWNP rule, as in Equation 2 (note that we use a simplified notation, but Equation 2 is equivalent to Equation 6 in Luce and Pisoni, 1998):
\mathrm{FWNP}_t \;=\; \frac{p(t \mid i)\, f_t}{\sum_{j=1}^{l} p(j \mid i)\, f_j} \qquad (2)
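The terms in Equation 2 are unpacked in the next paragraph. Purely to illustrate the computation, here is a hedged sketch in which the empirically derived confusion probabilities Luce and Pisoni used for p(t|i) and p(j|i) are replaced by an invented positional-overlap measure, and the toy lexicon is again hypothetical.

```python
# Sketch of the graded FWNP (Equation 2). The positional-overlap "similarity"
# below is a stand-in for the empirically derived confusion probabilities used
# in the original work; it exists only to make the computation runnable.

def similarity(input_word, word):
    """Stand-in for p(word | input): proportion of aligned positions that match."""
    matches = sum(a == b for a, b in zip(input_word, word))
    return matches / max(len(input_word), len(word))

def graded_fwnp(target, lexicon):
    """Equation 2: similarity- and frequency-weighted support for the target,
    normalized over every word in the lexicon (not just DAS neighbors)."""
    numerator = similarity(target, target) * lexicon[target]
    denominator = sum(similarity(target, word) * freq for word, freq in lexicon.items())
    return numerator / denominator

lexicon = {
    ("k", "ae", "t"): 500,            # CAT
    ("k", "ae", "b"): 40,             # CAB
    ("b", "ae", "t"): 120,            # BAT
    ("k", "ae", "t", "ah", "l"): 30,  # CATTLE
}
print(graded_fwnp(("k", "ae", "t"), lexicon))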
In Equation 2, p(t|i) is the probability that the word is target t given the input i. Given that the input does indeed correspond to word t, you might expect this to be 1.0. However, this may not be 1.0 given the possibility of external or internal noise. In the denominator, p(j|i) is the probability that the input word is actually word j. To put this slightly differently, the p(t|i) and p(j|i) are the calculated similarity between i and target word t or word j. Note that n (all neighbors) in Equation 1 is replaced here with l, because now every word in the entire lexicon is considered, not just the neighbors that conform to the DAS rule. As in Equation 1, the denominator includes the target, t. Note also the relation between Equations 1 and 2; in Equation 1, the similarity calculation is dropped because words are defined categorically as neighbors (those conforming to the DAS rule) or not. Luce and Pisoni (1998) reported that for an identification-in-noise task, the FWNP accounted for approximately four times as much variance as word frequency at high and moderate signal-to-noise ratios (SNRs): 16% vs. 4% with a +15 dB SNR, and 22% vs. 5% at +5 dB SNR. The FWNP and frequency accounted for similar variance at a low SNR (5% vs. 6% at a -5 dB SNR). The graded similarity metric makes some surprising predictions. For example, it correctly predicts that VEER should prime BULL (which differ only by a single phonetic feature at each position; Luce, Goldinger, Auer, and Vitevitch, 2000).6 However, the DAS variant of the FWNP (Equation 1) has come into common usage, with little exploration of the graded version (Equation 2). An obstacle to using the graded version is the need for a precise basis for calculating similarity. Luce and his colleagues derived confusion probabilities for the specific parameters of the experiments in which they have applied the graded metric (e.g., phoneme or diphone confusions for specific SNRs, talkers, etc.). This is a gap that could be filled by the use of a generic similarity metric based on acoustic-phonetic features, but to our knowledge, this approach has not yet been taken. Let’s consider how the NAM and Cohort Model relate to one another. In Figure 22.3, we present subsets of the neighbors and cohorts of the word CAT (/kæt/). Neighborhoods and cohorts overlap in substitution neighbors that overlap at all but the 6 An
important extension of the framework is the concept of probabilistic phonotactics (Vitevitch and Luce, 1999), which considers positional likelihoods of each phoneme and diphone in a word. While these measures tend to increase with frequency-weighted neighborhood, probabilistic phonotactics adds important information, and many studies now control probabilistic phonotactics as well as neighborhood.
470 James S. Magnuson and Anne Marie Crinnion Neighbors bat hat mat ... cot kit cut ... at scat
Cohorts cab cad calf cam can cap cast cat calve ...
cabbie caddy cafeteria camera candy captain castle cattle cavalry ...
Figure 22.3 Illustration of the differences and overlap in competitor sets predicted for the target word CAT by the Neighborhood Activation and Cohort Models. Reproduced from Magnuson (2020c).
final position, and addition neighbors that preserve the first two phoneme positions. But neighbors include non-cohorts with substitutions or deletions at the first or second positions, and cohorts include items that mismatch by more than one phoneme, so long as they overlap in the first two positions. Which should we prefer? Informally, it would appear that the NAM has outstripped the Cohort Model in SWR. Anecdotally, reviewers routinely insist upon neighborhood being controlled in word recognition studies, but rarely comment upon cohort size or density. The NAM approach has also proved useful in designing clinical assessments and interventions for language impairments (e.g., Kirk, Pisoni, and Osberger, 1995; Morrisette and Geirut, 2002; Sommers and Danielson, 1999; Storkel et al., 2010, 2013). However, it’s not clear that the apparent dominance of the NAM approach is fully warranted, nor has there been a clear reconciliation of the two approaches. A study by Allopenna, Magnuson, and Tanenhaus (1998; see Figure 22.4) suggested a middle ground that is consistent with one of the best-known computational models of human SWR, TRACE (McClelland and Elman, 1986). In an early “visual world paradigm” (Tanenhaus, Spivey- Knowlton, Sedivy, and Eberhard, 1995) study, Allopenna et al. presented participants with displays containing four shapes and four images of objects. Participants followed spoken instructions to interact with one of the images (e.g., “pick up the beaker; now put it below the diamond”). On critical trials, the target image (e.g., BEAKER) was accompanied by a potential cohort competitor (BEETLE) and/or a rhyme competitor (SPEAKER) along with images of one or two phonologically unrelated items (Figure 22.4C). Allopenna et al. tracked the proportion of fixations to each object from the onset of the target name in the “pick up the . . . ” instruction. The time course is shown in Figure 22.4B. There was early, strong competition between cohorts (which makes intuitive sense, since the input initially matches the target and its cohort equally well), but also later, weaker competition between rhymes. Changes in fixation proportions over time mapped onto phonetic similarity with an approximately 200-ms lag, which is a typical latency for saccades in cognitive tasks (Carpenter, 1988; Viviani, 1990),
Spoken word recognition 471
Figure 22.4 How the TRACE model is linked to the visual world paradigm task, adapted from Allopenna et al. (1998). Participants saw displays like the one in C. As participants followed spoken instructions like “pick up the beaker,” the proportion of fixations to different object types mapped onto phonetic similarity over time with an approximately 200-ms lag (panel B). When TRACE simulations were conducted with analogous items, similar trends were observed (A). When TRACE activations were constrained to the same four-alternative choices presented to participants (C), and thus rescaled as predicted response probabilities (D), a very close fit was observed.
and nearly as fast as could be expected, given that saccade latencies to a point of light in a darkened room are approximately 150 ms (Fischer, 1992; Matin, Shao, and Boff, 1993; Saslow, 1967). Notably, the cohort competition they observed is not predicted by NAM, since the cohorts differed by more than one phoneme. Similarly, the rhyme competition is not predicted by the Cohort Model.7 Thus, the time course suggests a possible reconciliation between cohort and NAM predictions. While cohort competition is indeed strong,
7
In addition, the rhymes in the example are not neighbors, as BEAKER and SPEAKER differ by two phonemes. However, all other rhyme pairs differed by just one phoneme (e.g., SANDAL/CANDLE).
[Figure 22.5 layers, top to bottom: Words; Phonemes; Pseudo-spectral features]
Figure 22.5 Schematic of the architecture of the TRACE model. Arrows indicate positive, selective connections (features connect to phonemes for which they are relevant, phonemes connect to words that contain them, and words connect back to constituent phonemes). Knobs indicate lateral inhibition. Reproduced from Magnuson (2020d).
there is also weaker competition from at least one non-cohort type of neighbor. The fact that competition between rhymes is relatively weak suggests that the failure to observe rhyme effects in pairwise approaches was likely due to the use of low-sensitivity paradigms rather than the actual dynamics of lexical access and competition in online SWR. For example, consider again cross-modal semantic priming. To detect effects in this paradigm, activations must be strong enough to cross modalities (auditory-visual) and spread form-based activation via semantic relations. If form-based rhyme effects are relatively weaker than cohort effects (Figure 22.4B), it is unsurprising that rhyme effects might be too weak to drive cross-modal semantic priming. Let us consider how the human-like competition timecourse emerges in the TRACE model. The basic architecture of TRACE is presented schematically in Figure 22.5. In TRACE, “pseudo-spectral feature” inputs (elements representing features such as diffuseness or burstiness that ramp on and off over time, with adjacent phoneme patterns overlapping as a coarse analog to coarticulation) activate corresponding phonemes. Activated phonemes stimulate word nodes that contain them. As word nodes become activated, they send feedback to constituent phonemes. Crucially, activations are governed by competition via lateral inhibition.8 As a word is input incrementally to TRACE, items overlapping at onset (cohorts) naturally become activated since they match the beginning of the input. Rhymes can be activated as the input begins to activate their constituent phonemes. Importantly, lexical feedback will even drive modest activation of the initial phoneme of a rhyme (though not sufficiently for that phoneme to be as strongly activated as the phonemes in the bottom-up input). One might expect that because BEAKER and SPEAKER overlap in 8
The schematic in Figure 22.5 belies a considerably more complex architecture, where many copies of each feature, phoneme, and word node are tiled in the TRACE memory over time (essentially creating a temporotopic map, which allows interaction not just vertically, from words to phonemes, but horizontally between units that are aligned with different stretches of time in TRACE’s memory). Lateral inhibition is relatively “local,” with connections constrained to units with similar temporal alignments. For detailed discussions of the TRACE architecture, see McClelland and Elman (1986), Magnuson, Mirman, and Myers (2013), and Magnuson, Mirman, and Harris (2012).
Spoken word recognition 473 four phonemes but BEAKER and BEETLE only overlap in two, BEAKER should activate SPEAKER more strongly than BEETLE. However, this does not happen due to lateral inhibition. By the time the input begins to activate the rhyme, the target word and all of its cohorts are moderately activated (e.g., around cycle 20 in Figure 22.4A). The inhibition they jointly send to the rhyme prevents the rhyme from becoming as strongly activated as cohort items despite greater global similarity. Thus, the incremental nature of the input promotes a strong advantage for cohort competition (note that McClelland and Elman (1986) cite the Cohort Model as their inspiration). Since there is no explicit tracking of word onsets or removal of items with mismatches (lateral inhibition suppresses rather than categorically removes items from the competition set) items that mismatch at one or more positions can become activated, with degree of activation depending on many complex factors (position[s]of mismatch[es], number of other items similar to the target and/or potential competitor, etc.). One way that TRACE could be made more consistent with the Cohort Model would be to use bottom-up inhibition: that is, phonemes would send inhibition to words they do not occur in (in a position-specific fashion; see Footnote 4; to our knowledge, this has not been attempted with TRACE). Note that lateral inhibition has important advantages over categorical removal based on mismatches; in particular, it allows rhyme effects to emerge, and it also makes TRACE quite robust against noise (see, e.g., Magnuson, Mirman, Luthra, Strauss, and Harris, 2018), and replaces the need for the (unspecified) reanalysis mechanism in the Cohort Model with a of more tolerant form of constraint satisfaction. So, does TRACE reconcile differences between NAM and the Cohort Model? Yes and no. On the one hand, it provides a mechanistic explanation for how the timecourse of phonological competition might emerge in human listeners, with surprisingly fine- grained, nearly millisecond-scale simulations (Figures 22.4A and 22.4D). However, while some general principles that govern TRACE’s behavior can be articulated (as we have just attempted), it does not provide an explicit similarity metric, let alone a categorical (or even graded) definition of competitors. Its central tendencies (e.g., the averages presented in Figures 22.4A and 22.4D) closely correspond to those of human behavior (e.g., Figure 22.4B). But TRACE does not generate convenient, precise numbers corresponding to its word-specific predictions like NAM does. One could attempt to do this by operationalizing recognition time in TRACE, for example. However, TRACE has a modest phoneme inventory (14 rather than the ~40 required for English) and a fairly small lexicon (212 words originally, which has been extended to ~1,000 in two reports; Frauenfelder and Peeters, 1998; Magnuson et al., 2018). Another challenge for models of SWR is the fact that phonemes may not be processed in strict temporal order. Toscano, Anderson, and McMurray (2013) used a visual world paradigm to show that listeners look more to phonemic anadromes (words that share the same phonemes but in a different order, e.g., GUT for target word TUG) than to another word that shares the same vowel and one consonant (i.e., has word initial mismatch; GUN for target word TUG) or to an unrelated word. Notably, TRACE was unable to simulate these effects (although it might if one were to extend the spread of
coarticulation further left and right in TRACE). Similar effects have also been shown in a short-term priming task (i.e., a phonemic anadrome primes a target better than an unrelated prime or a prime that has word initial mismatch but shares a vowel and consonant; Dufour and Grainger, 2019). Furthermore, Dufour and Grainger (2020) found that these priming effects occur only when the target has a higher frequency than the prime, suggesting that lexical representations underlie these effects. While these findings suggest that spoken word recognition does not occur in a strictly sequential way, findings from Gregg, Inhoff, and Connine (2019) suggest that overlap in vowel position is important for these effects to be observed. In other words, while Gregg and colleagues found more looks to an anagram than an unrelated word, for pairs like LEAF (target) and FLEA (competitor), where the same consonants are present in a different order, but the position of the vowels does not overlap in the two words, they observed no more looks to the competitor than to a completely unrelated word. These effects suggest that models of spoken word recognition need to reconsider strict temporal ordering of constituent phonemes,9 include frequency as a core concern in models, and also that models need to consider different saliency of vowels vs. consonants—none of these are considered by default in TRACE, although jTRACE (Strauss, Harris, and Magnuson, 2007) includes three different ways to enable frequency. Despite these limitations, TRACE has proved to account for a broad and deep set of phenomena in human SWR and speech perception (see Table 22.1). There are also a few reported failures (e.g., Chan and Vitevitch (2009) reported that simulations with jTRACE (Strauss et al., 2007) failed to simulate differences they observed with a factorial manipulation of clustering coefficient—proportion of a word's neighbors that are also neighbors of each other—that we review briefly below). The fact that TRACE simulates so many results with its default parameter settings is important to note, as some models require parameter changes for simulating different phenomena (see Magnuson et al., 2012, for a review). However, it is crucial to note that successful simulations do not establish that the mechanism proposed within a model is correct. Indeed, much of the SWR literature since the mid-1980s has revolved around disagreements regarding the algorithms proposed in competing models. We do not have space in this chapter to discuss most models of SWR (e.g., the Distributed Cohort Model (Gaskell and Marslen-Wilson, 1997, 1999), PARSYN (Luce et al., 2000), TISK (Hannagan et al., 2013), or various models within the Adaptive Resonance Theory (ART) framework (Grossberg, Boardman, and Cohen, 1997; Grossberg, Govindarajan, Wyse, and Cohen, 2004; Grossberg and Myers, 2000)). 9 The
Time Invariant String Kernel (TISK) model of spoken word recognition (Hannagan, Magnuson, and Grainger, 2013) may account for these results. TISK is an interactive activation model like TRACE, but instead of reduplicated, strictly ordered phoneme templates (see Footnote 7), it uses a form of “open diphone” coding, where diphone units are activated by every ordered pair of phonemes in a word. For example, when cat is presented, nodes for /kæ/and /æt/are activated, but so is a node for /kt/(hence, “open,” or not necessarily adjacent). This allows for a much more compact model compared to TRACE. It would also seem likely to tolerate phoneme transpositions better than TRACE but, to our knowledge, it has not been applied to such results yet.
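To make the open-diphone scheme concrete, the following toy sketch (our own illustration, not code from the TISK implementation; the ASCII phoneme spellings are rough stand-ins) computes the units a phoneme string would activate under this kind of coding, including position-independent single-phoneme units.

```python
from itertools import combinations

def tisk_units(phonemes):
    """Units activated by a phoneme string under open-diphone coding:
    every single phoneme plus every ordered (not necessarily adjacent) pair."""
    singles = set(phonemes)
    diphones = {a + "-" + b for a, b in combinations(phonemes, 2)}
    return singles | diphones

# /k ae t/ 'cat' activates k, ae, t, k-ae, k-t, and ae-t
print(sorted(tisk_units(["k", "ae", "t"])))

# A transposed form (an anadrome) still shares all single-phoneme units,
# though its ordered diphones differ.
tug, gut = ["t", "ah", "g"], ["g", "ah", "t"]
print(tisk_units(tug) & tisk_units(gut))
```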
Spoken word recognition 475 Instead, we will focus briefly on the Shortlist and Merge “A” and “B” models proposed by Norris (1994; Shortlist A), Norris, McQueen, and Cutler (2000; Merge A)) and Norris and McQueen (2008; Shortlist B and Merge B), as they represent a theoretical position (autonomy) fundamentally opposed to TRACE’s interactive activation. Currently, the primary alternative to interactive models like TRACE is an autonomous framework without feedback. The fundamental premise is that any phenomenon that can be simulated with feedback could be simulated by an autonomous model. Consider, for example, one of the most familiar effects that appears to support the notion of lexical feedback: the Ganong effect (Ganong, 1980). In the Ganong paradigm, a continuum is created from a word to a nonword, for example, SHAPE-*SAPE, and participants identify the phoneme that is changing (‘sh’ vs. ‘s’ in this case). Compared to a nonword-nonword continuum or a word-word continuum (e.g., SHIP-SIP), where one would observe a classic categorical perception pattern with an apparent boundary in the middle of the continuum (first few steps identified as ‘sh,’ last few as ‘s,’ with intermediate response rates for items in the middle of the continuum, reflecting their ambiguity), the pattern shifts for a word-nonword continuum; the boundary shifts toward the nonword (such that the previously ambiguous items are more likely to be identified as consistent with the word, and previously unambiguous steps nearer to the nonword continuum become ambiguous). This pattern is naturally explained under a feedback account. Phonemes receive two sources of input in a model like TRACE: bottom-up input and lexical feedback. When there is a lexical item consistent with one endpoint of a continuum but not the other, the phonemes consistent with the lexical endpoint receive more total activation, because lexical knowledge directly mediates sublexical activation. This proposed mechanism also readily accounts for other top-down effects; we will discuss two more examples. The first is the word superiority effect, where phonemes can be detected more quickly in word than nonword contexts (Rubin, Turvey, and van Gelder, 1976)—on the interaction account, this arises due to the boost of top-down feedback for phonemes in words. The second is phoneme restoration (e.g., Warren, 1970). If a phoneme in a word is replaced by noise (e.g., the /t/in RESTORE), participants report hearing noise, but also report hearing the missing phoneme, and have difficulty identifying exactly where the noise was positioned within the word; however, if the phoneme is replaced by silence, participants precisely identify the position of the gap and do not report hearing the missing phoneme (see, e.g., Samuel, 1981, 1996, 1997). This phenomenon demonstrates both the potential for top-down feedback to restore an ambiguous or noise-masked (or replaced) portion of a stimulus and the need for models to exhibit bottom-up priority (under the assumption that noise allows partial activation of all phonemes, which is enough to allow lexical feedback to boost the missing phoneme sufficiently).10
10 For a debate as to whether TRACE appropriately accounts for phoneme restoration, see Grossberg and Kazerounian (2011, 2016) vs. Magnuson (2015).
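The way lexical feedback could produce the Ganong boundary shift can be illustrated with a toy calculation. This is a deliberate oversimplification, not the TRACE model or its parameters; the feedback gain and evidence values are invented. Each phoneme's support is its bottom-up evidence plus feedback from any word consistent with it, and identification follows from the relative support for the two candidates.

```python
def identify(bottom_up_sh, lexical_context):
    """Toy Ganong demo: a phoneme's activation is its bottom-up evidence
    plus feedback from any word consistent with that phoneme.
    lexical_context is 'sh-word' (e.g., SHAPE-*SAPE), 's-word', or 'neither'."""
    feedback = 0.3   # arbitrary feedback gain, for illustration only
    act_sh = bottom_up_sh + (feedback if lexical_context == "sh-word" else 0.0)
    act_s = (1.0 - bottom_up_sh) + (feedback if lexical_context == "s-word" else 0.0)
    return act_sh / (act_sh + act_s)   # proportion of 'sh' responses

continuum = [i / 6 for i in range(7)]   # 0 = clear 's', 1 = clear 'sh'
for context in ["neither", "sh-word", "s-word"]:
    print(context, [round(identify(step, context), 2) for step in continuum])
# When only one endpoint is a word, the p = .5 boundary shifts toward the
# nonword end, so ambiguous steps are identified as the word-consistent phoneme.
```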
Table 22.1 Effects accounted for by the TRACE model. As we discuss, there are also a few reported failures with the TRACE model, and successful simulations do not establish that algorithms implemented in a model are correct—they simply confirm that the model's algorithm is plausible and not-yet-falsified. This list provides a sense of the breadth and depth that has been achieved with TRACE, and a set of candidate benchmarks for other models.

Effect/phenomenon | Reference | Notes
1. Lexical context mediates phoneme perception ("Ganong effect"; Ganong, 1980) | McClelland and Elman (1986), p. 24 | Original parameters
2. Elimination of lexical effects under time pressure | ibid., p. 26, Figs. 5 and 6 | Original parameters
3. Phoneme position impacts lexical effects (weaker at word onset than offset) | ibid., pp. 27, 29, and 30, Figs. 8-11 | Original parameters
4. Dependence of lexical effects on phonological ambiguity (demonstrating bottom-up priority) | ibid., p. 28 | Original parameters
5. Lexical basis for phonotactic influences | ibid., pp. 33-35, Fig. 12 | Original parameters
6. Trading relations among phonetic cues | ibid., p. 38 | Original parameters
7. Categorical perception (identification and discrimination, including reaction time and timecourse effects) | ibid., p. 42 | Original parameters
8. Recovery from noisy input, and influences of lexical neighborhood | ibid., p. 55 | Original parameters
9. Time course of word recognition | ibid., p. 57 | Original parameters
10. Frequency and context effects | ibid., p. 60 | Original parameters
11. Lexical segmentation, including multiword sequences, and impact of right context | ibid., pp. 61-69 | Original parameters
12. Compensation for coarticulation | Elman and McClelland (1988), Fig. 3A | Original parameters
13. Lexically mediated compensation for coarticulation | ibid., Fig. 3B | Original parameters
14. Stochastic noise allows correct simulation of classical context effects that the basic TRACE model fails to simulate correctly | McClelland (1991) | Modified parameters and noise
15. Timecourse of cohort and rhyme competition observed in human eye tracking study (eye tracking data) | Allopenna et al. (1998), Expt. 1 (Fig. 4, this chapter) | Original parameters + Luce choice rule
16. Elimination of rhyme effects in a gating paradigm (eye tracking data) | ibid., Expt. 2 | Original parameters
17. Time course of word frequency effects (eye tracking data) | Dahan, Magnuson, and Tanenhaus (2001) | Original parameters + 3 alternative implementations of frequency
18. Subcategorical mismatch (eye tracking data) | Dahan et al. (2001) | Original parameters
19. Subcategorical mismatch (lexical decision data, words only—nonwords not included) | Magnuson, Dahan, and Tanenhaus (2001) | Original parameters
20. Lexically-induced delays in phoneme recognition | Mirman, McClelland, and Holt (2005) | Original parameters
21. Lexically-guided tuning of speech perception with Hebb-TRACE | Mirman, McClelland, and Holt (2006) | Extensions for learning
22. Attentional modulation of lexical effects | Mirman, McClelland, Holt, and Magnuson (2008) | Original parameters + attentional scaling parameter
23. Individual differences relate to lexical decay | McMurray et al. (2010) | Variety of parameters explored
24. Changes in timecourse of phonological competition in Broca's and Wernicke's aphasia (eye tracking data) | Mirman, Yee, Blumstein, and Magnuson (2011) | Original parameters + modified Luce choice rule (other parameters explored)
25. Variation in phoneme inhibition relates to reduced rhyme + enhanced subcategorical mismatch in individuals with weaker language abilities | Magnuson, Kukona, Braze et al. (2011) | Variety of parameters explored
26. Duality of length effects (early short-word advantage, later long-word advantage) | Magnuson et al. (2013) | Original parameters; see Pitt and Samuel (2006) for human data
27. Suppression of embedded words | Magnuson et al. (2013) | Original parameters
28. Phoneme restoration as a function of word length and position | Magnuson (2015) | Original parameters
29. Flexible modulation of inhibition may explain experience-based changes in lexical competition | Kapnoula and McMurray (2016) | Original parameters + changes in phonetic features tailored to materials + inhibition variation
30. Feedback promotes resistance to noise (model-specific results [feedback turned on vs. feedback off]) | Magnuson et al. (2018) | Original parameters + parameters for large lexicon (Frauenfelder and Peeters, 1998)
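Several of the eye tracking simulations in Table 22.1 (e.g., entries 15 and 24) link TRACE activations to fixation or response probabilities via the Luce choice rule. A minimal sketch of one common form of that linking function follows; the activation values and the scaling constant are invented for illustration and are not parameters from any published simulation.

```python
import math

def luce_choice(activations, k=7.0):
    """Luce choice rule as a linking function from model activations to
    response or fixation probabilities: p(i) = exp(k*a_i) / sum_j exp(k*a_j).
    k is a free scaling constant; all values here are illustrative."""
    strengths = {word: math.exp(k * a) for word, a in activations.items()}
    total = sum(strengths.values())
    return {word: s / total for word, s in strengths.items()}

# Hypothetical activations partway through hearing BEAKER
acts = {"beaker": 0.55, "speaker": 0.40, "beetle": 0.25, "carriage": 0.05}
for word, p in luce_choice(acts).items():
    print(f"{word:9s} {p:.2f}")   # cohort > rhyme > unrelated
```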
Norris et al. (2000) propose that such effects arise post-perceptually—that is, that lexical knowledge moderates a decision process rather than mediating sublexical activation. In their Merge model, phonemes feed to words, and then both levels—phonemes and words—feed to a post-perceptual bank of phoneme decision nodes. Norris et al. argue that any post-perceptual decision appearing to support lexical feedback could be simulated by an autonomous model like Merge with post-perceptual integration.
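To see the architectural contrast in miniature, the sketch below (our own toy with invented support values, not the implemented Merge model) keeps the perceptual phoneme layer strictly bottom-up and lets only a separate bank of decision nodes combine phoneme and word support.

```python
def merge_decision(phoneme_evidence, word_support, w_phoneme=1.0, w_lexical=0.5):
    """Toy Merge-style architecture: perceptual phoneme units are driven only
    by bottom-up evidence (no feedback), while separate phoneme DECISION
    nodes integrate perceptual and lexical input. Weights are illustrative."""
    perceptual = dict(phoneme_evidence)            # unchanged by the lexicon
    decision = {p: w_phoneme * e + w_lexical * word_support.get(p, 0.0)
                for p, e in phoneme_evidence.items()}
    return perceptual, decision

# Ambiguous final phoneme after hearing 'sha_', with SHAPE in the lexicon
evidence = {"p": 0.5, "b": 0.5}      # illustrative bottom-up values
lexical = {"p": 0.8}                  # SHAPE supports /p/; there is no word *SHABE
print(merge_decision(evidence, lexical))
# The decision nodes favor /p/, but the perceptual phoneme layer is untouched,
# so an apparent 'lexical effect' arises post-perceptually.
```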
478 James S. Magnuson and Anne Marie Crinnion Norris et al. presented two simulations with Merge demonstrating its abilities to simulate apparent lexical effects. The first was subcategorical mismatch, where words are cross-spliced to introduce misleading coarticulatory cues consistent with a word or nonword. Marslen-Wilson and Warren (1994) used a lexical decision paradigm and found that words cross-spliced with a word (e.g., NET with coarticulation on the vowel consistent with NECK) or a nonword (NET with misleading coarticulation consistent with NEP on the vowel) were both recognized more slowly than a version of NET with consistent coarticulation (created by splicing together two instances of NET to ensure that any differences were not due to the cross-splicing operation). Marslen-Wilson and Warren reported that TRACE predicted a different pattern, with misleading coarticulation from a cross-spliced word leading to slower target activation than from misleading coarticulation from a cross-spliced nonword. Norris et al. found that Merge predicted the pattern observed by Marslen-Wilson and Warren (equivalently slowed responses for word or nonword cross splicing). Subsequently, Dahan, Magnuson, Tanenhaus, and Hogan (2001) used the visual world paradigm to revisit this finding. They found that activations in TRACE mapped closely onto participants’ fixation proportions. They attributed the difference between eye tracking and lexical decision data to participants responding “yes” in response to the activation of the lexical competitor when the misleading coarticulation was consistent with a word (Magnuson, Dahan, and Tanenhaus (2001) simulated “yes” responses from TRACE activations and the Dahan et al. fixation proportions; both predict the Marslen-Wilson and Warren data pattern for fairly fast responses). We will return to this result when we discuss Shortlist B. Their second simulation focused on a result reported by Connine, Titone, Deelman, and Blasko (1997), where final phonemes in nonwords were detected more quickly when the nonword was more similar to a real word (e.g., participants detected the final /t/ in *GABINET more quickly than in *MABINET; the former differs from CABINET by only a single phonetic feature). Merge readily simulated this pattern with post- perceptual lexical integration in its phoneme decision nodes. Based on these results, and detailed critical examinations of other purportedly top-down effects, Norris et al. argued that there was no evidence that definitively supported an interactive architecture over an autonomous one. Before turning to Shortlist B, we need to go on a brief tangent about another paradigm that proponents of both views agreed had strong potential to provide definitive support for interaction: lexically mediated compensation for coarticulation (LCfC). Elman and McClelland (1988) devised this paradigm by combining the Ganong effect (discussed above) with the compensation for coarticulation paradigm (Mann and Repp, 1981). In the latter paradigm, participants perform an identification task on a front-back place of articulation (POA) continuum (e.g., from TAPES [front] to CAPES [back]) without context, and with a context word ending with a sound with a front POA (e.g., MUSS) or a back POA (e.g., MUSH). Perception of the continuum shifts away from the POA of the context sound. Mann and Repp proposed that listeners learn that in real speech, if a talker has to produce two segments with distant
Spoken word recognition 479 places of articulation, they are unlikely to reach the canonical POA for the second sound. Thus, we learn to compensate for coarticulation, and accept noncanonical POA based on context. Elman and McClelland proposed that if compensation for coarticulation reflected sublexical interactions, one could test for lexical mediation of sublexical processing by replacing clear context sounds with ambiguous sounds midway between front and back places of articulation, but then use lexical context to restore one or the other. They used items like CHRISTMAS and SPANISH, with the final segment replaced with the same ambiguous fricative. Consistent with their prediction, these restored phonemes shifted the perception of the target continuum (e.g., TAPES-CAPES) in the same direction as clear tokens of ‘s’ or ‘sh.’ However, Pitt and McQueen (1998) reported that it seemed that a sublexical explanation was possible: they were able to drive compensation for coarticulation with ambiguous phonemes in nonword contexts based on the likelihood of the preceding phoneme (i.e., transitional probability). Subsequently, Magnuson, McMurray, Tanenhaus, and Aslin (2003a) reported LCfC with words where transitional probabilities were pitted against lexical context, and lexical context won. Magnuson, McMurray, Tanenhaus, and Aslin (2003b) also assessed all items that had been used in previous LCfC studies and found that the diphone transitional probabilities were at odds with lexical context in several items, and that larger n-phones could not explain all extant results (on average, the necessary n-phone was approximately the same as word length). Samuel and Pitt (2003) also reported additional positive LCfC results (along with analyses of factors that appear to affect the strength of effects). However, McQueen, Jesse, and Norris (2009), using materials supplied by Magnuson et al., were unable to replicate the earlier results. Although more positive than negative results have been reported (possibly reflecting a “file drawer” phenomenon due to the difficulty of publishing null results), the apparent fragility of the LCfC paradigm has led to an impasse (see, however, Luthra et al., 2021). With this context, let us consider Shortlist B. The “B” in Shortlist B stands for Bayesian, as Shortlist B is a radical revision of the original Shortlist model. Norris and McQueen (2008) redubbed the original model “Shortlist A,” with “A” standing for activation. Shortlist A was a neural network model, and thus its currency was node activation. It differed from the TRACE model by building competition networks on-the-fly at each phoneme position based on the top lexical matches (intended to be generated by simple recurrent networks (Elman, 1990; 1991), but generated via a simple lexical lookup) and, crucially, eschewing feedback. (For an extensive review of the case for rejecting lexical feedback, see Norris et al., 2000.) Norris and McQueen (2008) reject the notion of linking activation in network models to human behavior and cognition as convoluted (additional parameters and assumptions are frequently required to make such links). They propose instead that probabilities from a low-parameter Bayesian model can be linked directly to human behavior, while also providing a principled and optimal model of human SWR. Like Shortlist A, Shortlist B rejects top-down information flow. Shortlist B also uses
480 James S. Magnuson and Anne Marie Crinnion a different form of input (diphone confusion probabilities measured from human participants, sampled in three “gates” per phoneme). At each input step, up to 50 words aligned with the current phoneme can be added to a “lattice” (a set of paths corresponding to possible lexical segmentations of the input) consisting of a maximum 500 paths—the 50-word and 500-path limits are the “shortlist” parameters. Words and paths are ranked according to their probabilities (with words evaluated relative to their frequency and fit with the bottom-up input, and paths evaluated based on conditional probabilities). Probabilities over all paths within the lattice are ranked, and the model works in some ways very much like the Cohort Model logic we presented in Figures 22.1 and 22.2, such that for many word sequences, only one complete path is possible. When more than one path is possible (e.g., perhaps RECOGNIZE SPEECH and WRECK A NICE BEACH), prior probabilities of words or conditional probabilities of multiple words identify the most likely parse. One way in which Shortlist B compares very favorably with TRACE is that it includes the full Dutch phoneme inventory (37 phonemes) and a lexicon of 20,250 Dutch words. Norris and McQueen reported seven simulations (see Table 22.2). They also reported a new Merge model, Merge B, and used it to simulate Marslen-Wilson and Warren’s
Table 22.2 Effects accounted for by the Shortlist B model. All are reported by Norris and McQueen (2008). Note that an eighth simulation of subcategorical mismatch is not included because it was conducted with a different model ("Merge B").

Effect/phenomenon | Reference | Notes
1. Parsing a multiword sequence and overcoming embedded words via right context | Fig. 3 | TRACE: #11 in Table 22.1
2. Word frequency x neighborhood density x neighborhood frequency | Fig. 4 | TRACE has only been applied to word frequency (Table 22.1, #17) and selective neighborhood characteristics (Table 22.1, #8)
3. Word frequency x neighborhood characteristics x stimulus quality | Fig. 5 | Not directly simulated with TRACE, though #27 in Table 22.1 could be comparable for neighborhood and stimulus quality
4. Time course of word frequency effects | Fig. 7 | TRACE: #17 in Table 22.1
5. "Word spotting" on the basis of the stress-based Possible Word Constraint | Fig. 8 | Not possible in TRACE without adding stress representation
6. Identity priming based on Possible Word Constraint | Fig. 9 | Not possible in TRACE without adding stress representation
7. Recovery from onset mispronunciation | Fig. 10 | Conceptually related in TRACE: Table 22.1 #4, #8
Spoken word recognition 481 (1994) subcategorical mismatch paradigm. They compared Merge B to data from Marslen-Wilson and Warren as well as a study by McQueen, Norris, and Cutler (1999). Merge B provided quite good quantitative fits for both word and nonword cross-spliced items, with some minor discrepancies. (However, they neither mentioned the Dahan et al. (2001) eye tracking data and TRACE simulations, nor provided simulations of the time course.) Thus, Shortlist B accounts for seven phenomena, including a more detailed assessment of neighborhood and frequency interactions than has been conducted with TRACE, and simulations of the stress-based Possible Word Constraint (Norris, McQueen, Cutler, and Butterfield, 1997) that would not be possible in TRACE without adding representations for syllabic stress. This is far fewer than the list in Table 22.1, and the lack of overlap (with a few exceptions) impedes direct model comparisons. The full phoneme and large lexical inventories in Shortlist B is a considerable advance compared to other models. One potential weakness of Shortlist B is that the model is told explicitly where each phoneme begins, whereas in TRACE, phonemes overlap and their onsets are not directly encoded. As we already mentioned, there are not robust cues for phoneme segmentation in real speech, so this is a simplifying assumption that may need to be reconsidered in the future in Shortlist B. What of the feedback debate? It remains unresolved. Norris, McQueen, and Cutler (2016) present an extended case for their position that feedback is never necessary, but have extended their treatment of this issue to accept some forms of potential feedback, while continuing to reject (only) feedback as implemented in TRACE. They suggest that feedback in service of Bayesian processing is compatible with their view, because it is qualitatively different from what they label “activation feedback.” Magnuson et al. (2018) report simulations showing that feedback helps make TRACE robust against noise and makes TRACE “Bayes approximant.” They also provide a critique of Norris et al. (2016)’s arguments. In addition, McClelland (2013) and McClelland et al. (2014) provide detailed explanations of how interactive activation models relate to Bayesian models, and under what conditions they are Bayes- equivalent. Norris, McQueen, and Cutler (2018) published a commentary laying out their disagreement with Magnuson et al. (2018). On our view, this debate remains at an impasse. Until a paradigm is devised that can definitively falsify one model or the other—or until the field of SWR simply moves on to more realistic models—no resolution is in sight.11 Indeed, one might take the view that the field is moving on, with great excitement about the concept of predictive coding (PC). Rao and Ballard (1999) originally
11 In fact, after this chapter was submitted, Luthra, Peraza-Santiago, Beeson, Saltzman, Crinnion, and Magnuson (2021) published new results demonstrating robust lexically-mediated compensation for coarticulation (with an initial sample of 40 participants and a direct replication with another 40). The key was conducting pilot studies with potential items to establish separately robust phoneme restoration and compensation for coarticulation before attempting to combine items for the full paradigm. Whether these results afford progress or resolution in the feedback debate remains to be seen.
proposed PC as an extension of theoretical work in vision. In their work, a visual stimulus (image patch) was encoded by a hierarchy of "modules" with progressively larger spatial scale (e.g., first-level modules take input from small, overlapping grids of pixels tiled over part of an image, and second-level modules take input from multiple, overlapping first-level modules, and so on). Higher levels send predicted states to their immediately inferior levels, and those predictions are compared to the states of the inferior-level nodes. Inferior levels, rather than passing forward their actual state, pass forward the discrepancy between the top-down prediction and their state. This provides a potentially compact and robust code. Rao and Ballard's PC model developed receptive fields at the lowest levels that resembled wavelets based on the statistics of natural images. Higher levels developed progressively larger and more complex receptive fields for visual features (and note that that is where the model stops; it does not perform image or object recognition, for example). A possible prediction from such a mechanism is that when an input conforms to top-down expectations, the bottom-up signal should be weaker since error (which is what is passed forward) is smaller. Several neuroimaging studies have reported results consistent with this hallmark of PC. In an example from SWR, Gagnepain, Henson, and Davis (2012) presented listeners with (lexical) expectation-confirming inputs (i.e., words, e.g., formula) or (lexical) expectation-violating inputs (e.g., formubo). Gagnepain et al. found relatively weaker activation in superior temporal gyrus at the la of formula compared to the bo of formubo (i.e., for items like these; this is just an example). We note that such results are consistent with longstanding evidence that unexpected inputs can drive event-related potentials like the phonological mismatch negativity or N400 (e.g., Kutas, Van Petten, and Kluender, 2006). But do such results demonstrate predictive coding or simply predictive processing? Davis and colleagues have done some proof-of-concept modeling with simple Bayesian prediction (Gagnepain et al., 2012) and variants of interactive activation models (Blank and Davis, 2016) with multiplicative feedback from words to phonemes (somewhat like TRACE) or subtractive feedback (intended to be like PC). A full discussion of these studies is beyond the scope of this chapter, but in a nutshell, these models are neither faithful implementations of PC, nor have they been validated on basic aspects of SWR. Luthra, Li, You, Brodbeck, and Magnuson (2021) provide a fuller review of this literature, and also present simulations using the Gagnepain et al. simple Bayesian model (which we call "predictive cohort"), TRACE, and a simple recurrent network (SRN; Elman, 1990) trained on TRACE-like inputs (it predicts the current word from acoustic-phonetic features over time). All three models display predictive processing, but surprisingly, the only model that shows the putative hallmark of PC (a model-internal reduction in signal for expectation-confirming stimuli akin to formula relative to expectation-violating inputs akin to formubo) is TRACE (though the prediction error used during training for the SRN would also qualify, if we consider it model-internal). Luthra et al. argue that true understanding of the nature of neural signal changes in response
Spoken word recognition 483 to expectations, and whether they actually reflect PC, will require the development of a recurrent PC model that can be applied to SWR.
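The error-passing idea at the core of PC can be sketched in a few lines. The example below illustrates only the general mechanism described above (with invented feature vectors); it is not the Rao and Ballard (1999) model or any of the simulations just discussed.

```python
def forward_signal(observed, predicted):
    """Core predictive-coding step: the lower level passes forward the
    discrepancy (prediction error) rather than its raw state."""
    return [o - p for o, p in zip(observed, predicted)]

# Illustrative feature vectors for a word-final syllable
prediction = [0.9, 0.1, 0.8]            # top-down expectation after 'formu-'
confirming = [0.9, 0.1, 0.8]            # input like 'formula'
violating  = [0.1, 0.9, 0.2]            # input like 'formubo'

err_conf = forward_signal(confirming, prediction)
err_viol = forward_signal(violating, prediction)
print(sum(abs(e) for e in err_conf))    # ~0: little signal passed forward
print(sum(abs(e) for e in err_viol))    # large error: stronger forward signal
```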
22.3 New developments, unresolved challenges, and the limits of current theories and models The breadth and depth of models like TRACE and Shortlist B are substantial and remarkable. These models have helped guide theories of SWR, and inspired a wide variety of empirical and computational investigations that have enhanced our understanding of human SWR. They complement rather than fully supplant rule-based (mathematical and verbal) models like the NAM and the Cohort Model. In particular, the NAM’s DAS rule for defining neighbors continues to contribute to most studies of SWR, as controlling neighborhood characteristics has become the convention. In recent years, Vitevitch and his colleagues have added new force and scope to the NAM approach by applying the tools of network science (e.g., Menczer, Fortunato, and Davis, 2020; Newman, 2010) to graphs created by connecting DAS neighbors (pioneered by Vitevitch, 2008). Network science provides a toolkit for characterizing interconnected systems at gradient levels of analysis ranging from local (“microscale,” i.e., individual nodes, corresponding to words in this case) to subcomponents (“mesoscale”) to global (“macroscale,” characterizing the entire network based on statistics aggregated over all nodes12). Vitevitch and colleagues have shown that a graph-theoretic network of lexical forms based on the DAS rule immediately increases NAM’s scope. For example, when sets of words are selected to be matched on neighborhood but differ in clustering coefficient (the proportion of a word’s neighbors that are also neighbors of each other), sets with higher clustering coefficient are processed more slowly (Chan and Vitevitch, 2009). Similarly, Siew (2017) examined sets of words matched on DAS neighborhood 12 Readers
may be familiar with the notion of “small world networks” (Watts and Strogatz, 1998), where most nodes have fairly few connections, but enough nodes have many connections that the number of “hops” from node-to-node it takes to get from any node to any other node is small. This was first observed informally in networks of human acquaintances (Milgram (1967) famously asked people in the Midwest to pass along a letter (without an address) to a named individual on the East Coast by sending it to someone they knew who might know someone (who might know someone else) more likely to know the addressee; on average, letters that arrived had passed through six individuals—the basis for the possibly familiar concept that there are only “six degrees of separation” between randomly selected individuals). A surprising number of social, biological, and artificial systems have small world structure, though the tools of network science can classify many kinds of networks, and network type has significant implications (e.g., for a system where information is transmitted, how efficiently information can be transmitted and how robust the system may be to noise or damage; see Strogatz, 2003 for an accessible overview).
484 James S. Magnuson and Anne Marie Crinnion (“one-hop” neighbors in a graph, since they are directly connected) but varying in number of two-hop neighbors (words that are two links apart in the DAS network); words with more two-hop neighbors were processed more slowly. It is still early days with respect to applying this unconventional, innovative approach,13 but it has the potential to provide novel insights into SWR. For example, one possibility that could be explored is comparing similarity metrics by using them as the connecting rules for such networks. Note that there are many aspects of human speech that are undoubtedly crucial for human SWR that are outside the scope of current models and theories (for the most part, in the sense that we do not have theories that provide an integrated account of the phenomena reviewed above and those reviewed below, even if theoretical accounts have been offered for individual phenomena). Let’s consider a subset of these unresolved challenges. One challenge for theories of spoken word recognition is how to account for the fact that listeners process speech under a variety of contexts, often in environments with more variation than we typically allow in a laboratory (see Purse, Tamminga, and White, this volume), which also poses significant challenges for understanding the development of lexical knowledge (see Creel, this volume). Variation can occur at the level of the ambient acoustic environment (which may be noisy or echoey), details of the speech input (such as speaking rate, or number of talkers, and differences between them in size, age, sex, or dialect/accent). Mullennix, Pisoni, and Martin (1989) reported foundational studies illustrating the impact of talker variation. In a series of tasks, Mullennix and colleagues found that both the speed and accuracy of SWR are reduced under conditions of talker changes. Across perceptual identification and naming tasks, they found that the impact of talker variation was more consistent than the impact of word frequency or neighborhood density. Magnuson and Nusbaum (2007) provide a review of talker variation effects, and suggest ways theories of speech perception and SWR might accommodate talker variation; a key point is that features of various sorts are stipulated (given to models) rather than discovered. For example, phoneme boundaries are stipulated in Shortlist B, and phoneme templates are stipulated in TRACE. Many additional features would have to be stipulated in such models to account for natural variation in the speech signal; we return to this complication in the final section of the chapter. Another challenge is that speaking rate varies dynamically in everyday speech, and this alters the mapping from acoustics to perceptual categories. A classic example is that formant transition durations that correspond to /w/at a relatively fast speaking rate map to /b/at a relatively slower rate (e.g., Miller and Baer, 1983), suggesting listeners must have to accommodate this variation somehow. The impact of variation can be seen in SWR tasks. Francis and Nusbaum (1996) found an interaction between cognitive load and variation in speaking rate, and Nusbaum and Morin (1992) found a similar
13 For an application of graph theoretic networks to speech perception (phonological categories and talker variability), see Crinnion, Malmskog, and Toscano (2020).
Spoken word recognition 485 interaction between load and talker variation. McLennan and Luce (2005) reported similar interactions, where talker and rate variation interacted with task difficulty. Such results suggest that accommodating variation is an effort-and attention-demanding process. However, recall that one of the core simplifying assumptions of the SMP is that a speech perception module provides something like a stream of phonemes as input to SWR. So perhaps we can simply propose that prelexical normalization mechanisms of some sort take care of these sources of variation. An impediment to this view is that lexical and sentential information appear to provide crucial bases for accommodating variation. For example, a lexical mismatch might be a better indicator of a rate change than, e.g., vowel durations (hearing what seems to be a /w/in a lexical context that calls for a / b/ , e.g., / wol/[nonword *WOLE] might suggest the actual production is / bol/ [BOWL]). Similarly, context may indicate that a change in talker requires a change in acoustic-phonetic mapping (e.g., hearing THE CAP WAS ON HER HID might suggest a change in the /ɛ/-/ɪ / boundary).14 Indeed, over the last two decades, a growing literature has shown that listeners learn to adjust acoustic-phonetic mappings for specific talkers based on lexical context—a finding known as lexically mediated perceptual learning. Norris, McQueen, and Cutler (2003) presented listeners with an ambiguous fricative (midway between /f/and /s/). Critically, one group of listeners heard the ambiguous token on /f/-final words and heard unambiguous /s/-final words, while another group of listeners heard the ambiguous token on /s/-final words and heard unambiguous /f/-final words. Afterwards, when listeners categorized tokens from an /f/-/s/continuum, the listeners who heard the ambiguous token on /f/-final words categorized more tokens along the continuum as /f/than those who heard the ambiguous token on /s/-final words. Importantly, this shift does not occur with ambiguous tokens at the ends of nonwords. This foundational study has been extended by many groups, and in particular Kraljic and Samuel (2005, 2006, 2011; Samuel and Kraljic, 2009), who have examined how such perceptual learning varies across different classes of phonemes, whether it generalizes between talkers, and contexts that can block such learning (e.g., an image of a talker with a pen in her mouth suggests that abnormal patterns result from that motor difficulty rather than reflecting talker-specific patterns). We do not have space to review this literature in detail, but note that it suggests a tight linkage between signal-level details and lexical and sentential contexts. Another challenge is that in casual, fluent speech, there are rampant phonological processes that lead to significant deviations from canonical phonemic forms of spoken words. For example, Gow and McMurray (2007; see also Gaskell and Marslen-Wilson, 1996) studied coronal place assimilation, in which a coronal segment assimilates the place of a following non-coronal segment (e.g., clean bars might be pronounced as cleam 14 This impediment also applies to episodic or exemplar accounts that try to avoid normalization (e.g., Goldinger, 1998), or at least suggests a limitation for them (see Magnuson and Nusbaum, 2007, for a more detailed critique of episodic accounts).
486 James S. Magnuson and Anne Marie Crinnion bars). They used a visual world paradigm study to examine whether context can signal lexical form using coronal assimilation. Because coronals assimilate in English, hearing an assimilated segment followed by a non-coronal is more likely than an assimilated segment followed by a coronal (i.e., when bite guard is pronounced more like bike guard, it is more likely to be understood as bite guard in context because of assimilation; when bite damage is pronounced more like bike damage, it will be understood as bike damage, since if the true underlying form were bite, there would be no assimilation due to the following coronal). Gow and McMurray found that assimilated coronals do indeed facilitate activation for non-coronals (where assimilation occurs) and inhibit activation for coronals (where assimilation does not occur). They also found regressive effects; when there is lexical ambiguity in the assimilated segment (i.e., bike and bite, as compared to clean and cleam), the class of the following segment (i.e., whether it is a coronal or not) influences processing of the word-final context segment (bike or bite). Complicating the situation further, assimilations tend to be partial and graded (Gow, 2001, 2002, 2003). Clearly then, these phonological processes influence lexical processing in ways that may be difficult to reconcile within current models of SWR. There are also reductions that lead to more extreme deviations from canonical forms. Consider progressively more casual productions of “I am going to be there” in Table 22.3, which gives a sense of the challenge. Johnson (2004) found that more than 60% of words in a corpus of casual speech deviate from canonical forms, and that one or more segments are missing in nearly 30% of casually produced words. This presents an enormous challenge to models and theories of SWR under typical listening conditions (clear speech, low noise, low cognitive effort). Janse and Ernestus (2011; see also White, Mattys, and Wiget, 2012) report experiments that suggest continuous use of syntactic and semantic context is required to overcome the rampant ambiguity that results; indeed, transcription accuracy is very poor for words extracted from fluent speech and presented in isolation. This bumps up against another core simplifying assumption of the SMP, that form recognition can proceed modularly without constraints from higher levels of processing. Another consideration is the rich prosodic information in the speech signal. This includes the “melody” of speech (variation in pitch over time), but also timing and stress, among other factors (see Dahan, 2015, for a review). Most aspects of prosody
Table 22.3 Progressively more casual productions of "I am going to be there."

Gloss | IPA | Number of phonemes
I am going to be there | /ɑiæmgointubiðeiɹ/ | 14
I'm going to be there | /ɑimgointubiðeiɹ/ | 13
I'm gonna be there | /ɑimgʌnʌbiðeiɹ/ | 11
I'munna be there | /ɑimʌnʌbiðeiɹ/ | 10
Spoken word recognition 487 are absent from current models, with one salient exception: the metrical segmentation strategy (MSS) proposed by Cutler and Norris (1988), and integrated into the Shortlist A model by Norris, McQueen, and Cutler (1995). The MSS proposes that language- specific, probabilistic patterns of strong and weak syllable stress can constrain segmentation of fluent speech. Human behavior in several studies has been consistent with the MSS (e.g., detection of a word [e.g., MINT] embedded within a nonword is better when the word occurs at the onset of a nonword with strong-weak stress [MINTEF] rather than a strong-strong nonword [e.g., MINTAFE]). The MSS was implemented in Shortlist A simply by giving a boost to the match score of items that began with a strong syllable, and penalizing other items—in essence, creating features that code for primary stress on the first syllable of a word. While this did not make Shortlist A sensitive to stress in general, it substantially boosted the model’s ability to simulate human sensitivity to the MSS. Other aspects of prosody have not been incorporated into current models, as we will discuss in the next section. The challenges of variation, prosody, assimilation, and reduction are exacerbated by a variety of adverse listening conditions that are common outside the laboratory (see Mattys, Davis, Bradlow, and Scott, 2012, for a review). Noise, cognitive load, anxiety (Mattys, Seymour, Atwood, and Munafo, 2013), and chronic or age-related reductions in perceptual acuity (along with perceptual learning) are, for the most part, outside the scope of current models (noise is easy to apply, though more work is required to link noise in models to human speech recognition under noise, though Mirman et al. (2008) address attentional effects and Mirman et al. (2006) address perceptual learning with the TRACE model). Models must be extended to real-world conditions, learning over the lifespan, and age-related changes. We have avoided grappling with neural-level responses to SWR in this chapter so far. One might justify this on the common interpretation of Marr (1982) that algorithmic theories can be developed independently of implementational (neurobiological) details. Marr, however, also argued that complete understanding requires eventual linkage of computational, algorithmic, and implementational theories. In fact, as we consider interactions between speech perception and SWR, cognitive neuroscience findings may be crucial. For example, using an effective connectivity analysis, Gow and Olson (2016) found influences of brain regions associated with lexical processing areas on regions associated with lower-level acoustic information, suggesting that the neural representations of ambiguous speech sounds may be influenced by sentential and lexical context (i.e., lexical information may mediate perception, even at a neural level). Furthermore, Gwilliams, Linzen, Poeppel, and Marantz (2018) used MEG to demonstrate that acoustic-phonetic information can be maintained so that later information in the processing stream can influence perception of earlier sounds. Recent findings using EEG to measure the time course of speech perception have demonstrated that both semantic information and lexical content influence processing at lower levels (i.e., sublexical; Getz and Toscano, 2019; Noe and Fischer-Baum, 2020). 
Noe and Fischer-Baum (2020) showed that for an ambiguous sound between a voiced and unvoiced sound (i.e., a sound in between /t/ and /d/), participants' neural responses (as measured
488 James S. Magnuson and Anne Marie Crinnion by the N100 ERP component, a signal known to indicate the presence of voicing; see Toscano, McMurray, Dennhardt, and Luck, 2010) demonstrated lexical mediation: when participants heard a sound ambiguous between /t/and /d/at the beginning of a continuum where only one sound corresponded to a word (i.e., tape-*dape, vs. *tate- date), participants’ N100 responses were more like responses to less ambiguous /t/ sounds when hearing a tape/*dape continuum and more like /d/when hearing a *tate/ date continuum. These lines of work represent an important frontier between speech perception, SWR, and neural mechanisms that operate during speech processing.
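One strand of the work reviewed in this section, lexically guided perceptual learning, lends itself to a compact illustration before we move on. The sketch below is a deliberately minimal toy, not the Norris, McQueen, and Cutler (2003) design or any implemented model; the continuum values and learning rate are invented. Lexical context labels an ambiguous fricative, and the listener's /f/-/s/ category boundary drifts accordingly.

```python
def retune(boundary, exposures, rate=0.02):
    """Toy lexically guided perceptual learning. Stimuli lie on a 0 (clear /f/)
    to 1 (clear /s/) continuum; tokens above the boundary are heard as /s/.
    Each time lexical context labels an ambiguous token, the boundary drifts
    so that more of the continuum falls into the labeled category."""
    for _token, label in exposures:
        if label == "f":
            boundary = min(1.0, boundary + rate)
        else:
            boundary = max(0.0, boundary - rate)
    return boundary

ambiguous = 0.5
print(retune(0.5, [(ambiguous, "f")] * 10))   # rises to about 0.70: more /f/ responses
print(retune(0.5, [(ambiguous, "s")] * 10))   # falls to about 0.30: more /s/ responses
```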
22.3 Moving forward How might understanding of SWR move forward? It is possible that we are reaching the limits of current modeling frameworks. One of the major hurdles is relating inputs in models (ranging from phonemes in TISK (Hannagan et al., 2013) to “pseudo-spectral” features over time in TRACE to human diphone confusion probabilities in Shortlist B) to real speech. As we note above, adding additional aspects of the speech signal to such inputs mainly involves reducing those details to features of some sort and stipulating their presence or absence—stress as present or not, for example. One could try building in multiple forms of each word in a lexicon to account for reductions. One might build in conditional probabilities to account for phenomena like assimilation or compensation for coarticulation. However, such approaches are unlikely to extend current modeling approaches much further, as we are unlikely to be able to fully anticipate and enumerate the continuous, graded nature of variation in speech. Consider a study by Salverda, McQueen, and Dahan (2003; see also Davis, Marslen- Wilson, and Gaskell, 2002). They reasoned from longstanding findings that there are systematic relationships between prosodic features and word length (Lehiste, 1970; e.g., vowel durations in stressed syllables tend to be longer in monosyllabic than bisyllabic words) that listeners might be sufficiently sensitive to these probabilistic contingencies to constrain SWR. They recorded multiple talkers saying words like HAM, HAMMER, and HAMSTER. The initial vowel, on average, was about 20 msecs longer in monosyllabic words like HAM than in bisyllabic words, although duration distributions overlapped. They used a visual world paradigm and found that listeners fixated referents of words significantly faster when vowel duration relative to speaking rate was consistent with the number of syllables in the word (that is, durations consistent with HAM led to longer latencies to reach HAMSTER than durations consistent with a bisyllabic word). This illustrates how the simplifying assumptions underlying the SMP may actually complicate aspects of SWR, as this finding suggests that words contain probabilistically constraining information about word length could, for example, mitigate the embedding problem we mentioned earlier. If we assume a phonemic grain of input to SWR, we hide from ourselves subphonemic constraints that could actually simplify a core challenge of SWR (see Magnuson, 2008, for extended discussion). And, as we
Spoken word recognition 489 discussed already, a similar situation holds in downstream direction: recognition of many words in casual, fluent speech depends crucially on morphological, semantic, and syntactic context for disambiguation. The simplifying assumption in the SMP that the goal of SWR is recognition of lexical sound forms initially improves the tractability of SWR (by giving us a more manageable scope) but at the cost of walling off constraints that may greatly simplify the problem. These two problems—the proliferation of features that can approximate but not fully capture details of real speech, and the potential for simplifying assumptions to hide constraints—are daunting. Moving beyond them may require modeling frameworks that both use real speech as input and connect to higher levels of language comprehension. As we noted early in this chapter, the modal SMP view of SWR adopts simplifying hypotheses that we expect virtually all researchers in this field would agree are incorrect, but were crucial to pioneering work in SWR. Their utility is reductionist; they constrain theoretical problems to manageable scope. But as we also discussed, such simplifications have to be provisional; once enough smaller component problems are fairly well understood, it is time to develop larger-scope, integrative theories. We think we have reached this point in SWR. One potential way to do this would be to embrace developments in deep learning that have allowed complex networks that have evolved from the same origin as models like TRACE (i.e., the parallel distributed processing revolution; Rumelhart, McClelland et al., 1986; McClelland, Rumelhart et al., 1986) that currently power (fairly) robust automatic speech recognition on literally billions of smart phones worldwide every day (e.g., Hinton, Deng, Yu et al., 2012). A formidable challenge in exploring implications of deep learning approaches for cognitive theories is that the best-performing systems have many layers of multiple kinds and require complex, (arguably) biologically-implausible training regimes. This means understanding how and why the fully trained systems function as they do may present theoretical and technical challenges similar to the ones we face in trying to understand human SWR—that is, determining how those systems work may require substantial experimentation (via simulation with the systems themselves) or even the development of simpler models to isolate and identify key algorithmic details. The prospects for progress may thus appear rather dismal. Kietzmann, McClure, and Kriegeskorte (2019), however, make a compelling case that deep networks are not inscrutable black boxes, but can be understood at various levels of detail by careful analysis and by relating them to human behavior and neurobiology. Furthermore, it is not necessary to begin with the most complex models available for a domain. Magnuson, You, Luthra et al. (2020) set out to borrow the minimum possible from automatic speech recognition approaches in order to develop a neural network model of human speech recognition that would operate on real speech. They developed a two-layer recurrent network that maps spectral slice inputs from speech files to semantic output vectors via a hidden layer of long short-term memory (LSTM) nodes (Hochreiter and Schmidhuber, 1997). LSTMs have memory cells and gates that determine the relative weight of new and old information from long sequences of input.
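To give a concrete sense of this kind of architecture, here is a minimal sketch in PyTorch-style Python (with illustrative layer sizes and a made-up input; this is not the published EARSHOT implementation, its preprocessing, or its training procedure): spectral slices over time feed an LSTM layer whose states are mapped to a semantic output vector.

```python
import torch
import torch.nn as nn

class SpeechToSemantics(nn.Module):
    """Minimal EARSHOT-style sketch: spectral slices over time feed an LSTM
    hidden layer, which is read out as a (pseudo) semantic pattern at every
    time step so lexical activation can be tracked as the word unfolds.
    All layer sizes are illustrative assumptions."""
    def __init__(self, n_spectral=256, n_hidden=512, n_semantic=300):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_spectral, hidden_size=n_hidden,
                            batch_first=True)
        self.to_semantics = nn.Linear(n_hidden, n_semantic)

    def forward(self, spectrogram):
        # spectrogram: (batch, time_steps, n_spectral)
        hidden_states, _ = self.lstm(spectrogram)
        return torch.sigmoid(self.to_semantics(hidden_states))

model = SpeechToSemantics()
fake_input = torch.randn(1, 100, 256)      # 100 spectral slices of one word
semantic_over_time = model(fake_input)
print(semantic_over_time.shape)            # torch.Size([1, 100, 300])
```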
490 James S. Magnuson and Anne Marie Crinnion This network, dubbed EARSHOT (for Emulation of Auditory Recognition of Speech by Humans Over Time), achieves high accuracy on 1,000 words for nine training talkers, as well as moderate generalization to subsets of words excluded from training for each talker, and for a tenth talker completely excluded from training (with ten instances of the model trained, each excluding a different talker). It also simulates the time course of lexical activation and phonological competition over time, with results that resemble those observed in TRACE (see Figure 22.4). Intriguingly, although the model is not trained on phonetic targets, a structured phonetic code emerges in hidden unit responses that strongly resembles phonetically structured responses in human superior temporal cortex (Mesgarani, Cheung, Johnson, and Chang, 2014). Magnuson et al. point out that this similarity does not necessarily indicate similar functions or mechanisms in the model and cortex—simply that EARSHOT and human cortex are both sensitive to key information in the speech signal (i.e., phonetic patterns). However, EARSHOT’s hidden units also display complex spectrotemporal response patterns that could generate novel predictions for human cortical responses to speech. EARSHOT represents an initial step toward creating models complex and powerful enough to operate on real speech while remaining simple enough to guide theoretical understanding. It currently has significant limitations. While EARSHOT is exposed to surface variation of various sorts, whether and how it accommodates that variation has not yet been addressed. EARSHOT is also currently restricted to single words, but has the potential to take structured, continuous inputs, which could open the way to models that integrate levels from speech to sentence processing. As a learning model, it also has the potential to provide a new framework for studying the development of spoken language comprehension. It is possible that by incrementally increasing the scope, realism, and complexity of such models, the gap between current cognitive models of SWR (e.g., TRACE, Shortlist B) and deep learning networks used for robust, real-world automatic speech recognition can be bridged—and along the way, provide bases for deeper theoretical understanding of human spoken language comprehension. However, neural networks like EARSHOT may miss important aspects of human speech processing. While the success of deep neural networks for robust automatic speech recognition used by billions of smartphone users daily (e.g., Hinton et al., 2012) is reason for optimism, it could be that such networks are to human speech recognition as airplanes are to avian flight—a good solution, but not homologous to biology. An alternative may come from neural networks that are able to oscillate in response to the temporally varying speech signal (Giraud and Poeppel, 2012). Peele and Davis (2012) propose that if oscillating neural systems encoding speech entrain to dynamically- changing rhythms of speech, this may provide natural solutions to some of the challenges discussed above (rate variation may fall away if the entire system entrains to amplitude modulations of speech). Future progress will likely require deeper understanding of the neurobiological foundations of speech processing guided by innovative, neurally-realistic models.
Chapter 23
Word-meaning access
The one-to-many mapping from form to meaning
Jennifer M. Rodd
23.1 Words: the gateway to meaning The mental lexicon contains stored knowledge about familiar words: information about their spoken and written forms, their meanings, and their grammatical properties. A complete theoretical account of lexical processing must specify not only the nature of these different forms of stored, lexical knowledge but must also describe the computational processes by which these different forms of knowledge interact. Here I focus on the process by which stored knowledge about a word’s form (either orthographic or phonological) maps forward onto knowledge about its meaning in adult comprehenders. Wordforms are the gateway to meaning: once a reader or listener has identified the particular word that is present in their environment (see Magnuson and Crinnion, this volume; Treiman and Kessler, this volume), they are able to access a rich store of conceptual knowledge that may include perceptual, abstract, and episodic features of the relevant concept. These individual word-sized units of stored knowledge can be combined together in a hierarchical manner within sentences to communicate complex, novel ideas. This chapter will argue that this mapping from form to meaning is best characterized as a one-to-many mapping: most words in language are, to some extent, ambiguous, such that a single wordform maps forward onto multiple, alternative word meanings. The aims of this chapter are to (i) summarize the different forms of lexical-semantic ambiguity that lead to a one-to-many mapping from form to meaning and then (ii) describe what is known from experimental evidence about the constraints that support fluent, accurate word-meaning access by guiding comprehenders toward appropriate word meanings. These constraints fall into two broad categories. First, listeners and readers make sophisticated use of their recent and longer-term experience with word
492 Jennifer M. Rodd meanings, such that individual meanings become more readily available as a consequence of experience. Second, a wide range of contextual cues guide access toward those word meanings that are most likely in any given context. Finally, the chapter will review models that specify potential mechanisms by which these different constraints operate to ensure successful word-meaning access.
23.2 Structure of the mental lexicon A common approach in models of lexical processing is to assume that the lexicon contains abstract representations of individual words. For example, in Morton's highly influential Logogen model (1969), any given logogen becomes activated by information (perceptual or contextual) that is consistent with the properties of that specific word. Once a single logogen's activation level rises to its threshold level, the properties of that word (including its meaning) become available to the comprehender. This idea that individual words are represented by abstract lexical units is retained in subsequent models such as the Cohort model of spoken word recognition (Marslen-Wilson and Welsh, 1978), the Interactive Activation and Competition model of visual word recognition (McClelland and Rumelhart, 1981), and the Dual Route Cascaded model of reading aloud (Coltheart, Rastle, Perry, Langdon, and Ziegler, 2001). Thus, according to these models, lexical representations provide an abstract level of lexical knowledge that bridges the gap between form-based (phonological or orthographic) representations and knowledge about word meanings. Distributed connectionist models of reading provide an alternative framework for viewing this mapping from form to meaning (Plaut, McClelland, Seidenberg, and Patterson, 1996; Seidenberg and McClelland, 1989; see Figure 23.1). This "triangle" model of lexical processing proposes that three distinct sets of lexical knowledge about words (orthographic, phonological, and semantic) are each represented as distributed patterns of activity over three separate featural representations. Importantly, this theoretical approach does not assume that there are individual, localist units that each correspond to particular, familiar words; instead, knowledge about individual words is coded in a distributed manner as patterns of activation over an array of units. Thus, within this framework, there is a direct mapping from information about a word's form to its meaning without the need for an intervening abstract level of lexical knowledge. Different lexical tasks (such as reading aloud) are viewed as transformations among these representations. For example, Seidenberg and McClelland (1989) reported simulations that focus on reading aloud: the model is presented with an input orthographic pattern and is required to learn how to generate an appropriate output phonological pattern. Although these early connectionist models did not fully implement the word-meaning representations, and focused primarily on the mapping from orthography to phonology, this framework has provided the foundation for much of the computational modeling work on word-meaning access in the past 30 years (see Rodd, 2019, for a recent review).
[Figure 23.1 shows four labeled ovals: Context; Word Meaning: Semantics; Printed Wordform: Orthography; Spoken Wordform: Phonology.]
Figure 23.1 Seidenberg and McClelland’s (1989) lexical processing framework. Each oval represents a group of units, and each arrow represents a group of connections between these units. The unlabeled ovals represent “hidden” units that are included in the model to aid its learning of the complex mappings between different forms of lexical knowledge. Adapted from Seidenberg and McClelland (1989).
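To make the general logic of this framework concrete, the sketch below implements a deliberately small network of this kind. It is not a reimplementation of Seidenberg and McClelland’s model: the words, feature codings, network size, and training regime are all invented for illustration, and the code (Python with NumPy) simply shows that a single hidden layer trained by gradient descent can learn an essentially arbitrary mapping from orthographic input patterns to distributed semantic output patterns.

```python
# Minimal illustrative sketch (not the published model): a small feedforward
# network that learns an arbitrary mapping from orthographic feature vectors
# to distributed semantic feature vectors, in the spirit of the "triangle"
# framework. All items and feature codings here are invented.
import numpy as np

rng = np.random.default_rng(0)

words = ["dog", "dot", "cat", "cot"]
letters = sorted(set("".join(words)))

def orth_vector(word):
    """Position-specific letter code: three slots, one unit per letter identity."""
    v = np.zeros(3 * len(letters))
    for pos, ch in enumerate(word):
        v[pos * len(letters) + letters.index(ch)] = 1.0
    return v

# Arbitrary (randomly assigned) semantic patterns: orthographically similar
# words such as 'dog' and 'dot' receive unrelated meanings, mirroring the
# arbitrariness of the form-to-meaning mapping discussed in the text.
n_sem = 20
O = np.stack([orth_vector(w) for w in words])
S = rng.integers(0, 2, size=(len(words), n_sem)).astype(float)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_hidden = 15
W1 = rng.normal(0.0, 0.5, size=(O.shape[1], n_hidden))
W2 = rng.normal(0.0, 0.5, size=(n_hidden, n_sem))

for epoch in range(5000):                 # plain batch gradient descent
    h = sigmoid(O @ W1)                   # hidden-unit activations
    out = sigmoid(h @ W2)                 # semantic output activations
    d_out = (out - S) * out * (1 - out)   # squared-error gradient at output
    d_h = (d_out @ W2.T) * h * (1 - h)    # back-propagated to hidden layer
    W2 -= 0.3 * h.T @ d_out
    W1 -= 0.3 * O.T @ d_h

pred = sigmoid(sigmoid(O @ W1) @ W2)
for word, p, target in zip(words, pred, S):
    print(word, "mean absolute error:", round(float(np.abs(p - target).mean()), 3))
```

With these settings the mean absolute error for each word typically falls well below 0.1, even though orthographically similar words were assigned unrelated semantic patterns.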
This computational modeling work highlighted the arbitrary nature of the mapping from form to meaning. For mono-morphemic words such as ‘dog’ and ‘cat,’ the individual constituent letters/sounds do not provide strong statistical cues as to the meanings of these words: words with similar sounds do not always have similar meanings. This arbitrariness in the form-to-meaning mapping is in stark contrast to the quasi-regular mapping between words’ printed forms and spoken forms in English (and many other alphabetic languages): when presented with a novel wordform such as ‘wug,’ readers will have highly consistent intuitions about how this word should be pronounced, but do not have similar certainty about its likely meaning (Castles, Rastle, and Nation, 2018). Thus, all theoretical accounts of how the form-to-meaning mapping is computed in human comprehenders must provide a mechanism that is capable of learning such an arbitrary mapping. But subsequent computational modeling work that focused in more detail on the form-to-meaning mapping (Armstrong and Plaut, 2008; Borowsky and Masson, 1996; Joordens and Besner, 1994; Kawamoto, Farrar, and Kello, 1994; Rodd, Gaskell, and Marslen-Wilson, 2004) highlighted a second, more problematic characteristic of this mapping: knowing exactly which wordform has been seen or heard (i.e., the word was definitely ‘bark’ and not ‘park’) does not uniquely identify a single word meaning. For example, the word ‘bark’ might be used to refer to the outer covering of a tree but might also be used to refer to the sound made by a dog. This high level of ambiguity poses important challenges for theories of lexical processing: knowing which form of a word has been encountered does not, in isolation, provide sufficient information to uniquely identify the word meaning that was intended
by the speaker or writer. Comprehenders must therefore use additional constraining information to achieve successful word-meaning access. This computational challenge arises both for theories of lexical processing in which word meanings are being accessed directly from information about word forms (e.g., Seidenberg and McClelland, 1989) and those that assume an intervening level of abstract lexical representation (e.g., Morton, 1969).
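The nature of this challenge can be shown with a toy calculation. The semantic feature vectors below are invented solely for this example; the point is simply that because the two meanings of ‘bark’ share essentially no features, a representation that sits midway between them matches neither stored meaning and leaves every distinguishing feature only half-active.

```python
# Toy illustration (invented binary semantic features): why identifying the
# wordform 'bark' does not identify its meaning. A pattern midway between the
# two stored meanings matches neither and leaves every feature half-active.
import numpy as np

features  = ["animate", "makes-sound", "domestic", "plant", "outer-layer", "woody"]
bark_dog  = np.array([1, 1, 1, 0, 0, 0], dtype=float)   # sound made by a dog
bark_tree = np.array([0, 0, 0, 1, 1, 1], dtype=float)   # outer covering of a tree

midway = 0.5 * (bark_dog + bark_tree)     # retrieving "a bit of both" meanings

def mismatch(pattern, target):
    """Summed absolute deviation from a stored meaning (0 = perfect match)."""
    return float(np.abs(pattern - target).sum())

print("midway pattern:", dict(zip(features, midway)))
print("mismatch with the dog-sound meaning:    ", mismatch(midway, bark_dog))
print("mismatch with the tree-covering meaning:", mismatch(midway, bark_tree))
```

On these toy numbers the intermediate pattern is equally far (3.0) from both meanings, so additional information is needed before the system can commit to one interpretation.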
23.3 Forms of lexical-semantic ambiguity 23.3.1 Homonymy The one-to-many mapping from form to meaning can arise from various forms of lexical-semantic ambiguity. The most salient form of ambiguity occurs for those wordforms that have multiple, different meanings that are not related in either their meaning or their etymological history. For example, it is a historical accident that English happens to use the same wordform ‘bark’ to refer both to the sound made by a dog and to the outer covering of a tree. These unrelated meanings are usually listed as separate entries in dictionaries and are referred to as homonyms, but they can also be referred to as homographs (because they share their spelling) or homophones (because they share their pronunciation). True homonyms, where there is no relationship at all between the different meanings, are relatively rare—Rodd, Gaskell, and Marslen-Wilson (2002) estimated that only 7% of common English words are homonyms. English also contains homophones that only share their pronunciation (e.g., ‘waist/waste,’ ‘some/sum,’ ‘for/four’) as well as homographs that only share their spelling (e.g., ‘lead’ and ‘close’). These words, which are also accidents of language evolution (and the imperfect relationship between orthography and phonology in English), are ambiguous only in spoken or written language respectively. (This chapter focuses entirely on within-language ambiguity; see Kroll, Bice, Botezatu and Zirnstein, this volume, for a review of the additional ambiguity that can arise for multilingual individuals.) Homonyms present a particular challenge because the ambiguity that is inherent in these words must always be fully resolved for the overall sentence to be understood. The semantic features associated with the different meanings are completely non-overlapping and mutually exclusive. Therefore, if a comprehender retrieves a mixture of these two meanings (sometimes known as a blend state; Rodd, Gaskell, and Marslen-Wilson, 2004), this will not provide them with a useful, meaningful representation—the comprehender must therefore, at some point, fully commit to one or other of the meanings for a single coherent representation to be built. This need to disambiguate homonyms places significant additional processing demands on the language comprehension system that have been observed using a wide range of behavioral and
neuroimaging methods (e.g., Kadem, Herrmann, Rodd and Johnsrude, 2020; Rodd, Davis and Johnsrude, 2010; see Johnsrude and Rodd, 2016; Vitello and Rodd, 2015 for reviews).
23.3.2 Polysemy A more common form of ambiguity—usually referred to as polysemy—occurs when a single wordform can refer to multiple, semantically related word senses. These word senses often come in large clusters. For example, the word ‘run’ has numerous different dictionary definitions (e.g., “the athlete runs in the race,” “the paint runs down the wall,” “the river runs through the valley,” “the politician runs in the election,” “the program runs on the computer,” etc.). These different senses often overlap somewhat in their meanings, although it can often be difficult to quantify precisely the nature of these relationships. Indeed, it is likely that the senses of words that are listed in any given dictionary do not fully capture the full extent of the flexibility of such words, which can often be used in a wide range of contexts to express a range of subtly different concepts. Therefore, although the presence of polysemy in natural language often increases the processing demands by making the form-to-meaning mapping more challenging, it adds considerably to the communicative power of language by allowing a finite set of familiar wordforms to convey a richer array of word meanings than if each wordform mapped uniquely onto a single meaning. Importantly, the capacity of speakers to productively extend the meaning of familiar wordforms to convey novel, related conceptual ideas provides a key mechanism for language change such that language continues to convey those meanings that are needed to support communication. For example, ‘tweet,’ ‘tablet,’ ‘post’ and many more words have acquired novel technological meanings that share some meaning overlap with their previous non-technological meanings (Hulme, Barsky, and Rodd, 2019; Maciejewski, Rodd, Mon-Williams, and Klepousniotou, 2018; Rodd, Berriman, Landau et al., 2012). This creative aspect of polysemy is particularly apparent for “regular polysemy”: those clusters of words that have systematic patterns of senses. For example, animal names can be productively extended to refer to the meat that comes from that animal (e.g., chicken, lamb, ostrich, etc.) and the names of substances can often also refer to objects made from that material (e.g., glass, iron, paper, etc.; Copestake and Briscoe, 1995; Srinivasan and Rabagliati, 2015). There is evidence from historical analyses of language use that new word senses emerge in a relatively regular manner, predominantly through a process of “nearest-neighbor chaining” whereby each new sense is highly similar in meaning to an existing sense in the language (Ramiro, Srinivasan, Malt, and Xu, 2018). Polysemy differs from homonymy in several important respects. Most obviously, there is a far greater level of semantic (or associative) overlap between the different meanings. From a processing point of view, the level of competition that results from polysemy can be somewhat reduced compared with homonymy, and the level of competition can vary according to the current task demands (Rodd et al., 2012, 2002).
For example, the different senses of ‘run’ are likely to share significant aspects of their meanings so that the only competition that occurs during comprehension will be between those non-shared elements of meaning. In addition, in many task situations the reader may not fully disambiguate between the different senses of a polysemous word—it may be sufficient for them to activate some of the shared/core semantic features of the word. For example, readers can make a lexical decision to the word ‘run’ or decide that it is semantically related to a probe word ‘move’ without necessarily fully disambiguating its word meaning to one precise word sense. In addition, studies of the syntactic aspects of sentence processing have highlighted the extent to which readers often engage in “good enough” processing in which their final interpretations of sentences containing syntactic ambiguities may often be incomplete or inaccurate (Ferreira, Bailey, and Ferraro, 2002). (See Armstrong and Plaut, 2016; Rodd et al., 2002 for extensive discussion about how the level of semantic disambiguation of polysemous words can vary as a function of task constraints.) It is important to note that in practice it can be extremely difficult to draw a clear categorical distinction between polysemy and homonymy: it is not always clear exactly how related two word meanings need to be to count as polysemous—lexicographers do not always agree on how word meanings/senses should be classified (Rodd et al., 2002). This is important because some theoretical accounts of how word meanings are represented assume a categorical classification in which there is a clear qualitative difference in how these two forms of ambiguity are represented: homonyms correspond to two distinct lexical items with separate representations in the lexicon, while the related senses of polysemous words all correspond to a single lexical entry (Carston, 2019). In contrast, other approaches allow the distinction between related word senses and unrelated word meanings to be viewed as a graded continuum. For example, Rodd (2020) proposed an account of word-meaning access (based on earlier work by Rodd et al., 2004) in which word senses/meanings can vary continuously in the strength of their semantic relationship without the need to draw a clear division between these two forms of ambiguity. This latter approach can accommodate findings that suggest that there is considerable variability in relatedness within the class of polysemous words (Klein and Murphy, 2001). The impact of these forms of lexical-semantic ambiguity has typically been studied separately in studies of reading and speech comprehension. This chapter will summarize experimental findings from studies of both reading and listening, but it is important to keep in mind that the cognitive mechanisms may differ somewhat for these two modalities due to the important differences in the formats of these two forms of sensory information (e.g., the highly variable and transient nature of the speech signal). (See Rodd and Davis, 2017 for a systematic overview of how speech differs from text and the impact of this on studies of language comprehension.) Perhaps surprisingly, there are relatively few studies that directly compare the impact of ambiguity across modalities (Blott, 2020), and this remains a key topic for future studies.
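The graded view can be stated quite simply in computational terms: rather than classifying a pair of meanings as related or unrelated, one computes a continuous relatedness score over their semantic representations. The sketch below uses invented feature vectors purely to illustrate the computation; in practice such vectors might come from corpus-derived embeddings or from human relatedness ratings.

```python
# Illustrative only: relatedness between the meanings/senses of a wordform as
# a continuous score (cosine similarity) over invented semantic features,
# rather than a binary homonym/polyseme classification.
import numpy as np

def relatedness(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

meanings = {
    ("bark", "dog sound"):      [1, 1, 0, 0, 0, 0, 0, 0],
    ("bark", "tree covering"):  [0, 0, 1, 1, 0, 0, 0, 0],
    ("run",  "athlete runs"):   [0, 0, 0, 0, 1, 1, 1, 0],
    ("run",  "program runs"):   [0, 0, 0, 0, 1, 1, 0, 1],
}

pairs = [
    (("bark", "dog sound"), ("bark", "tree covering")),   # homonym-like
    (("run", "athlete runs"), ("run", "program runs")),   # polyseme-like
]
for m1, m2 in pairs:
    score = relatedness(meanings[m1], meanings[m2])
    print(f"{m1[0]}: '{m1[1]}' vs. '{m2[1]}' -> relatedness {score:.2f}")
```

With these toy vectors the two meanings of ‘bark’ score 0.00 and the two senses of ‘run’ score 0.67; real lexicons populate the whole range in between, which is the variability that graded accounts are designed to capture.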
Finally, it is important to note that although lexical-semantic ambiguity is usually viewed in a negative light, as a source of additional potential competition and confusion, it has been argued that, under some conditions, ambiguity is actually a functional
property of language that can improve communicative efficiency (Piantadosi, Tily, and Gibson, 2012). Specifically, Piantadosi et al. (2012) have argued that if words are used within rich, informative contexts that provide sufficient cues to allow them to be accurately disambiguated, then re-using a relatively small set of word forms to convey a larger set of word meanings can reduce the processing demands relative to an alternative system that uses a much larger set of word forms (Piantadosi et al., 2012).
23.3.3 Contextual modulation of unambiguous word meanings Even for relatively unambiguous wordforms that are used by all speakers of a given dialect to refer to a single, clearly defined specific object, action, or other conceptual entity, the mapping from form to meaning is not entirely straightforward. For most highly familiar words, individuals will have a large body of stored conceptual knowledge, and the subset of this knowledge that is most relevant can vary according to the specific context in which the word has been used (Yee and Thompson-Schill, 2016). Take for example the word ‘lemon.’ Even if we set aside its lower frequency senses such as referring to a pale yellow color or describing something that is unsatisfactory or defective (e.g., ‘it turned out to be a lemon’) and focus instead on its dominant interpretation as a type of fruit, then the mapping to meaning is somewhat flexible. Despite the likely widespread agreement among speakers of English about which objects in the world can (and cannot) accurately be referred to by the label ‘lemon,’ once this word is placed within a sentence context it is advantageous for the listener to perform some form of context-driven semantic disambiguation. For example, Tabossi (1988b) used a cross-modal priming paradigm to demonstrate that when relatively unambiguous words like ‘lemon’ are placed within a sentence context, properties like ‘sourness’ are more readily available at the offset of the target word if the preceding context guides the listener toward this particular semantic feature (e.g., ‘The little boy shuddered after eating a slice of lemon’) compared with sentences where other features are more contextually relevant (e.g., ‘The little boy was playing on the floor rolling a lemon’). This demonstrates that context can very rapidly modulate the aspects of a word’s meaning that are accessed/maintained in any given sentence context. (See Mirković and Altmann, 2019 for a more recent demonstration of how visual context can influence semantic interpretation of unambiguous nouns.) Taken together, these different forms of lexical-semantic ambiguity require that any theoretical account of word-meaning access must specify how readers and listeners are able to select those aspects of lexical meaning most relevant to the current context. Successful comprehension relies on ensuring that the most relevant aspects of meaning are available for integration into a representation of the ongoing discourse. This process of lexical-semantic disambiguation is most salient for homonyms such as ‘bark,’ but is a ubiquitous aspect of lexical processing that is constantly required for successful comprehension.
The following sections will review what is currently known about how comprehenders deal with this need to select contextually relevant aspects of word meanings. The vast majority of this research has focused on homonym processing: the minimal overlap between the alternative interpretations of these wordforms makes it more straightforward to measure the extent to which meaning(s) have been retrieved by participants in any given experimental scenario. But it is important to keep in mind that lexical-semantic disambiguation is also critical for efficient comprehension of polysemes (e.g., ‘run’) as well as for relatively unambiguous words. Successful comprehension requires that listeners and readers select from their vast array of stored conceptual knowledge the specific information that is most relevant to the current contextual situation (Yee and Thompson-Schill, 2016).
23.4 Cues that guide lexical-semantic disambiguation 23.4.1 Recent and longer-term experience with word meanings The one-to-many mapping from form to meaning entails that perceptual information alone is rarely sufficient for a reader or listener to be certain about the meaning of each word they encounter. They must use other cues to help them settle on one likely interpretation for each word. One important cue that guides word-meaning access is distributional information about the frequency with which different meanings/senses have previously been encountered: meanings that are more frequent in language (usually referred to as dominant meanings) tend to be more readily available (e.g., Hogaboam and Perfetti, 1975). Evidence from a very wide range of behavioral and neuroimaging paradigms has confirmed that more frequent (dominant) word meanings experience a processing benefit that makes them more easily accessible, compared with less frequent (subordinate) meanings (see Vitello and Rodd, 2015 for review). The clearest evidence that word-meaning dominance strongly modulates word-meaning access comes from the large body of research in which adult readers’ eye movements are tracked during reading. When readers are presented with text (or any other visual stimulus), they will respond by making a series of discrete eye movements called saccades. Between these saccades, their eyes remain relatively still for short periods (usually for 200–300 msec.) as they fixate on a specific area of the text and gain detailed visual information about those letters that are currently in the center of their gaze. Importantly, as text becomes conceptually more difficult, readers tend to (i) lengthen the duration of their individual fixations, (ii) reduce the distance (i.e., number of letters) that they jump forward on each saccade, and (iii) make more backward regressions to reread earlier parts of the text (Rayner, 1998). Eye-tracking studies of
lexical ambiguity have consistently shown that when highly dominant word meanings, such as the “writing implement” meaning of ‘pen,’ are preceded by a supportive sentence context, then these meanings can be easily accessed with no measurable delay in processing compared with a low-ambiguity control word. In contrast, when the preceding context supports a highly subordinate word meaning, such as the “animal enclosure” meaning of ‘pen,’ then there is a delay in reading times that reflects the additional processing demands associated with accessing this less frequent meaning. (See Duffy, Kambe, and Rayner, 2001 for a comprehensive review of findings concerning how ambiguous words are processed within sentence contexts.) Although eye-tracking studies of adult reading have been highly influential in driving theoretical accounts of word-meaning access that can account for many aspects of the data seen in eye-tracking studies of lexical ambiguity (e.g., The Reordered Access Model: Duffy et al., 2001), one limitation is that they have typically operationalized “dominance” as a single unitary property of word meanings that is assumed to be relatively consistent across participants: dominance is usually estimated by averaging the responses to these words when they are presented in isolation to a different group of participants (e.g., Twilley, Dixon, Taylor, and Clark, 1994). Underlying this approach is the assumption that individuals within a particular language community will all have had relatively similar experiences with word meanings, such that a single estimate of meaning dominance can reliably predict behavior across all participants in any given experiment. More recently, however, researchers have attempted to further develop this approach to explore in more detail how the availability of word meanings is influenced by individuals’ idiosyncratic experiences with particular word meanings. Rodd, Cai, Betts, et al. (2016) investigated word-meaning preferences in recreational rowers whose language experience differs somewhat from the rest of the population because they regularly encounter rowing-related terminology: they will frequently encounter rowing-specific meanings for highly familiar words such as ‘catch,’ ‘square,’ and ‘feather,’ which refer (for example) to a particular position of the oar or phase of the rowing stroke. Rodd et al. (2016) found that the relative availability of these rowing-specific meanings (compared with the non-rowing, dominant meaning) increased with participants’ number of years of rowing experience: the availability of word meanings increases gradually in response to an individual’s idiosyncratic linguistic environment over relatively long periods of time. For example, an individual with ten years’ rowing experience will (on average) have different word-meaning preferences to an individual with two years’ experience, even for words for which both of them are highly familiar with the rowing-related meaning. This indicates that individuals are effectively tracking the prevalence of word meanings over relatively long time periods. Similarly, Wiley, George, and Rayner (2018) showed that (compared with non-experts) individuals with specific expertise in baseball had significantly more difficulty processing words that had a baseball-related meaning (e.g., ‘bat’) when these words were presented in sentence contexts that were strongly biased toward the non-baseball meaning (e.g., ‘Monica had a great fear of things flying around her head.
She looked
for the BATS that lived in the shed’). These data indicate that the availability of the baseball-related meanings had been boosted in the baseball experts as a consequence of their increased experience with these word meanings. These studies emphasize the inevitable differences in how different people represent word meanings—every individual has their own idiosyncratic linguistic history that will shape the distributional statistics with which they encounter the different meanings of words. These studies demonstrate that individual differences in long-term experience directly influence the ease with which different word meanings come to mind. These studies also remind us of the flexible functionality of the language comprehension system. These incremental adjustments of lexical-semantic knowledge in response to a changing linguistic environment will improve future comprehension by boosting the availability of those meanings that comprehenders are likely to encounter in the future (Rodd et al., 2016). These findings move us away from a view of the mental lexicon as being relatively fixed during adulthood and suggest that the plasticity that has been systematically explored in studies of children’s comprehension abilities extends into adulthood. Future studies are needed to explore the extent to which the levels of plasticity in the mental lexicon change (or not) across the lifespan. In addition to these long-term effects of experience, experimental evidence has shown that recent encounters with particular, subordinate interpretations of ambiguous words can substantially boost the availability of these primed meanings (Betts, Gilbert, Cai, Okedara, and Rodd, 2018; Gaskell, Cairney, and Rodd, 2019; Gilbert, Davis, Gaskell, and Rodd, 2018; Gilbert, Davis, Gaskell, and Rodd, 2021; Rodd et al., 2016; Rodd, Lopez Cutrin, Kirsch, Millar, and Davis, 2013). In word-meaning priming experiments, participants encounter a subordinate meaning of an ambiguous word within a strongly disambiguating sentence context (e.g., ‘The farmer moved the sheep into the PEN’). Then, after a delay (typically 20–40 minutes) the availability of the word’s different meanings is assessed using tasks such as word association or semantic relatedness judgment (Betts et al., 2018; Gaskell et al., 2019; Gilbert et al., 2018, 2021; Rodd et al., 2013, 2016). Results from these paradigms have consistently shown that the availability of the primed, subordinate word meaning is higher than in a control, unprimed condition. Importantly, this boost in word-meaning accessibility requires that the target ambiguous wordform is itself present in the prime sentence. For example, the sentence ‘The man accepted the post in the accountancy firm’ will boost the availability of the occupation-related meaning of ‘post’ after 20–40 minutes, whereas in a semantic priming control condition in which the critical prime word is replaced by a near synonym (e.g., ‘The man accepted the job in the accountancy firm’), no such boost is observed at this delay (Rodd et al., 2013, 2016). Thus, it is participants’ specific experience with the particular ambiguous word that has altered how they are likely to interpret this word when it is subsequently encountered.
Word-meaning priming has been observed in carefully controlled lab-based conditions, as well as in more naturalistic conditions where the primes were presented within short vignettes within a radio program, and priming was measured via a subsequent web-based task (Rodd et al., 2016).
As with the longer-term effects of experience with word meanings that arise incrementally over periods of years, these shorter-term priming effects demonstrate the flexible functionality of the language comprehension system: participants are specifically boosting the availability of those meanings that are particularly likely to occur (again) in the near future within a particular coherent conversation or text (Rodd et al., 2016). Taken together, these findings emphasize the role of prior experience with ambiguous words in lexical-semantic disambiguation, helping comprehenders to rapidly and effectively settle on a single, likely meaning of each word that they encounter. Interestingly, a set of studies that investigated the extent to which experience with a word in one modality (e.g., speech) transfers to another modality (e.g., reading) found that word-meaning priming effects were of a similar magnitude for within-modality priming and between-modality priming, suggesting that comprehenders are able to easily transfer information about word meanings across modalities (Gilbert et al., 2018). In addition, evidence from studies that varied the structure of the prime sentences has shown that this form of lexical-semantic retuning is driven by participants’ final interpretation of the ambiguous words regardless of initial meaning (mis)activation or (mis)interpretation (Gilbert et al., 2021). Finally, studies of Dutch-English bilinguals suggest that this form of lexical learning extends to the interpretation of interlingual homographs: wordforms that exist in both of a bilingual speaker’s languages (Poort and Rodd, 2019; Poort, Warren and Rodd, 2016). It is currently unclear whether this ability to modulate word-meaning access on the basis of recent experience reflects changes within the mental lexicon or whether it reflects a contribution of participants’ short-term memory for the content of recently encountered sentences (see Gaskell et al., 2019 for further discussion).
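The dependent measure in these priming studies is typically the proportion of responses consistent with the primed (subordinate) meaning, compared across primed and unprimed conditions. The sketch below illustrates this kind of scoring on invented data; the trial records, the dominance coding, and the numbers are all hypothetical and are not taken from any of the studies cited above.

```python
# Illustrative scoring of a word-meaning priming experiment (all data are
# invented). Each trial records whether a participant's word-association
# response to an ambiguous probe (e.g., 'pen') reflected the subordinate
# ("animal enclosure") or dominant ("writing implement") meaning, and whether
# that meaning had been primed ~20-40 minutes earlier in a sentence context.
from collections import defaultdict

# (probe_word, condition, response_meaning) -- hypothetical trials
trials = [
    ("pen", "primed",   "enclosure"), ("pen", "primed",   "writing"),
    ("pen", "primed",   "enclosure"), ("pen", "primed",   "enclosure"),
    ("pen", "unprimed", "writing"),   ("pen", "unprimed", "writing"),
    ("pen", "unprimed", "enclosure"), ("pen", "unprimed", "writing"),
]
SUBORDINATE = {"pen": "enclosure"}   # assumed meaning-dominance coding

counts = defaultdict(lambda: [0, 0])          # condition -> [subordinate, total]
for word, condition, meaning in trials:
    counts[condition][1] += 1
    if meaning == SUBORDINATE[word]:
        counts[condition][0] += 1

rates = {c: sub / total for c, (sub, total) in counts.items()}
for condition, rate in rates.items():
    print(f"{condition:>8}: {rate:.2f} subordinate-meaning responses")
print(f"priming effect: {rates['primed'] - rates['unprimed']:+.2f}")
```

On these made-up trials the primed condition yields 0.75 subordinate-meaning responses against 0.25 unprimed, a difference of +0.50; the numbers here are arbitrary, but the logic of the measure is the same.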
23.4.2 Contextual cues The previous section argues for the importance of general, distributional information in guiding word-meaning access toward those word meanings that, based on prior experience, are most likely to be encountered in any future context. But such information about the a priori likelihood of any given meaning can, of course, never guarantee accurate comprehension. Comprehenders must make use of additional, context-specific cues to guide them toward the word meaning that is most compatible with their current situational context. For example, the wordform ‘bark’ should be interpreted very differently when it is preceded by the words ‘dog’ or ‘tree.’ As with dominance effects, researchers have used a wide range of experimental approaches to assess how and when different types of context guide word-meaning disambiguation (Rodd, 2019). One key theoretical debate has been whether (high-level) information about sentence meaning can directly influence (lower-level) word-meaning access processes. At one end of the theoretical spectrum are “exhaustive access” models (Onifer and Swinney, 1981; Seidenberg, Tanenhaus, Leiman, and Bienkowski, 1982; Swinney, 1979). These models, which have been described as “autonomous”
or “modular” (Fodor, 1983) argue for a relatively strict separation between these two aspects of linguistic knowledge, such that all familiar meanings of ambiguous wordforms are automatically accessed, regardless of sentence context. Under this view, context only influences the subsequent meaning selection/integration processes that occur after all potential meanings have been accessed. This “exhaustive access” view was supported by early cross-modal semantic priming studies in which participants made responses to visual probe words that followed the ambiguous word (Seidenberg et al., 1982; Swinney, 1979). These studies found that when the probe words were presented immediately at the offset of the ambiguous word, responses were faster for probes that were semantically related to either meaning than for unrelated probes. This finding is usually interpreted as evidence that both meanings had been automatically activated in response to the ambiguous word. However, subsequent studies found that if the ambiguous word was preceded by a very strong constraint toward the dominant meaning, then only the contextually appropriate meaning was primed (Tabossi, 1988a; Tabossi and Zardon, 1993). This evidence that context can modulate the initial access of word meanings is consistent with evidence from eye-tracking studies showing that the time taken to read ambiguous words is strongly modulated by the nature of the preceding context. (See Duffy et al., 2001 for a comprehensive review of findings concerning how ambiguous words are processed within sentence contexts.) One specific area of debate within the field has concerned the “subordinate bias effect”: readers show consistent processing delays when ambiguous words that have a strongly dominant meaning (e.g., ‘pen’) are preceded by a sentence context that is consistent with an alternative, less frequent meaning (e.g., ‘Because it was too small to hold all the new animals, the old pen was replaced’; Duffy, Morris, and Rayner, 1988). This effect is usually interpreted as showing a limit on how strongly context can modulate word-meaning access: prior context cannot prevent readers from accessing strongly preferred, dominant meanings (Kellas and Vu, 1999; Pacht and Rayner, 1993; Rayner, Binder, and Duffy, 1999; Rayner, Pacht, and Duffy, 1994; Sereno, 1995). However, more recently Colbert-Getz and Cook (2013) have shown that a five-sentence strongly constraining context can eliminate the subordinate bias effect. In addition, Leinenger and Rayner (2013) have shown that the subordinate bias effect is reduced when the preceding, biasing context contains the ambiguous word itself. Taken together, these results confirm the role of sentence context in rapidly guiding readers toward those word meanings that are most likely in a specific sentence context, but indicate that this constraint operates in conjunction with a reader’s a priori knowledge about the likelihood of the different meanings. This view is exemplified in the influential Reordered Access Model (Duffy et al., 2001; Duffy, Morris, and Rayner, 1988), which states that (i) whenever an ambiguous word is encountered all of its alternative, familiar meanings begin to be automatically activated in parallel, but (ii) the rate at which these meanings become active is modulated by both sentence context and word-meaning frequency.
Under this view, those word meanings that are highly frequent or very strongly consistent with the preceding context become most quickly available for integration into
the ongoing discourse. In situations where only one meaning becomes rapidly available, this meaning can be easily integrated into the higher-level representation of the ongoing discourse without significant disruption to processing (compared with unambiguous words). In contrast, when multiple meanings become available at the same time, the comprehender must engage in a more time-consuming selection process to determine which meaning is most likely to be intended by the speaker/writer. More recently, Rodd (2020) has proposed an alternative “semantic settling” framework to describe how the immediate sentence context (and other distributional information) influences word-meaning access. This model retains many of the key aspects of the Reordered Access Model, but takes a distributed connectionist perspective in which word-meaning access is viewed as a settling process: for each word that is encountered the activation state within a high-dimensional lexical-semantic space settles over time into a stable pattern of semantic activation that corresponds to a known, familiar word meaning (Rodd et al., 2004). According to this perspective, multiple cues can influence the settling process by which the system resolves on one of these familiar meanings: these cues include immediate sentence context as well as longer-term distributional knowledge about the likelihoods of different meanings. Critically, this view places learning at its center: this lexical-semantic space is constantly being re-shaped throughout an individual’s lifespan in response to experience with different word meanings. This re-tuning of lexical-semantic knowledge acts to maximize the efficiency of word-meaning access in the individual’s current linguistic environment. This view highlights individual differences in lexical-semantic knowledge: each person’s lexicon is constantly being flexibly re-structured by their specific, idiosyncratic linguistic experiences. Although these different theoretical perspectives have made clearly stated claims about the mechanisms by which linguistic context can influence word-meaning access, these models are currently somewhat underspecified with respect to claims about exactly what forms of contextual information can influence word-meaning access, and how different (and potentially conflicting) cues might be combined during word-meaning access. Indeed, Rodd (2020) suggests that any cue that is a reliable predictor of which meaning is “correct” is likely to be utilized by comprehenders. Under this view, there are no structural barriers that can prevent specific (higher-level) forms of knowledge from acting on word-meaning access. Consistent with this view is the evidence that lexical-semantic access is guided by a wide range of contextual cues. At one end of the spectrum, there is evidence that relatively low-level lexical information about the extent to which word (meanings) tend to co-occur in language can guide disambiguation. Witzel and Forster (2015) demonstrated that the interpretation of ambiguous words such as ‘bat’ is strongly influenced by preceding words such as ‘umpire’ that share a strong co-occurrence relationship with the target word (i.e., the baseball-related meaning of ‘bat’ tends to co-occur with the same other words as ‘umpire’ because they often occur in similar contexts).
Importantly, in this study participants tended to interpret the ambiguous words as being consistent with the meaning that was related to this contextual cue even when the overall
sentence context did not plausibly support that particular interpretation (e.g., ‘The umpire tried to swallow the bat but its wings got stuck in his throat’). On the basis of this evidence, Witzel and Forster (2015) propose that word-meaning selection is driven by a “fast-acting, low-level heuristic based on intra-lexical connections.” However, in addition to this evidence that sentence context guides interpretation via relatively simple information about the different co-occurrence statistics of a word’s different meanings, there is also evidence that higher-level cues about the global interpretation of an ongoing discourse can guide interpretation. For example, Kambe, Rayner, and Duffy (2001) monitored readers’ eye movements as they read ambiguous target words that were presented within short paragraph contexts. Critically, they manipulated overall global context by using the first sentence of the paragraph to introduce a topic that was consistent with either the dominant or the subordinate meaning of the ambiguous target word. For example, different meanings of the word ‘band’ were introduced by the sentence ‘Lisa and John spent months looking for the perfect wedding ring’ vs. ‘Lisa and John loved to go to rock concerts with their friends.’ In addition, they manipulated the local context that either preceded or followed the target word, which was always consistent with the subordinate interpretation of the target word (e.g., ‘It wasn’t until they entered Kay’s jewelry store that they saw the band.’). The results confirm that relatively high-level global contextual information can influence disambiguation. Specifically, the global contextual information strongly constrained participants’ interpretation of the target ambiguous word when no additional local disambiguating information was available, but had no additional effect when it was consistent with local information. Finally, the effect of global context was somewhat delayed when it was inconsistent with local information. In addition to these studies that show that disambiguation can be guided by the local (co-occurrence) or global paragraph/sentence context, recent evidence has indicated that listeners are also influenced by more general background knowledge about the speaker. Specifically, Cai, Gilbert, Davis et al. (2017) showed that listeners take account of the accent of the speaker (British vs. American) when they are required to disambiguate words that are used somewhat differently by these two dialect communities. Specifically, these studies focused on words like ‘bonnet’ that have different dominant meanings in these two dialects. First, their results (on a range of tasks) confirmed that (consistent with the well-attested dominance effects) speakers of British English can more easily access those word meanings that are dominant in their native dialect (i.e., car part). More importantly, they found that the availability of the alternative, non-preferred subordinate meaning, which is dominant in American English (e.g., type of hat), is increased in British participants when the words are spoken in an American accent. These data confirm that the listener’s interpretation is guided not only by the meanings of preceding words but also by the characteristics of the speaker. The likely mechanism that drives this effect was further explored in an experiment that compared strongly accented ambiguous words (UK or US) with targets that were presented in a neutral accent.
These neutral-accent targets were created by morphing British- and American-accented recordings. These critical items were embedded within contexts of filler words
that were all consistently accented with either a UK or US accent. Across a number of experimental conditions, it was found that the interpretation of these targets was driven by the perceived accent of the filler items and not by the specific accent properties of the individual target word. For example, the word ‘bonnet’ was more likely to be given its US meaning when it was surrounded by American-accented speech (compared with a UK accent context), regardless of whether the target word was presented in a US or neutral accent. Cai et al. argued that listeners spontaneously generate an “accent context” for the speaker that they are currently hearing and use this contextual information to modulate the availability of different word meanings (see Cai et al., 2017 for further details of the likely mechanism by which speaker knowledge can influence lexical processing). The authors also raised the possibility that other cues, beyond accent, may constrain lexical-semantic disambiguation. For example, if other salient information about speakers, such as age or gender, can reliably predict which word meanings are used by individual speakers (e.g., if older and younger participants tend to use the word ‘tweet’ differently) then this information might also contribute to word-meaning access. Perhaps most intriguing is the (as yet untested) possibility that listeners can retain information about how specific, familiar individuals tend to use specific words (e.g., psychology colleagues may tend to use the statistical meanings of ‘mean,’ ‘normal,’ and ‘significant’). Future work is clearly needed to determine the extent to which listeners (and readers) acquire and maintain such highly specific knowledge about the distributional statistics of how word meanings are used by specific groups or individuals and the extent to which such factors contribute to successful comprehension in more naturalistic paradigms. Future work is also needed to better understand how and why individual differences in the ability to make use of such cues arise (Blott, Rodd, Ferreira, and Warren, 2021).
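Although the accounts reviewed in this section differ in their details, they share the assumption that meaning availability reflects the joint influence of long-term dominance and whatever contextual cues happen to be reliable. The toy model below is not any published model; the weights, fit values, and threshold are invented. It simply illustrates the shared logic: each meaning of ‘bonnet’ accumulates activation at a rate set jointly by the listener's dominance bias, the fit of the sentence context, and a speaker cue such as accent, and the first meaning to reach threshold is selected.

```python
# Toy cue-combination model (all numbers invented): the meanings of 'bonnet'
# race to a threshold; each meaning's accumulation rate combines the
# listener's long-term dominance bias with contextual support from the
# sentence and from speaker characteristics such as accent.
def select_meaning(dominance, context_fit, speaker_fit,
                   weights=(0.5, 0.35, 0.15), threshold=1.0, dt=0.01):
    activation = {meaning: 0.0 for meaning in dominance}
    t = 0.0
    while True:
        t += dt
        for meaning in activation:
            rate = (weights[0] * dominance[meaning]
                    + weights[1] * context_fit[meaning]
                    + weights[2] * speaker_fit[meaning])
            activation[meaning] += rate * dt
            if activation[meaning] >= threshold:
                return meaning, round(t, 2)

# Hypothetical values for a British listener hearing 'bonnet' in a neutral
# sentence context, with the speaker's accent as the only distinguishing cue.
dominance   = {"car part": 0.8, "hat": 0.2}   # listener's long-term bias
context_fit = {"car part": 0.3, "hat": 0.3}   # neutral sentence context
uk_speaker  = {"car part": 0.9, "hat": 0.1}
us_speaker  = {"car part": 0.1, "hat": 0.9}

print("UK accent:", select_meaning(dominance, context_fit, uk_speaker))
print("US accent:", select_meaning(dominance, context_fit, us_speaker))
```

With these particular numbers the British listener still settles on the car-part meaning under a US accent, but more slowly, which mirrors the graded boost in subordinate-meaning availability described above rather than a wholesale switch.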
23.5 Conclusions Efficient and effective comprehension requires that listeners and readers can efficiently perform the mapping from form to meaning. This mapping is made challenging by the ambiguity that is ubiquitous in natural language: accurate processing of the current visual or auditory input alone is insufficient to guarantee that the correct aspects of a word’s meaning are retrieved. A large body of evidence has revealed that word-meaning access is guided by (i) distributional information about the a priori relative likelihoods of different word meanings, and (ii) a wide range of contextual cues that indicate which meanings are most likely in the current context. Successful comprehension is the result of the ability to rapidly and fluently combine these cues for every word that is encountered such that only relevant aspects of word meaning are integrated into the ongoing representation of the overall meaning of the discourse.
Chapter 24
Learning and Using Written Word Forms Rebecca Treiman and Brett Kessler
24.1 Introduction Over most of human history, knowing a word has involved knowing its phonological form. Nowadays, for people who are literate, knowing a word means knowing its written form as well. In this chapter, we discuss how people learn and use these forms. We begin by discussing how writing systems represent language in a visual form. We then consider the processes that are involved in skilled reading and how children learn to read. The topic of spelling is also considered. We end the chapter by discussing how the learning of orthographic representations can affect the mental lexicon.
24.2 The nature of writing Spoken language is a powerful tool, but it fades quickly. Around 5,000 years ago, people developed ways to freeze language by representing it in a relatively permanent visual form, allowing us to store information outside our heads and to share it with others who may be distant in time or space. One way to do this, and one that took root in some societies, is to assign a distinct mark to each word or morpheme in the lexicon of a language. In a logography of this sort, different words that happen to sound alike would have different written forms. If English were a logography, the ear with which one hears and an ear of corn, though pronounced the same, would be written differently. Another way to record a language is to conceive of it as a sequence of sounds, disregarding meaning. One then assigns a mark to each sound, yielding a phonography. If English were a pure phonological writing system, the words for cereal and serial would have the same written form because they have the same
string of sounds. Whether one uses a logographic or a phonographic approach, one then arranges the marks in an order that corresponds to their order in the spoken language. It is sometimes assumed that a logographic writing system like Chinese does not represent phonology in any way. Indeed, a number of common logograms derive from drawings of the object the morpheme represents: 曲 qū ‘bent’ is the modern form of a picture of a bent object (a typical example of how characters that began as pictorial have lost their original iconicity). But in mainland China, the unrelated morpheme qū ‘yeast’ is also written 曲, for no other reason than that they have the same pronunciation; the character has replaced the more complicated traditional character 麴. This phonetic use of logograms is especially common when representing foreign names: 特朗普 Tèlǎngpǔ ‘Trump’ starts with the logogram 特 tè ‘special’ only because of its pronunciation. Another way pronunciation is important in Chinese writing is in the construction of characters. Over 80% of modern logograms incorporate other logograms (Hsiao and Shillcock, 2006). Typically, one of the components will be a logogram that has the same or similar pronunciation, and the other will indicate the broad semantic category. For example, the morpheme qū as in qūqur ‘cricket’ is written 蛐, where the left side 虫 chóng ‘insect’ gives a hint of the meaning and the right side 曲 qū ‘bent’ gives a hint of the pronunciation. The word hint is used advisedly because in many logograms the phonetic component was already only approximate when the compound character was created, and a few thousand years of sound change has only exacerbated the dissimilarities. The logogram 特 tè ‘special’ has the semantic component 牛 niú ‘bovine’ and the phonetic component 寺 sì ‘temple’—neither of which would be very helpful information to a reader trying to guess what morpheme 特 stands for. The majority of writing systems in use today represent speech at the level of syllables (syllabaries) or phonemes (alphabets). Although these systems are phonographic, most are not purely phonographic. For example, English spells the two meanings of /ɪr/ alike, as ‹ear›, but it spells the two meanings of /si/ differently, as ‹see› and ‹sea›. A given phoneme or phoneme sequence may be spelled differently depending on the morphology of the word in which it appears. For example, the same phonological sequence in American English is usually spelled as ‹ous› at the ends of adjectives (e.g., ‹generous›), as ‹ess› at the ends of nouns referring to females (e.g., ‹waitress›), and in other ways in other types of words (e.g., ‹tennis›; Berg and Aronoff, 2017). The writing systems of Greek and French, among others, have some similar characteristics. Other complexities in the sound–letter relationships of English and other languages reflect the conservatism of written language. Spellings change less quickly over historical time than pronunciations do, meaning that the spellings of some words better reflect how they used to be pronounced than how they are currently pronounced. For example, whine and whale used to be pronounced with initial /hw/, but the /h/ was lost about 200 years ago in most parts of the English-speaking world. This leads to the situation we face today: /w/ has two spellings, one of which is complex in that it is composed of two letters.
As another example, the pronunciation of the vowel in look, book and a number of other words in which this vowel is followed by /k/ changed several hundred years ago, so that the pronunciation was no longer the same as in loot and boom. The
spellings of these words had settled into ‹book›, ‹look›, ‹loot›, and ‹boom› before the change in pronunciation, and they remained unchanged. This means that ‹oo› is now pronounced /ʊ/ in some words, especially those with a following /k/, and /u/ in other words. Readers would have the best chance of obtaining the correct pronunciation of ‹oo› if they considered the context in which it appears, specifically the identity of the following element. Some alphabetic writing systems—especially ones that developed more recently than English and ones that are less conservative in their spellings—come closer to the phonographic ideal than English does. In these systems, a phoneme is spelled the same way in almost all of the words in which it appears, and a letter has the same pronunciation in almost all words. Spanish and Finnish, among other writing systems, have simpler and more consistent links between spellings and sounds than English does. In Spanish, for example, loans from Latin and Greek are spelled in the same way as words that are native to the language, as in ‹fonación› for phonation. Although a phonographic writing system represents the sounds of a language, it does not represent all aspects of the sounds. For example, English spelling fails to differentiate the phonemes /θ/ (as in thin) and /ð/ (as in then), spelling both with ‹th›. The conventional orthography of Hausa, a major language of Niger and Nigeria, fails to mark distinctions of vowel length and tone that are phonemic in the language. As a final example, each symbol of the syllabic writing system that was developed in the 1800s for Cherokee represents a range of syllables that are similar to each other but phonemically different. For example, Ꭸ can represent, among many other syllables, /ge/ or /ke/ with either a short vowel or a long vowel. Underrepresentation of this sort is common across writing systems. At the same time, writing has some redundancy. Some of this redundancy involves the visual properties of its symbols. For example, the crossbar on ‹A› is not essential: A reader who failed to detect it could still identify the letter because the Latin alphabet does not contain a letter ‹Λ›. Indeed, it has been estimated that the identity of the elements of modern writing systems can be determined when, on average, half of the strokes are removed (Changizi and Shimojo, 2005). Other redundancies in writing reflect redundancies in language itself. For example, a reader of English who failed to notice the identity of the second vowel letter of ‹desperate› could still identify the word as ‹desperate› because English does not have words like ‹despirate› and ‹despurate›. A logographic writing system represents the lexical level of language by definition: Each word or morpheme has a symbol. But most phonographic systems highlight the lexical level as well. One way in which they do this is through lexical constancy. In modern writing systems, there is one conventional way to represent a word. The word horse may be pronounced with a higher pitch in one sentence and a lower pitch in another, the difference in intonation conveying such things as surprise or irony, but the string of letters is always ‹horse›. A related property is lexical distinctiveness: ‹horse› represents the word horse and no other word. These principles have some exceptions—‹ax› and ‹axe› are accepted as spellings of the
same word; ‹wind› represents two different words—but such exceptions are not common enough to detract from the view that writing is at its core a system for representing words. Further highlighting the lexical level is the lexical demarcation that is present in many modern writing systems. This is a visual means of showing where one word ends and the next one begins. In English and many other writing systems, spaces between words serve this purpose. Some other writing systems use other marks, such as the ‹:› that was traditionally used to separate words in Amharic. Not all writing systems visually separate words, however. Most notable is Chinese, where each symbol takes up a fixed space. The spaces between symbols are no larger when two adjacent characters are two different words than when they are part of the same word. Writing represents the words of a language, and it almost always arranges the symbols for them in an order that corresponds to their order in the language. But what language does writing represent? There are often differences between the language that we read and write and the language that we hear and speak. In English, for example, adults are more likely to encounter epitome and ennui in reading than in conversation. Children, too, are more likely to be exposed to certain uncommon words and complex syntactic structures when listening to their parents read to them from storybooks or when reading on their own than when listening to everyday speech or television programs (Cameron-Faulkner and Noble, 2013; Hayes and Ahrens, 1988). These differences reflect, in part, the different goals and environments of spoken and written language. For example, because everyday speech is often about entities that are visible to all participants in a conversation, there is less need for the clarifications that can be provided by relative clauses and other complex syntactic structures. The conservatism of written language, which was previously mentioned as a reason for complexities in the correspondences between letters and sounds, also contributes to the linguistic differences between written language and spoken language. For example, whom was once more common in oral speech than it is today, but it still appears fairly often in writing. Differences between written language and spoken language are especially large in some languages. For example, the standard literary form of Arabic is very close to the classical Arabic language spoken over 1,000 years ago and very different from the forms that are spoken today. Even though writing is an order of magnitude more recent than spoken language, and even though humans have not evolved specializations for it, many people in the modern world become experts in its use. In the section that follows, we consider the skills of expert readers.
24.3 Skilled reading Silent reading is a rapid process for educated people. Looking at data predominantly from adult readers of English, Brysbaert (2019) calculated an average speed of 238 words
per minute for silent reading of nonfiction texts and 260 words per minute for fiction. This compares to an average speed of 183 words per minute in oral reading. Spoken language is often presented at a rate of between 140 and 180 words per minute, the range that is recommended for the recording of audiobooks. These numbers show that silent reading can actually be somewhat faster than speaking (see Kilbourn-Ceron and Goldrick, this volume) and listening (see Magnuson and Crinnion, this volume). This is impressive given that reading is, in the words of Gough and Hillinger (1980), an unnatural act. That is, reading is a secondary linguistic skill, unlike the primary linguistic skills of speaking and listening. How do experienced readers manage to perform this unnatural act as quickly as they do? One reason that we read so quickly is that we make a quick hand-off from the visual system to the language system. Because writing involves visual marks, reading begins with the eyes. When people read silently, as modern adults usually do, speech is not overtly involved. But inner speech plays an important role in silent reading as the language system takes over from the visual system. As readers of an alphabetic writing system take in the letters of a word, they use their knowledge about the spelling-to-sound links of their writing system to predict its likely pronunciation. They can often use this assembled phonological code to help access the word’s entry in the mental lexicon. This method works less well for words like plaid, where the phonological code that a person assembles is unlikely to match the correct pronunciation, than for words like maid (e.g., Glushko, 1979). The assembled phonological code for plaid is likely to differ from the correct one in just one element (the vowel /e/ vs. the vowel /æ/), though, meaning that this code can still be of some use in identifying the word. If a written word is familiar, experienced readers can access its lexical entry from its visual form, gaining information in this way about the word’s phonological form and other properties. Phonology that is accessed in this manner is known as addressed phonology. As words are identified, and as the structure of a sentence becomes clear, readers add information to the inner voice that is missing from or incompletely represented in the written input, such as information about intonation. We make eye movements when we read in order to move the fovea—the central part of the visual field—to the location we wish to process. Distinguishing the small marks of writing requires sensitivity to visual detail, and acuity is highest in the fovea. However, we can begin to form a phonological code and identify a word based on information in the parafovea, the belt outside the fovea. Consider a study in which university students silently read sentences on a computer screen like “Molly enjoyed her short break in the afternoon” (Ashby, Treiman, Kessler, and Rayner, 2006). While the reader’s eyes were fixating any of the letters or spaces in ‹Molly enjoyed her short›, a preview nonword appeared on the screen in the place where the target word, ‹break› in this example, would later appear. Once the eyes passed the final ‹t› of ‹short›, the target word replaced the preview. The change occurred so quickly that readers were not normally aware of it. Readers were found to spend less time on the target word when the pronunciation of the preview had the same vowel phoneme as the target than when it did not.
For example, readers spent
Learning and Using Written Word Forms 511 less time looking at ‹break› when the preview was ‹braim› than when it was ‹braum›. Calculating /e/as the pronunciation of the ‹ai› in ‹braim› does not require consideration of context: ‹ai› is normally pronounced as /e/regardless of the identity of the adjacent letters. But similar effects were found when context was influential. For example, readers spent less time on ‹droop› when the preview was ‹droon› than when it was ‹drook›. As mentioned in the preceding section, most English words with ‹oo› before ‹k› have the /ʊ/pronunciation rather than the /u/pronunciation. Readers appeared to take this into account, meaning that performance on ‹droop› did not benefit as much when the preview was ‹drook› as when it was ‹droon›. These and other findings (Schotter, Angele and Rayner, 2012) show that use of parafoveal information helps to increase the speed and efficiency of reading. Also helping us to read as quickly as we do is the redundancy of writing. This redundancy means that we do not need to process every visual element on a page. For example, a reader of English could still recognize ‹furthermore› if he missed the horizontal line on the ‹f›, skipped the last letter, or failed to resolve the order of the ‹r› and the ‹m›. Readers can choose to skip whole phrases or sentences if their goal is to get the gist of the text through skimming rather than to understand the text in detail. When listening to a live speaker, in contrast, we cannot jump ahead to the next sentence or force the speaker to omit words or phrases of our choice. Experienced readers are so skilled at processing the print and moving from visual processing to language processing that what limits their comprehension is usually their knowledge of language. By the teenage years, therefore, differences among people in reading comprehension are closely related to differences in listening comprehension. The two skills are so highly related to one another that they form a single construct in some studies (Adlof, Catts, and Little, 2006). This close association is consistent with what has been called the simple view of reading: the idea that reading comprehension requires the ability to recognize written words and the ability to understand language and that a person’s comprehension ability is the product of his or her abilities in each component (Gough and Tunmer, 1986; Nation, 2019). Once people have become fast and accurate at recognizing individual words, as they typically are by the teenage years, it is primarily differences in language ability that make some people better at understanding what they read than others. Given the volume of reading material that is available in the modern world, we often wish that we could read more quickly than we do. There are limits on our ability to understand language, though, and these limits affect the speed with which we can read as well as the speed with which we can listen. It is not realistic to expect that we can read at a rate of several thousand words per minute with excellent comprehension (Rayner, Schotter, Masson, Potter, and Treiman, 2016). Practice at taking in visual information more quickly will not help if the information comes in more quickly than the language comprehension system can process it. The kind of practice that will make us better readers is more prosaic: practice with language, especially with the vocabulary and syntactic structures that are more common in written language than in spoken language.
24.4 Learning to read The central problem in learning to read is learning to read words: learning their orthographic forms and learning to use these to quickly access stored information about the words. Experienced readers do this with extraordinary ease, but learning to do so is a long process. Some of the things that children need to know in order to learn to read words are the same for any type of writing system. Whether a writing system is logographic or phonographic, children need to know that what adults read when they read to them from storybooks are the small, dark marks and not the large, colorful pictures. Many 3-and 4- year-olds do not appear to understand this (Hiebert, 1983). Children also need to know that the identity of a written word reflects the elements within it: ‹dog› corresponds to dog whether it is near a picture of a dog or a picture of a car. This, too, is not obvious to children who are not yet able to read. When an adult places a card bearing the printed word ‹dog› under a picture of a dog and identifies it as saying “dog,” 3-and 4-year-olds are happy to accept this. But if a toy rabbit moves the card as if by accident to under a picture of a car, children frequently report that the word now says “car” (Bialystok, 2000). Other things that children need to know in order to read are different for different writing systems. For example, children are accustomed to treating objects that are mirror images of one another as instances of the same thing. A mitten is a mitten whether it is presented with the thumb facing left or the thumb facing right. Children who are learning some writing systems, however, must learn to distinguish shapes that differ only in left–right orientation, such as ‹p› and ‹q› in the Latin alphabet andㅏ/a/ and ㅓ/ʌ/in Korean. This is not easy (e.g., Treiman and Kessler, 2011). Acquiring mirror- image invariance is not necessary for learners of a script such as Tamil, which does not include any symbols that differ only in left–right orientation. Learners of phonographic writing systems must learn to treat words as sequences of smaller sounds. Analyzing language at the level of phonemes, phonemic analysis, is difficult for young children. For example, preschool children have difficulty determining that the spoken words steak and sponge begin with the same sound or that smoke and tack end with the same sound (Treiman and Zukowski, 1991). Syllabic analysis, the level of analysis that is required for phonographic writing systems of the syllabic type, is easier. For example, the preschool children in the study just mentioned did well at judging that hammer and hammock start alike and that plastic and heavy do not. Learning to read words, therefore, may be easier in a syllabic writing system than an alphabetic one (Gleitman and Rozin, 1977). Children’s experiences at home help them acquire some of the emergent literacy skills that provide the foundation for word reading. For example, a U.S. parent may encourage a child to identify magnetic letters on the door of the family’s refrigerator and may correct errors such as identifying ‹d› as b. Parents may tell a child the letter sequences in some words, such as the child’s name, although they do not usually explain why words
Learning and Using Written Word Forms 513 contain the letter sequences they do (Farry-Thorn, Treiman, and Robins, 2020). When children enter school, they may be able to identify some letters and read and write their names and a few other words. Learning more than this, however, usually happens at school. How should reading be taught at school? This question has given rise to controversies so heated that they have been called the reading wars (Kim, 2008). One idea is that children will learn to read words largely on their own if adults read to them frequently from storybooks and provide opportunities for them to try reading and writing themselves. The focus, in this view, should be on gaining meaning from print. Teaching about individual letters and sounds is unnecessary and may even be harmful, for these units do not carry meaning on their own. This approach is often called the whole-language approach. An alternative view, in the case of alphabetic writing systems, is that children should be systematically taught about the links between letters and phonemes. Such phonics instruction will allow children to decode words, and accuracy and fluency at the word level will allow children to obtain meaning from texts. Research findings support this view: Phonics instruction is important for learning to read words in an alphabetic writing system. It generally works better than pure whole-language instruction (Castles, Rastle, and Nation, 2018; Ehri, Nunes, Stahl et al., 2001), especially when instruction in phonemic analysis is also provided (Ehri, Nunes, Willows et al., 2001). This is true for typically developing children and, importantly, for children and adolescents with serious problems in reading (Galuschka, Ise, Krick, and Schulte-Körne, 2014). Phonics instruction typically begins by teaching children a sound for each letter of the alphabet. These are sometimes taught in alphabetical order, beginning with /æ/for ‹a› and proceeding to /z/for ‹z›. Children may be taught two sounds for some letters, such as both /k/and /s/for English ‹c›, and single sounds for some letter groups, such as /ʃ/for ‹sh›. Children are taught to use these correspondences to make sense of words’ spellings and to decode written words they have not previously encountered. A common belief in the U.S. and some other countries is that children are ready to read only after they have learned all or almost all letters of the alphabet by name, sound, or both. An alternative and probably more effective approach is to begin by introducing a small set of common letters and teaching children how to read words consisting of those letters. For example, an Austrian program discussed by Feitelson (1988) begins by teaching children about ‹m› and ‹i› and how they are used in ‹Mimi›, the name of a puppet, and in ‹im› ‘in the’. New letters are added one at a time, children using each new letter in reading as they learn it. For example, the third letter that is introduced in this program is ‹a›, and children immediately use it to read ‹mama› ‘mother’ and other words. We do not know of research evaluating this sequential approach, but it seems promising. Children are introduced more quickly to what can be done with letters, and they can see the value in learning about them. 
When choosing which spelling–sound correspondences to teach and in which order, it is useful to consider which correspondences allow for the most efficient and useful mappings between words’ spellings and their pronunciations in a particular language (Vousden, Ellefson, Solity, and Chater, 2011). For English, teaching ‹sh› as a unit that
514 Rebecca Treiman and Brett Kessler has the /ʃ/pronunciation is valuable because almost all of the many words that children will encounter with this letter sequence contain /ʃ/rather than /s/followed by /h/. Phonics programs do generally teach ‹sh› as a unit. There may be value in doing the same for some other letter groups. For example, ‹ook› may be worth teaching as a unit that is pronounced with /ʊk/, given that the large majority of words that children will encounter with this spelling sequence have this pronunciation. Young children’s difficulties in breaking words into individual phonemes provide another reason for using some larger units in early reading instruction. For example, consonant clusters such as ‹st› might be introduced as units given that children tend to think of them this way (Treiman and Zukowski, 1991). For English and other writing systems that take some account of morphology, additional questions arise about how to teach children about this aspect of the system. For example, should children who have gone beyond the beginning stage of learning to read be taught that English words that end with ‹ic› are usually adjectives and that ‹ed› is usually pronounced differently in past tense verbs than it is in other words (e.g., ‹iced› vs. ‹shed›)? Or will children acquire knowledge of such things on their own by applying their implicit statistical learning skills to the words that they see in texts? Some have argued for explicit teaching of morphology for languages such as English (Bowers and Bowers, 2017), and we find merit in this view. Teaching people what they need to know is more effective than trusting that they will learn it on their own, and this is likely to be true for the morphological aspects of written languages as well as the phonological aspects. Learning to read words may best be conceptualized as learning how one’s writing system works. For alphabetic writing systems, this involves learning about the links between spellings and sounds at the level of phonemes: what is traditionally called phonics. For languages such as English, other things are important as well, including learning how the spellings of some words reflect their morphology. A benefit of this way of thinking about learning to read words is that it applies to all types of writing systems. For example, learning about the workings of the Chinese writing system involves learning about the components that make up characters. We should not declare phonics as the winner of the reading wars and retain all aspects of phonics instruction as currently practiced in a particular instructional program or country. There are many questions about the best ways to provide such instruction. The effort that has gone into fighting the reading wars should be redirected toward finding the most effective methods and ensuring that teachers have the content knowledge and the pedagogical skills they need to provide this instruction. Our discussion of learning to read has focused so far on the central problem: learning to read words. In line with the simple view of reading, a widely accepted theory that was mentioned earlier, it is word reading that sets a limit on reading comprehension in younger children. For these children, the correlation between reading comprehension and decoding is high, and the correlation between reading comprehension and listening comprehension is lower. In older children, who have better word reading
Learning and Using Written Word Forms 515 skills, the correlation between reading comprehension and listening comprehension is higher than it is in younger children (Adlof, Catts, and Little, 2006). Although such results are consistent with the simple view of reading, we find the simple view too simple because it can detract attention from the differences between everyday spoken language and the language of books. Being able to read words accurately and efficiently is not sufficient for comprehension if a text contains words that are unfamiliar in their oral form or grammatical structures that are not fully mastered. There are some such differences for all languages, as discussed earlier, and the gap between spoken language and written language is especially large for some languages, including Arabic. Methods are needed to address these gaps. A study with learners of Arabic suggested one possible approach: exposing children to literary language in spoken form in preschool. In this study, benefits were found for reading comprehension in first and second grade (Abu-Rabia, 2000). Parents often think that reading to their children is the best thing they can do to ensure that their children will become good readers. This belief is false in the sense that young children do not usually learn much about written words from being read to. In fact, young children spend little time looking at or thinking about the written words on the page when adults read to them from storybooks; they devote most of their attention to the pictures (Evans and Saint-Aubin, 2005). But reading to children is valuable in other ways, one of which is that it exposes them to the language of books. Knowledge of this language, we have seen, is an important component of reading comprehension. Reading books to children requires few specialized skills, fewer specialized skills than teaching children about how a writing system works. This difference probably helps explain why alphabet instruction is more effective when it is provided by trained teachers than when it is provided by parents (Piasta and Wagner, 2010). We suggest that programs for parents that are designed to foster children’s literacy skills concentrate on encouraging parents to read to their children and providing materials for them to do so, not on having parents teaching their children about letters and reading.
24.5 Spelling

Spelling is more difficult than reading, with many expert readers complaining that they are poor spellers. This phenomenon might lead one to think that there is little correlation between the two skills. In fact, the correlation between word reading ability and spelling ability is around .70 according to Ehri (2000), almost as high as could be expected given the reliability of the tests that are used to measure the skills. The differences that people notice between word reading and spelling largely reflect the fact that just about everyone has words that they can read but not confidently write. Some orthographic entries in our mental lexicons are fuzzy, as when the entry for ‹desperate› specifies that the fifth letter is a vowel but does not specify which one. These gaps hinder spelling more than they do
516 Rebecca Treiman and Brett Kessler reading. It is not that we have one mental lexicon for reading and another for spelling, as some have postulated (Campbell, 1987; Weekes and Coltheart, 1996), but that spelling a word is intrinsically more difficult than reading it. Spelling a word encourages attention to all of its orthographic components, and the analyses that we often perform before or during the spelling process encourage attention to the word’s phonological components. For these reasons, spelling has an important role to play in literacy instruction. Recall the sequential approach to phonics that was recommended earlier in which children begin by learning a small set of letters, use these letters to read words, and then learn additional letters. Spelling can be incorporated into this approach by having children use the letters that they learn to spell words as well as read them. Indeed, there has long been a view that spelling should be stressed more than reading in early literacy instruction, even that it should be introduced before reading (Chomsky, 1971). Children’s spellings provide useful information about the quality of the orthographic information in their lexicons and the kind of instruction that is needed for the quality to improve (Treiman and Kessler, 2014). For example, children who fail to spell consonants in clusters, as in ‹sep› for step, may need practice segmenting clusters in spoken words into their component phonemes. This aspect of phonemic analysis, as mentioned previously, is difficult for children. Even when children produce few or no fully correct spellings on a list of dictated words, we can obtain information that helps to predict their later literacy skills by considering such things as the degree to which their spellings include some appropriate letters (Treiman, Hulslander, Olson et al., 2019). This can help in determining, at an early point, which children are on track to succeed and which children need extra help.
24.6 Effects of learning to read and write

Learning the written forms of words does not just involve adding pieces of information to one's mental lexicon. It can affect what we know or think we know about words. Even as adults, for example, we may encounter a word in reading that we have not encountered in oral form. We construct a phonological form for the item based on the spelling and store that form in the lexicon, but this pronunciation may be incorrect. As a result, we may be better at engaging in coitus or experiencing ennui than at pronouncing them (Seidenberg, 2013). When people learn the spelling of a word that is familiar to them in its oral form, their representation of its phonology may change. For example, someone whose lexical entry for vanilla has the pronunciation /vəˈnɪlə/ may add the pronunciation /væˈnɪlə/, which corresponds more closely to the spelling. Even though the word is hardly ever pronounced as /væˈnɪlə/, people may judge it to start with /væn/ (Taft and
Learning and Using Written Word Forms 517 Hambly, 1985). Changes in individual language users may build up to the point that the spelling pronunciation of a word becomes widely accepted. One example involves medicine, which was earlier pronounced with two syllables in England but is now more commonly pronounced with three syllables. This phenomenon is currently underway with often, which people now often pronounce with a /t/that had been silent for 500 years. Given the ways in which spelling influences our lexical representations, we must be careful when thinking about phonemic analysis skill and its relationship to reading ability. Tests that are designed to tap phonemic analysis ask people to do such things as deleting the final sound of boot and saying the word that remains. Children and adults answer these questions based in part on their knowledge of spelling (Castles and Coltheart, 2004). This means that good performance on phonemic analysis tests is in part a consequence rather than a cause of good reading and spelling ability. Representations of morphology, like representations of phonology, may be influenced by knowledge of writing. For example, learning the spellings of intonation and health may lead people to relate these words to the morphemes tone and heal in ways they would not do if they did not know the spellings. With morphology, as with phonology, knowledge of spelling can sometimes lead to incorrect analyses of words. For example, people may believe that reign is closely related to sovereign because of the shared spelling, even though there is no direct historical link between the two words. Indeed, people’s sense of language as composed of words may be based, in part, on knowledge of a writing system that sets off words by spaces or other marks. Supporting this view, children who do not know how to read have some difficulties when asked to segment orally presented sentences into words, as do illiterate adults (Kurvers, van Hout, and Vallen, 2007). Fluent readers of Chinese, which as mentioned earlier does not mark word boundaries, sometimes disagree when asked about them (Hoosain, 1992). Using children’s morphological awareness or word segmentation skills to predict their future reading performance may run into the same logical problem as using phonemic analysis: These skills reflect a child’s current reading and spelling skills and may be consequences, rather than causes, of reading ability. Teachers, like other literate adults, conceptualize words in ways that are influenced by their knowledge of reading and writing. This can make it hard for them to appreciate the views of children whose mental lexicons do not yet include orthographic representations. For example, we recently encountered a teacher who was helping a child spell bears by telling the child that the last sound of the word is /s/and asking the child which letter should be used to spell the /s/sound. Of course, the pronunciation of bears ends with /z/, not /s/. Such spelling-influenced views of sounds are not a problem for most people, but they are a problem for teachers and the students they are trying to help. The training that is provided to many teachers does not give them sufficient opportunities to learn about the structure of spoken language and the way in which it is represented in writing (Moats, 2014).
24.7 Conclusions

The natural route to the mental lexicon, for hearing people, is through speech. Learning to access words through their visual forms is an additional feature that must be painstakingly bolted onto the language system (Pinker, 1997). Once we are able to read quickly and easily, we often overlook the time and effort that it took us to learn to do so. We may also overlook the ways in which our knowledge of reading and writing has influenced our knowledge about language. In this chapter, we hope to have shed light on these matters and to have shown why the learning and teaching of reading is not a simple task.
Part III B

VIA MEANING
Chapter 25
The Dynamics of Word Production

Oriana Kilbourn-Ceron and Matthew Goldrick
25.1 Introduction

The common sense notion of lexicon as a dictionary implies a static, fixed repository of information about the properties of individual words. Retrieving information from the lexicon can be likened to looking up a word in a printed dictionary. For example, the speaker could "turn to the correct page" in memory and access a word's meaning, pronunciation, and grammatical properties. In this chapter, we discuss evidence from speech production suggesting an alternative view. We characterize the lexicon in production as a process, lexical access, that involves the dynamic interaction of information from multiple lexical representations, resulting in the production of variable word forms (see Purse, Tamminga, and White, this volume, for detailed discussion of word form variation). After outlining our theoretical framework, we examine lexical access in the context of single word production. We then discuss a relatively less-well explored area: how lexical access changes when speakers plan and produce multiple words in connected speech. We conclude by discussing the open theoretical issues raised by findings in connected speech.
25.2 Dynamics of word production

Our framework is situated in the context of connectionist models of language production, which have served as the dominant paradigm for psycholinguistic studies of
speech production since the mid-1980s (e.g., Dell, 1986; MacKay, 1987; Stemberger, 1985). Following general connectionist principles (Smolensky, 1999), mental representations are seen as patterns of activation over simple processing units. For example, representations of word meaning might be instantiated over a set of units that encode semantic (i.e., meaning-related) features. When processing the lexicalized concept CAT, features like [pet], [feline], [animate] might have high levels of activation, while many other unassociated features like [clothing], [inanimate], and so forth, have low levels of activation. A schematic representation is presented in Figure 25.1. The key point is that the representation of a single concept like CAT is distributed over several simple representational units, rather than being in a "single location" (e.g., page 324 of the dictionary). Mental processes—here, retrieval of information about a word—are seen as the spread of activation: transformations of activation patterns by sets of numerical connections between representational units. For example, suppose semantic features are used to access information about the constituent morphemes that make up a word. In a connectionist model, this would be realized by having activation spread along connections between semantic features and corresponding morphemes (e.g., [feline], [pet] would
Figure 25.1 The lexicalized concept CAT represented as a pattern of activation over simple processing units (Concept features [feline], [pet], [animate], [clothing]; the grammatical nodes Noun and Verb; the Morphemes CAT [sg, noun] and DOG; and the Word Form segments /k/, /æ/, /t/, /d/ of [kæt]). Thick circle outlines represent higher levels of activation. This represents an idealized snapshot of the peak activation levels of all the nodes associated with CAT.
Figure 25.2 Illustration of connections and spreading activation between nodes (the same representational units shown in Figure 25.1, now with their connections). Activation of the nodes associated with CAT entails activation of the conceptually related DOG, which shares the nodes [pet] and [animate]. This spreads partial activation to the DOG morpheme.
activate the morpheme representation CAT; note that this morpheme is also realized as a distributed pattern over multiple representational units).¹ A key consequence of viewing representations as distributed patterns, along with processing via spreading activation, is co-activation. Because mental representations are numerical patterns of activation, at any point during processing it is possible—and, given some representational structures, required—that multiple mental representations are simultaneously activated. For example, if the meaning of DOG is a set of features that partially overlap with the meaning of CAT (e.g., [pet]), then activating the meaning of CAT implies that part of the mental representation of DOG is also activated. This is illustrated in Figure 25.2, which now shows some of the connections between units at different levels of representation. Critically, due to the spread of activation, co-activation can arise even in the absence of shared representational units. Following the above example of producing CAT, if semantic features spread activation to morphemes, the link from the feature [pet] to both CAT and DOG morphemes will cause both to be simultaneously activated even though there is no direct link between these two morphemes. Note that this co-activation will be graded; while CAT will receive activation from all the semantic features of CAT, DOG will receive activation from only a subset and will therefore have a lower level of activation.

Importantly, as a product of a process of spreading activation, co-activation is time-varying and context-dependent. This partly reflects the spread of activation. For example, suppose that distinct morpheme representations inhibit one another; CAT and DOG have a negative connection between one another (see Harley, 1995, for an example of this type of spreading activation network architecture). While positive links from shared semantic features will increase the activation of both morphemes, these inhibitory links will make high levels of co-activation unstable. Over several cycles of activation spreading, the activation of DOG will be suppressed by the more-active CAT morpheme unit. In addition to such intrinsic dynamics, there may also be control systems that regulate co-activation by enhancing the activation of target representations (Dell, 1986) or inhibiting the activation of competitors (Dell and O'Seaghdha, 1994). As we discuss in the third section, co-activation can also be modulated by the context in which a word is produced. The activation of recently produced words or planned upcoming words will be higher than their baseline levels (Dell, 1986), and words corresponding to other text or objects in the production context can also receive a boost in activation (Harley, 1984). Under this view, then, the "lexical representation" of a word is a time- and context-dependent pattern of co-activation of multiple representations. In the following sections, we review how such a perspective provides a window into key empirical phenomena in word production.

¹ While mental representations are always patterns of activation, the particular way in which activation is distributed differs across proposals. In localist architectures, a single unit is activated to denote each linguistic element (e.g., there is a CAT unit that is active, and all other units are inactive). In distributed architectures, multiple units are active to varying degrees (and individual units are often not linguistically interpretable). In many cases both representational types yield equivalent predictions (Smolensky, 1986), but they can yield distinct empirical consequences under some conditions (e.g., damage or disruption of processing; Goldrick, 2008).
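To make these dynamics concrete, the sketch below walks through a few cycles of spreading activation in a toy localist network. It is only an illustration of the general mechanism described above: the feature and morpheme units follow Figures 25.1 and 25.2, but the connection weights, inhibition strength, decay, and update rule are hypothetical values chosen for exposition, not parameters of any model cited in this chapter.

# A minimal localist sketch of spreading activation with lateral inhibition.
# All units, weights, and parameters are hypothetical and purely illustrative.

features = {"feline": 1.0, "pet": 1.0, "animate": 1.0, "clothing": 0.0}

# Positive feature-to-morpheme connections (cf. Figure 25.2).
weights = {
    "CAT": {"feline": 1.0, "pet": 1.0, "animate": 1.0},
    "DOG": {"pet": 1.0, "animate": 1.0},  # shares only a subset of features
}

INHIBITION = 2.0  # strength of the negative CAT <-> DOG link
DECAY = 0.2       # passive decay toward a resting level of zero
RATE = 0.1        # how much net input changes activation per cycle

morphemes = {"CAT": 0.0, "DOG": 0.0}

for cycle in range(1, 9):
    updated = {}
    for unit, act in morphemes.items():
        # Excitatory input from the active semantic features...
        excitation = sum(w * features[f] for f, w in weights[unit].items())
        # ...minus inhibition from the competing morpheme unit.
        competitor = next(v for k, v in morphemes.items() if k != unit)
        net_input = excitation - INHIBITION * competitor
        updated[unit] = max(0.0, (1 - DECAY) * act + RATE * net_input)
    morphemes = updated
    print(f"cycle {cycle}: CAT={morphemes['CAT']:.2f}  DOG={morphemes['DOG']:.2f}")

# DOG is co-activated because it shares [pet] and [animate] with CAT, but its
# activation levels off and then falls as the more active CAT unit inhibits it.

Running the sketch shows graded, time-varying co-activation of the kind just described: DOG becomes partially active through the shared features, then levels off and falls as the more active CAT unit suppresses it. Feedback connections or control mechanisms of the sort discussed above could be layered onto the same skeleton.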
25.3 Co-activation: evidence from single word production

While spontaneous speech errors have played an important role in informing theories of speech production (e.g., Garrett, 1975), controlled laboratory studies have primarily focused on the production of single words. Even in this highly constrained context it is clear that word production is a dynamic process involving multiple levels of representation.
25.3.1 Representational distinctions in word production

To communicate an intended message in an interpretable linguistic form, a speaker must retrieve semantic, grammatical, and word form information from memory.
Semantic information is the representation of meaning within the language production system, and grammatical information includes syntactic category (noun, verb, etc.) and noun class (animate/inanimate, common/neuter, masculine/feminine, etc.). Word form information relates to the sounds of the word, syllabic organization, and lexical stress and tone. These different aspects of linguistic knowledge are not retrieved as a single entity as a dictionary-lookup metaphor would suggest, but rather form distinct (yet interacting) sub-components of lexical access (see Nozari, this volume, for discussion of the neural structures realizing these distinctions). A key source of evidence for this is the spontaneously occurring tip of the tongue (TOT) state. A common laboratory paradigm for studying this provides participants with definitions of lower frequency words, and asks if they are in a TOT state (with appropriate controls to verify that they do, in fact, know the target word and are attempting to retrieve it; Gollan and Brown, 2006). In this state, participants are often able to guess, above chance levels, various aspects of word form (e.g., length, first letter/sound) and, independently, grammatical properties (e.g., grammatical gender; Miozzo and Caramazza, 1997). Acquired deficits resulting in persistent TOT-like states provide converging evidence that grammatical information can be accessed when there is no evidence of access to word form information (e.g., Badecker, Miozzo, and Zanuttini, 1995). The fractionation of access to these different aspects of lexical representations suggests that they are not retrieved as a unified lexical entry but rather during distinct stages of processing. There is also some evidence that different aspects of lexical representations are retrieved/processed at different points in time. Schriefers, Meyer, and Levelt (1990) explored this using the picture-word interference paradigm, a Stroop-like task where participants must name a picture while ignoring a distractor word (here, auditorily presented). Schriefers and colleagues found that semantic distractors which overlap in meaning-based category (e.g., 'dog' presented while naming a picture of a cat) have earlier effects than phonological distractors which overlap in sounds (e.g., 'hat' for cat). However, the relative timing of these effects can shift in time (see, e.g., Jescheniak and Schriefers, 2001, for earlier effects of phonological distractors, within the same time course as semantic distractors). While some electrophysiological results are consistent with a distinct time course for retrieval of semantic, grammatical, and phonological information (see Indefrey, 2011, for a review), more recent findings have been variable and inconsistent (see Nozari and Pinet, 2020, for a review). A number of studies have gone beyond these broad representational distinctions to examine the detailed structure of semantic (Vinson, Andrews, and Vigliocco, 2014), grammatical (Bock and Ferreira, 2014), and word form representations (Buchwald, 2014; Goldrick, 2014). Note that unlike a static dictionary, lexical access does not result in the retrieval of a fixed representation of word form; variation is a hallmark of speech (Bürki, 2018).
These can be relatively subtle, gradient changes to the articulatory (and resulting acoustic) properties of the word (e.g., increasing amplitude and duration when a word is emphasized; Arnold and Watson, 2015) as well as more categorical changes (in English, dropping a post-stress schwa; e.g., pronouncing words like camera as camra; Bürki and Gaskell, 2012).
25.3.2 Co-activation in single word production

While there are distinct representations of various aspects of linguistic structure in production, processing of these representations is not limited to discretely separated points in time; there is considerable overlap in the processing of different aspects of structure. The earliest evidence in favor of this idea came from the analysis of semantically related errors in production. While some errors share no sounds with the target (e.g., intended CAT → 'dog'), others are mixed, sharing both word form and meaning (CAT → 'rat'). Under a discrete account, where semantic and phonological information are processed at distinct points in time, we would predict that the probability of mixed errors would be simply the sum of two independent probabilities: the chance of making a semantic error (CAT → dog) and the probability of making form-related errors (CAT → hat). However, in spontaneous speech (e.g., Dell and Reich, 1981) as well as in acquired disorders impacting non-semantic processes (e.g., Rapp and Goldrick, 2000), mixed errors arise at rates significantly higher than predicted by this discrete baseline (see Goldrick, 2006, for a review). To account for the unexpectedly high rate of mixed errors, theories have assumed that semantic and phonological information are co-active during production. Extending the example from the first section of the chapter, assume that initial stages of processing involve the spread of activation from semantic features to morphological representations. These representations in turn activate their constituent sounds (e.g., [feline], [pet] activate CAT, which in turn activates its individual constituent sounds /k/ /æ/ /t/). If we adopt a cascading activation architecture (McClelland, 1979), prior to the selection of a target morpheme, activation freely flows throughout all levels of the system. (Alternatively, in a discrete system, activation is constrained to the current processing stage, and there would be no activation of sound structure during, for example, semantic processing.) Cascading activation allows the co-activation of semantic neighbors to modulate phonological processing (e.g., CAT leads to activation of [pet], which leads to partial activation of RAT, which in turn will lead to activation of /r/), increasing the likelihood that mixed errors will occur during phonological processing. Allowing activation to feed back from phonological representations allows them to influence target morpheme selection, increasing the likelihood of mixed errors at the morphological level (e.g., feedback from /æ/ and /t/, activated by the target CAT, will enhance activation of RAT). Evidence for the co-activation of semantic and phonological information has been documented in other paradigms as well. Many studies of reaction times have examined picture naming in the context of distractor words or pictures to examine this question (e.g., the picture-word interference paradigm discussed above; see Melinger, Branigan, and Pickering, 2014, for a review). For example, Morsella and Miozzo (2002) had participants name a monochromatic line drawing while ignoring a superimposed picture of a different color (e.g., target picture was shown in green, and a distractor picture was shown in red). Pictures with phonologically related distractors ( for
The Dynamics of Word Production 527 target ) were named more quickly than pictures with unrelated distractors ( for target ). Assuming that pictures can only activate phonological information via semantics, this suggests a co-activation of form information based on distinct semantic representations—consistent with a cascading activation account. More recent electrophysiological data is also broadly consistent with co-activation of multiple types of information during word production (although inconsistencies in results mean that the specific degree of co-activation at particular time points is unclear; for reviews, see Munding, Dubarry, and Alario, 2016; Nozari and Pinet, 2020; Strijkers and Costa, 2016a). For example, Miozzo, Pulvermüller, and Houk (2015) examined magnetoencephalographic (MEG) signals during picture naming. Using principal components analysis, they combined multiple measures of word form (e.g., phoneme length, phonological neighborhood density) and semantic properties (e.g., number of semantic features that are critical for identifying a concept) for entry into a regression predicting MEG signals. The results suggested a significant degree of overlap in the time course of semantic and phonological effects, consistent with co-activation. However, using analysis of single trials (which prevents spurious findings of temporal overlap due to variation in timing across participants/trials), Dubarry, Llorens, Trébuchon, Carron, Liégeois-Chauvel, Bénar, and Alario (2017) found relatively low degrees of temporal overlap in activation across brain areas associated with word form vs. meaning. Thus, while studies agree that some degree of co-activation is present in neurophysiological measures, more precise measures of this will require more mature methods and a larger empirical base. Word form variation also provides evidence for co-activation. Several studies have reported that lexical relationships between words modulate within-category variation in speech sounds (see Bürki, 2018, for a recent review). For example, Baese-Berk and Goldrick (2009) examined variation in voice onset time (VOT), a primary phonetic measure distinguishing voiceless and voiced stops (e.g., in English, /p/, /t/, /k/have longer VOTs than their voiced counterparts /b/, /d/, /g/). Examining voiceless stops, Baese-Berk and Goldrick found that words that contrasted with another lexical item beginning with a voiced stop (e.g., contrasts with ) have longer, more voiceless VOTs than words that do not have such a contrast (, which has no contrastive counterpart ). This suggests that phonological information of co- activated words influences within-category variation of speech sounds. Furthermore, this effect is dynamic, increasing when the voiced competitor word is present in the communication context (Baese-Berk and Goldrick, 2009) and when communicating with a speaker that has difficulty perceiving (Buz, Tanenhaus, and Jaeger, 2016).
25.3.3 Interim summary

Results using many different methodologies and paradigms provide clear support for the simultaneous activation of distinct aspects of linguistic structure during lexical access (see also Kroll, Bice, Botezatu, and Zirnstein, this volume, for a review of
528 Oriana Kilbourn-Ceron and Matthew Goldrick evidence for co-activation of languages during multilingual production). However, it is important to acknowledge that the degree of co-activation must be constrained, in order to preserve the functional independence of different forms of structure (as reviewed in Section 25.3.1; Brehm and Goldrick, 2016; Goldrick, 2006; but see Strijkers and Costa, 2016b, for an alternative view). Furthermore, while the precise empirical picture is not entirely clear, there is also a good deal of evidence that the relative activation of different aspects of linguistic knowledge may differ over time. For example, semantic information may be activated earlier than phonological information (Schriefers et al., 1990). This provides some preliminary evidence that even in the context of single word production, co-activation is dynamic; it is controlled throughout the production process such that the amount of co-activation is limited (Dell and O’Seaghdha, 1992; Rapp and Goldrick, 2000).
25.4 Dynamics of word production: evidence from connected speech

Production is not stringing together unrelated episodes of retrieval, but rather involves the integration of information across time, including both what was said previously and what the speaker will say next. Recently produced words influence the speed and phonetic outcomes of lexical access, and the construction of the syntactic frame required for sentence production modulates patterns of co-activation at the phonological level. Connected speech also reveals that planning of upcoming words and structures can affect production of the current word, which can show substantial phonetic differences. Furthermore, the influence of upcoming material is itself variable and subject to contextual constraints, suggesting that a complex, dynamical system underlies lexical access during connected speech.
25.4.1 The recent past modulates processing and co-activation

A word or structure that has been processed and spoken aloud can continue to exert influence on subsequent processing. In studies of single word picture naming, it has been consistently observed that speakers respond more quickly upon second or third presentation of the same stimulus (Jescheniak and Levelt, 1994). Beyond repetition of words themselves, recent experience with syntactic structures and with sub-word phonological units like segments or syllables can affect lexical access. One source of evidence for this comes from studies of priming or adaptation, where processing of a given structure is facilitated by previous processing of the same structure (see Dell, Kelley, Bian, and Holmes, 2019, for a recent review). Syntactic or structural
The Dynamics of Word Production 529 priming provides a well-known example of this phenomenon and highlights the role of recent experience during production of sentence-sized utterances. Participants are primed to produce one of two structural variants with (nearly) identical meaning (e.g., The boy kicked the ball vs. The ball was kicked by the boy) by prior presentation of a sentence sharing the syntactic structure (e.g., The lightning struck the church vs. The church was struck by the lightning). When asked to provide a description of a picture that could be accurately described by either syntactic variant, speakers are biased toward using a structure they have recently heard or used themselves. A key component of this priming effect appears to derive solely from the structure shared by the sentence, rather than overlapping content words, ordering of thematic roles, or shared semantic features (see Pickering and Ferreira, 2008, for a review). This suggests a representation of syntactic structure that is distinct from semantic and phonological elements. As we undertake investigations of word production in sentence contexts, consideration must be given as to how the stages of lexical access are situated with respect to the processing of the syntactic structure that organized multi-word constituents. A study by Konopka (2012) provides evidence that priming of syntactic structures can have direct consequences for the timing of lexical access across multiple words. They examined semantic interference effects in sentences beginning with conjoined noun phrases (e.g., ‘The map and the globe are above the king’). Finding a difference in reaction time between phrases with semantically related and unrelated nouns suggests that lexical access begins for both nouns prior to speech onset. Semantic interference was found only for sentences which were syntactically primed and when the initial noun was high frequency or had recently been elicited in a different task. These results suggest the timing of lexical access for the noun is influenced by the timing of syntactic processing as well as the time course of lexical access for the noun earlier in the sentence. Another type of evidence for the influence of recent experience on lexical access comes from sentence completion with picture naming tasks. Heller and Goldrick (2014) examined how the co-activation of phonologically related words, or neighbors, was modulated by context (e.g., the noun is neighbors with nouns like and words like from different grammatical categories). They compared the production of picture names after reading aloud a sentence (e.g. ‘The bird sings its [picture].’) vs. bare naming (i.e., no sentence preamble). Nouns (i.e., picture names) with a greater number of within-category phonological neighbors, that is, more noun neighbors ( for target ), were named more slowly than nouns with fewer noun neighbors; however, this effect was only present in sentence contexts. This suggests that grammatical processes can modulate the co-activation of phonologically related words, restricting the set of co-activated words to those consistent with the grammatical context. Recently produced words other than the target also exert their influence during target production. Speakers have difficulty with sequences of words in which the first sound is repeated, producing them less quickly and at slower speech rates (Bock, 1987; O’Seaghdha and Marin, 2000; Wheeldon, 2003). 
Jaeger, Furth, and Hilliard (2012) found that in spontaneous production, speakers avoid selecting a verb that would create
530 Oriana Kilbourn-Ceron and Matthew Goldrick overlap with the onset of the subject, for example, Hannah handed/passed the flask to Patty. This suggests that residual activation at the phonological level can exert influence on lexical selection of the target. The fact that these effects are inhibitory presents an interesting contrast with the results of picture-word interference tasks, where phonological overlap with a distractor often results in facilitation of target word production (e.g., Damian and Dumay, 2009; though see Breining, Nozari, and Rapp, 2016). This incongruency highlights the need for research that includes multi-word utterances in order to better understand word production in naturalistic contexts.
25.4.2 The influence of upcoming material on word form processing

While citation forms of words are generally agreed upon and encoded in dictionaries, examination of these same words embedded in spontaneous conversation reveals that their form is highly variable. Beyond the acoustic parameters of pitch, duration, loudness, and intensity, which are modulated to create prosodic contours, word forms can lose, gain, or change segments depending on the words that surround them, including words that appear later in the utterance. This provides clear evidence that phonological processing is not just the retrieval and encoding of a single stored form, but a dynamic process of form selection, context-dependent segmental adjustments, and prosodic integration. The extent to which upcoming words influence processing of the current target word reveals how far speakers plan in advance, what types of representations can be simultaneously active, and when.
25.4.2.1 Context-sensitive segmental adjustments

Word forms are actively adjusted during phonological processing by integrating information about their surrounding phonological context (see Purse, Tamminga, and White, this volume, on phonological variation). Consider the word write, which in citation form ends in a [t]. The addition of the suffix -ing to make writing changes the final sound in write to a flap [ɾ], a shorter, voiced articulation. This change can be triggered not just by the addition of inflection, but by any following word that begins with a vowel, as in write a letter. This process applies quite generally to any word that ends in /t/ or /d/, modulo some conditions on word stress location (Kaisse, 1985). This productivity suggests that the process of word form encoding is generative, actively looking ahead to upcoming phonological material and appropriately adjusting word forms according to language-specific patterns. However, the amount of look-ahead seems to be constrained, especially at the phonological level. While there is evidence that higher-level phonological properties, like number of words and phrases, are known to speakers before they initiate speech (e.g., Wheeldon and Lahiri, 1997; Wynne, Wheeldon, and Lahiri, 2018), fine-grained details like segmental identity are not planned out to the same degree for all words in
The Dynamics of Word Production 531 a sentence before speech onset. Furthermore, speakers seem to have some flexibility as to how many words in advance they complete planning for before speaking, so the influence of surrounding words may differ from instance to instance even when planning the same word in the same sentence context. Wynne et al. (2018) asked speakers to say short sentences in response to a question prompt and manipulated whether or not participants were given time to prepare the response after seeing the target words. They found evidence that speakers used a more incremental planning strategy when the response deadline was immediate, with response times correlating only with the number of syllables in the first word. (N.B. this and other studies suggest that the prosodic word, a grouping of stressed-bearing words plus unstressed function words, is the relevant unit, rather than individual lexical words; see Wheeldon, 2012, for a recent review.) On the other hand, a delayed response deadline led to response times that were correlated with the overall number of prosodic words in the sentence, suggesting that speakers planned at least higher-level phonological structures in advance when given extra planning time. Therefore, an important part of understanding word production in multi-word speech is understanding what aspects of the surrounding context are active during the processing of any given word. Researchers have tested both the upper and lower bounds of word form planning. In regard to the upper bound, studies have found evidence of phonological processing for up to two phonological words in advance, although results are mixed. For example, Costa and Caramazza (2002) used the picture-word interference paradigm to test whether the third word in a noun phrase was phonologically active prior to initiating speech. They tested phrases in English (determiner + adjective + noun) and in Spanish (determiner + noun + adjective), and in both cases found phonological facilitation for the third word in the phrases (which, in both languages, would be the second prosodic word since the determiner does not form its own prosodic word). This has been replicated for other types of syntactic constituents (English: Schnur, Costa, and Caramazza, 2006; Schnur, 2011), but other studies using similar paradigms have failed to find any phonological effects beyond the first prosodic word (Dutch: Meyer, 1996; German: Schriefers and Teruel, 1999). Part of this discrepancy may be due to speakers using different strategies. Michel Lange and Laganaro (2014), testing French speakers, found that phonological facilitation for the second word in a phrase only occurred for speakers with overall slower response latencies, suggesting that faster participants tend to begin speaking before phonological processing of the second word has begun (see also Gambi and Crocker, 2017 and Wagner, Jescheniak, and Schriefers, 2010, for discussion of flexible planning scope). As for the lower bound of word form planning, there is evidence that speakers can prepare articulatory postures and even begin speaking without having prepared a full word. Whalen (1990) compared productions of two-syllable nonsense words like ‘abu’ where one of the letters was masked until speech was initiated, forcing the speaker to quickly integrate new information about upcoming sounds into their speech plan. 
They found that speakers anticipated the upcoming sounds in the articulation of the initial vowel (e.g., the initial vowel of ‘api’ was acoustically distinct from the one in ‘apu’), but
532 Oriana Kilbourn-Ceron and Matthew Goldrick only for sounds that were not masked prior to articulation. Kawamoto and colleagues have explained this pattern of optional co-articulation by arguing that the minimal unit of planning is the segment, and speakers can initiate as soon as a single segment is prepared or choose to wait until all segments in a word become available (Kawamoto and Liu, 2007; for a recent review see Kawamoto, Liu, and Keller, 2015 and Krause and Kawamoto, 2020). While the limits of advance planning have yet to be fully understood, its variability may help us understand variation in word forms. While phonologically conditioned variants like the flap in write a letter are restricted to specific phonological contexts, they do not always appear when they could. Kilbourn-Ceron (2017b) examined /t,d/-final words in a corpus of spontaneous American English speech, and found that only 56.5% of those words in the qualifying environment (i.e., between two vowels) were in fact realized with the flap variant. Results from multiple analyses suggest that this pattern is in part attributable to the variable scope of advance planning. Context-triggered variants appear only when upcoming phonological information is available during processing of the target word; if such information is absent (due to planning difficulty), the citation form is produced. Focusing on lexical frequency as an index of retrieval difficulty, Kilbourn-Ceron, Clayards, and Wagner (2020) showed that flapping becomes increasingly less likely as the difficulty of retrieving the upcoming word increases. The same pattern was found for a case of word form variation in French (liaison; Kilbourn- Ceron, 2017a). Kilbourn-Ceron et al. (2020) extended the analysis of flapping beyond word frequency to include conditional probability of the upcoming word given the target word. Low probability should lead to retrieval difficulty specifically within a particular local context. This was also a strong predictor of flapping likelihood. These results suggest that phonological processing is dynamic, able to incorporate information about upcoming words when it is available, and is also able to proceed incrementally in its absence.
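To illustrate the kind of predictors involved in these corpus analyses, the sketch below computes, for a /t,d/-final word such as write, the frequency of the following word and its conditional probability given the target. The toy word list, tokenization, and maximum-likelihood estimates are illustrative assumptions, not the materials or procedures of the studies cited above.

# Illustrative estimation of two predictors of flapping likelihood:
# the frequency of the upcoming word, and its conditional probability
# given the target word. Toy corpus and smoothing-free estimates only.
from collections import Counter

corpus = "write a letter write a note write it down".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total_tokens = sum(unigrams.values())

def predictors(target: str, following: str) -> tuple[float, float]:
    """Return (frequency of the following word, P(following | target))."""
    frequency = unigrams[following] / total_tokens
    conditional = (bigrams[(target, following)] / unigrams[target]
                   if unigrams[target] else 0.0)
    return frequency, conditional

# On the account described above, easier retrieval of the upcoming word
# (higher frequency, higher conditional probability) should make the
# flapped variant of a /t,d/-final word like "write" more likely.
print(predictors("write", "a"))   # (0.22..., 0.67...)
print(predictors("write", "it"))  # (0.11..., 0.33...)

In an actual analysis, such statistics would be computed over a large conversational corpus and entered, together with the phonological environment, into a statistical model predicting whether each token was realized with the flap.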
25.4.2.2 Storage of multiple forms Variation in word form is not entirely driven by active adjustments based on context. Word form retrieval itself may involve selection among multiple forms with equivalent meanings (a key tenet of “exemplar” perspectives in phonological theory: Bybee, 2001; Pierrehumbert, 2001).2 For example, in French the “schwa” vowel is often dropped in connected speech, so many words like fenêtre ‘window’ have reduced ([fnɛtʀ]) and non-reduced variants ([fənɛtʀ]). This could plausibly be the result of an optional online adjustment during phonological processing, as it is applicable to almost any word containing an unstressed schwa (Gess, Lyche, and Meisenberg, 2012). However, Bürki, Ernestus, and Frauenfelder (2010) found that speakers are sensitive to the relative frequency of the reduced and non-reduced variants for each word. They found that naming 2 A related and ongoing area of research is the storage of morphologically complex words; see Embick, Creemers, and Davies (this volume). In this section, we restrict our discussion to monomorphemic words that show variation in phonological form.
The Dynamics of Word Production 533 latencies were shorter for the reduced variant of a word only when the reduced variant was more common in everyday speech than the unreduced variant. Furthermore, they showed that this effect did not extend to non-words, which were produced equally quickly with or without a schwa. They conclude that speakers must store both variants, and also keep track of their relative frequency in long-term memory. Bürki and colleagues have provided further evidence for storage of multiple forms from the production of determiners in French. The form of determiners in French is based on the grammatical class of their associated head noun as well as the phonological shape of the adjacent word. For example, the definite determiner for feminine nouns is la, as in la fraise ‘the strawberry,’ but is reduced to l’ before a vowel, as in l’ancienne fraise ‘the old strawberry.’ In principle, only the phonological shape of the adjacent word should be taken into account if l’ is derived from la during phonological processing. However, Bürki, Laganaro, and Alario (2014) found that in phrases like l’ancienne fraise, where the onset of the noun imposes a conflicting phonological constraint, naming latencies are longer, suggesting that the phonological representations of all three words are active during the selection of the determiner’s form. Similar phonological consistency effects have been found for other French determiners (Bürki, Besana, Degiorgi, Gilbert, and Alario, 2019), Italian determiners (Miozzo and Caramazza, 1999), and English indefinite determiners (Spalek, Bock, and Schriefers, 2010). Bürki et al. (2019) argue that phonological consistency effects arise due to competition between the multiple stored forms of the determiner, and against a local adjustment from la to l’ during later stages of phonological encoding. This ambiguity between two potential mechanisms of variation has also been explored in sociolinguistic frameworks. Guy (2007) presented an analysis of coronal stop deletion (CSD), which relies on both mechanisms to account for phonologically sensitive yet variable patterns. CSD refers to the optional deletion of word-final /t/or / d/, which has been well documented in several communities of English speakers (see Tamminga, 2018, and references cited therein). Although overall rates of CSD differ, it has consistently been found to apply at higher rates when the following segment is a consonant, and less when the following segment is a vowel. Guy (2007) found that the difference between these two contexts is diminished for the highly frequent connective and, which is much more likely to be produced without the final /d/regardless of what segment follows. He suggests that CSD is a process that applies generally to all words in the lexicon during phonological processing, but that and also has a stored variant that lacks the final /d/. While the social factors that condition phonological variability, especially for CSD, have been an active area of research for many years, recent work in the sociolinguistic tradition has begun to integrate the study of individual-level cognitive factors, such as the time course of lexical access, into their broader study of variation. 
Tanner, Sonderegger, and Wagner (2017) and Tamminga (2018) present studies of CSD that link the variability of CSD, and in particular the variable influence of the following segment, to the scope of advance planning during speech production (see also Tamminga, MacKenzie, and Embick, 2016, for discussion of other sources of variation at the
individual level). Ultimately, a complete account of word production will include an explanation of how social information is integrated with grammatical information during speech to yield the complex pattern of variation we observe in naturalistic speech (see e.g. Babel and Munson, 2014, for discussion).
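The two mechanisms at issue in this section, storage of multiple variants whose accessibility tracks relative usage frequency and an online context-triggered adjustment that applies only when the relevant context has been planned, can be summarized in a toy production routine. The lexical entries, variant frequencies, and the deletion rule below are invented for illustration and stand in for the French schwa and English CSD cases discussed above.

```python
import random

# Toy lexicon: each word lists stored variants with relative usage frequencies
# (exemplar-style storage), as suggested for French schwa words and for 'and'.
LEXICON = {
    "fenêtre": {"fənɛtʀ": 0.4, "fnɛtʀ": 0.6},   # hypothetical variant frequencies
    "and":     {"ænd": 0.3, "æn": 0.7},
}

def retrieve_variant(word):
    """Mechanism 1: select among stored variants, weighted by relative frequency."""
    variants, weights = zip(*LEXICON[word].items())
    return random.choices(variants, weights=weights, k=1)[0]

def apply_online_rule(form, next_segment_available, next_segment):
    """Mechanism 2: a context-triggered adjustment (a stand-in for CSD), which
    can only apply if the upcoming segment has already been planned."""
    if next_segment_available and form.endswith(("t", "d")) and next_segment in "ptkbdg":
        return form[:-1]  # delete the final coronal stop before a consonant
    return form

# Example: 'and' before 'bread', with the upcoming word already planned.
form = retrieve_variant("and")
print(apply_online_rule(form, next_segment_available=True, next_segment="b"))
print(retrieve_variant("fenêtre"))
```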
25.4.2.4 Open issues in understanding word form variation The existence of word form variability, and the multiple mechanisms that support it, raise a number of questions. Here, we briefly touch on two. First, what determines whether a word form is variable vs. fixed (within a context)? For example, schwa- deletion in words like fenêtre is optional within the qualifying phonological context,3 whereas determiner form is completely restricted and invariant within a given phonological context. How does the word selection process differ in these two situations? Second, given the availability of either computing word form variants on-the-fly or storing them in memory, how does the production system determine which forms should be stored? Words like French fenêtre can actually appear in several different forms: the final cluster may be reduced to [fənɛt], or may appear with an extra schwa [fənɛtʀə], and both these forms may themselves appear without the initial schwa [fnɛt], [fnɛtʀə]. Which forms are stored, and why? Are there different storage criteria for function words like determiners and open-class words like nouns and verbs, or is it simply a matter of frequency of usage, as suggested by some storage-based accounts of word form variation (e.g., exemplar theories; Bybee, 2007)? The investigation of issues like these could benefit from taking into consideration the dynamical nature of word production, including the co-activation patterns associated with individual words and how the patterns are modulated in sentence contexts. For example, there is evidence that for a given task, planning scope differs between individuals (Michel Lange and Laganaro, 2014), suggesting that different planning strategies are available, which could in turn modulate the dynamics of word production by changing which words are co-activated at a given point in the planning process. These differences in planning scope appear to be present even within individuals (Kilbourn-Ceron, 2017b; Kilbourn-Ceron et al., 2020). How does this variability in planning scope influence the planning of both optionally variable word forms (e.g., variable liaison words in French) and “fixed” variation (e.g., determiners in French, which require advance planning.) Relatedly, how does within-and between-individual variation influence whether word form variants are stored or computed on the fly? For example, if a word regularly occurs in a difficult planning context, are speakers more likely to store those variants? Situating such issues within the dynamic, context-specific nature of speech production will provide new insights into the nature of word form planning.
3 There is a constraint that disallows deletion of schwa after two consonants, blocking deletion for example in la belle fenêtre [la bɛl fənɛtʀ].
25.5 Conclusions and future directions Reviewing the current state of research on word production reveals the fundamentally dynamic nature of the lexicon. Rather than being a static repository of sound-meaning links, the lexicon is rich with associations between distinct semantic, grammatical, and phonological aspects of lexical entries. We have reviewed evidence that lexical access entails co-activation at each of these levels. We have also seen that the process of word production is itself dynamic, and changes depending on context. Recent context can create different conditions for lexical access, altering co-activation and word form processing. To ensure that the final product of lexical access is congruent with the grammatical and phonological context in which it will be embedded, speakers modify segments and/or select different word forms. Since the majority of psycholinguistic studies so far have focused on single word naming, we still have much to learn about word production in context, particularly in sentence contexts. How is co-activation managed in, for example, a sentence with multiple semantically and phonologically related nouns? What aspects of previous and upcoming words are available during target word encoding? Echoing Bürki (2018), we suggest that studying word form variation is a fruitful and underexplored area for understanding these processes; it is a key tool that will allow us to develop models accounting for the processing of naturalistic speech.
Acknowledgment Supported in part by a B3 Postdoctoral Fellowship award no. 255232 to Oriana Kilbourn-Ceron from the Fonds de recherche du Québec.
Chapter 26
The neural basis of word production
Nazbanou Nozari
26.1 General approaches to studying the neural basis of language processing Two general approaches exist to uncovering the neural basis of language processing. The empirical approach entails the manipulation of language processing demands in two (or more) conditions, and the identification of brain regions preferentially involved in the condition(s) with more prominent linguistic demands. While useful for getting a general sense of the network involved in language processing, the absence of a theoretical framework limits the interpretations that can be assigned to regions in this network. The theoretical approach, on the other hand, starts with a theoretical model, the components or parameters of which are the target of the neural investigation. This approach thus allows for a more meaningful interpretation of the findings, but the caveat is that such interpretation is dependent on the specific model. Models of language production vary considerably in their scope and assumptions. For example, Hickok (2012) assumes that “speech production is fundamentally a motor control problem.” (p. 137), while psycholinguistic models see the main challenge of production as mapping meaning to sound, which includes a great deal more than motor control of speech (e.g., Dell, 1986; Levelt, Roelofs, and Meyer, 1999). Even when the scope is agreed upon, models still vary substantially in the number and nature of the layers of representations they propose and the dynamics of information flow between these layers. In this chapter, I adopt a theoretical approach inspired by the psycholinguistic tradition. Language production encompasses several tasks including reading, writing, and typing. But given the limited space, I will focus on a hallmark production task; namely,
The neural basis of word production 537 oral production of a word from meaning. Often tested experimentally using picture naming, this task requires activating semantic knowledge, selecting the correct word, mapping it onto a sequence of sounds, and finally articulating it. A neural account of such a process must therefore cover the neural basis of semantic, phonological, and articulatory-motor processing, at a minimum. Instead of adopting a specific psycholinguistic model, however, I will use a schematic model of word production, the components of which have been derived from various models and empirical findings, and will opt for the minimum number of layers necessary for mapping meaning to sounds.
26.2 What representations are involved in word production? Most empirical studies of the neural basis of word production have adopted the representations assumed by symbolic models of production (e.g., Kemmerer, 2018). This approach removes the burden of defining the nature of representations from neural investigations, but it must be noted that, unlike symbolic models, neural processing is continuous. It is thus reasonable to ask whether the distinct representations assumed by symbolic models really do have distinct neural representations. In my view, this is the biggest challenge in uncovering the neural basis of word production. I will thus dedicate some space to unpacking this problem. The readers only interested in the discussion of the neural evidence may skip this section. Figure 26.1 shows a schematic model of word production. Processing begins with a “message,” that is, semantic knowledge, and ends with “sound,” that is, the acoustic product of the articulatory system. Because the relationship between semantics and sounds is not systematic (e.g., not all male entities correspond to words that have the sound /m/), it is safe to assume that phonology does not simply spring from semantics, and that the two have distinct representations that must be clearly separated at the neural level. Since relevant phonological representations must be activated from semantics, we can further assume that there must be at least one intermediate layer of representation that mediates this mapping. We often think of this layer of representation as “words.” Symbolic models, however, assume multiple intermediate layers and do not always agree on the nature of the representations in these layers. For example, Levelt et al. (1999) assume that semantic features must first converge on localist representations called “lexical concepts.” Such representations are then mapped onto a different kind of representations called “lemmas,” that is, representations that link lexical concepts to syntactic information. Moreover, it is assumed that lemmas have different representations from morphemes (smallest units that convey meaning; e.g., ‘swim’ and ‘er’ in ‘swimmer’). Other symbolic models pose additional representations for word-like
[Figure 26.1 depicts visual input (objects/actions) feeding layers labeled Semantic features, Words, Phonology, and Motor output, with a syntactic frame for the example sentence 'the cat chased the rat', a syllabic frame (onset, nucleus, coda; e.g., /k/ /æ/ /t/), and general-purpose monitoring and control centers.]
Figure 26.1 Schematic of a word production model. D, determiner; N, noun; P, phrase; S, sentence; V, verb.
units, such as lexemes (domain-specific representations that have access to segments such as phonemes in spoken and graphemes in written production), and so forth. Should we look for the neural correlates of lexical concepts separately from lemmas, lexemes, and morphemes? Given the continuity of neural processing, I do not believe such separation to be either well-motivated or fruitful at the neural level. Empirical attempts in this vein have not uncovered clearly separate regions either, especially for higher levels of language processing such as lexical semantic processing. Such distinctions are clearer at the lower levels of production, such as representations of learned sequences (chunked into common syllables or consonant clusters) vs. representations of novel sequences (as individual segments). But even in those cases, as I will discuss in later sections, these two types of sequences are contained in the same neural regions. What is different are the neural pathways that map the abstract representations of these sequences onto motor commands. Therefore, for the purpose of a neural investigation, I will commit to one independent layer of representation between semantic features and phonemes, and call it simply “words” (Foygel and Dell, 2000), emphasizing that this choice does not invalidate the utility of linguistic representations such as morphemes for purposes other than neural investigations. I will also define a final layer of motor output in the schematic model, which I will unpack when discussing the neural correlates of motor production. Also important for modeling language production is the concept of separation of content and frame (Chomsky, 1975; Garrett, 1975; Lashley, 1951). The idea is that once a conceptual
The neural basis of word production 539 message has been constructed, two general kinds of frame must be built for the insertion of linguistic content. The syntactic frame is built under the guidance of the semantic message and syntactic rules, and identifies the phrasal structure—with syntactically labeled slots—for the insertion of lexical items. The segmental frame is built under the guidance of language-specific phonotactic rules, and contains information about the order, syllabic structure, and perhaps other aspects of segmental encoding, such as the stress pattern. This view has been embraced by both connectionist (e.g., Dell, 1988; MacKay, 1987; Stemberger, 1985) and symbolic models of segmental encoding (e.g., Meyer, 1990; Shattuck-Hufnagel, 1979), although alternatives such as parallel distributed processing approaches to content- frame separations have also been proposed (Dell, Juliano, and Govindjee, 1993). Since the focus of this chapter is on single word production, I will refrain from discussing the syntactic frames, which are primarily relevant to phrase-and sentence-level production, although I adopt the view that such frames exist and interact with representations at a level higher than phonology (Figure 26.1). However, as will be seen in the section on “articulatory-phonetic encoding and motor production,” the content-frame separation is immediately relevant to the discussion of the neural correlates of motor speech processing. To summarize, my schematic model (Figure 26.1) consists of semantic features (i.e., all pieces of information, including sensory-motor representations, that make up a concept), words (conjunctions of semantic features, also linked to syntactic information), phonology (abstract representations of sounds), and motor output, which itself comprises multiple parts. These representations interact closely with different types of frames at different levels (e.g., syntactic frames at the word level, syllabic frames at the phonological layer, etc.). This entire system is constantly monitored and regulated by monitoring and control mechanisms. The next four sections discuss the neural correlates of this schematic model. Studies of the neural correlates of language processing have employed various methodologies, including neuroimaging methods (e.g., PET, fMRI, tractography methods, etc.), lesion studies, brain stimulation (e.g., transcranial magnetic stimulation, or TMS, and transcranial indirect current stimulation, or tDCS), and electrophysiological recordings. We have recently written a comprehensive review of the electrophysiological studies of word production (Nozari and Pinet, under review). In light of the inferior spatial resolution of the most common method in this group (EEG), and the general problems we have laid out in that review regarding the interpretations assigned to the findings from this literature, I will not include the EEG data in the current chapter. Instead, I will use a combination of neuroimaging, lesion-based, and brain stimulation studies to discuss converging findings on the neural basis of word production.
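As a deliberately minimal illustration of the layered architecture just summarized, the sketch below spreads activation from a handful of semantic features through a single "words" layer to phonemes, in the spirit of interactive two-step models such as Foygel and Dell (2000). The feature set, connection weights, and noise level are toy values chosen for exposition, not parameters of any published model.

```python
import random

# Toy network: semantic features -> words -> phonemes (one intermediate layer,
# as in the schematic model of Figure 26.1).
FEATURES_TO_WORDS = {
    "feline":   {"cat": 1.0, "rat": 0.0, "dog": 0.0},
    "pet":      {"cat": 0.5, "rat": 0.1, "dog": 0.5},
    "whiskers": {"cat": 0.7, "rat": 0.6, "dog": 0.2},
}
WORDS_TO_PHONEMES = {
    "cat": ["k", "æ", "t"],
    "rat": ["r", "æ", "t"],
    "dog": ["d", "ɔ", "g"],
}

def name_picture(active_features, noise=0.1, seed=None):
    rng = random.Random(seed)
    # Step 1: spread activation from semantic features to words, then select one.
    word_act = {w: 0.0 for w in WORDS_TO_PHONEMES}
    for feat in active_features:
        for word, weight in FEATURES_TO_WORDS[feat].items():
            word_act[word] += weight
    word_act = {w: a + rng.gauss(0, noise) for w, a in word_act.items()}
    selected = max(word_act, key=word_act.get)   # highest-activation word wins
    # Step 2: activate the selected word's phonemes (its phonological code).
    return selected, WORDS_TO_PHONEMES[selected]

print(name_picture({"feline", "pet", "whiskers"}, seed=1))
```

Because semantically related competitors receive partial activation from shared features, adding noise occasionally yields a semantic error (e.g., 'dog' for 'cat'), mirroring the error patterns discussed in the next section.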
26.3 Neural correlates of lexical semantic processing Many attempts at word production start with conceptualization, that is, activation of semantics. Several chapters in the book (e.g., Jackendoff, this volume; Stojnic and
540 Nazbanou Nozari Lepore; this volume, and Landau, this volume) are dedicated to the nature of semantic representations, so I will refrain from a long discussion here, but will point out two opposing views that have been taken in computational models of word production. Some models assume distributed semantic representations, or “semantic features” (e.g., Caramazza, 1997; Foygel and Dell, 2000). Others have insisted on non-decompositional semantics (Levelt et al., 1999; Roelofs, 1992), based on two arguments: (a) that speakers do not name superordinates of the target (e.g., ‘animal’ instead of ‘horse’; the hyperonym problem; Levelt, 1989), and (b) that words with more complex feature sets are not harder to access than those with simple feature sets (the complexity problem; Levelt, Schreuder, and Hoenkamp, 1978). The evidence against the first argument comes from individuals with aphasia, who, not uncommonly, produce superordinates instead of target names. The second argument also need not be true, since under a distributed view activation of a subset of features is sufficient to activate the concept. In fact, activation of concepts through the activation of various subsets of their features is critical in light of the evidence supporting the flexible context-dependent view of concepts (see Yee and Thompson-Schill, 2016 for a review). This position also naturally accommodates embodied views on concept processing (e.g., Barsalou, 1999). I will, therefore, adopt the distributed view. According to the distributed view of semantics (e.g., classic models such as Meynert and Wernicke’s, Eggert, 1977; embodied theories, e.g., Martin, 2016; and newer theories such as the hub-and-spokes model of semantic representation, e.g., Lambon Ralph, Jeffries, Patterson, and Rogers, 2017; Patterson, Nestor, and Rogers, 2007), concepts are constructed of various pieces of information learned by experience and encoded in modality-specific cortices (Fischer and Zwaan, 2008; Kiefer and Pulvermüller, 2012; Meteyard, Cuadrado, Bahrami, and Vigliocco, 2012). Some theories posit that such features converge onto unifying hubs (called amodal, cross-modal, heteromodal, transmodal, or supramodal representations by various research groups) that are assumed to be housed bilaterally in the anterior temporal lobes (ATLs; see Lambon Ralph et al., 2017 for a review). The claim that ATL is the main semantic hub has been argued primarily on the basis of semantic dementia, a neurodegenerative condition causing atrophy of bilateral anterior ventral and polar temporal regions, which causes consistent and pervasive semantic impairment in all modalities and almost all types of concepts, except the knowledge of simple numbers (e.g., Bozeat, Lambon Ralph, Patterson, Garrard, and Hodges, 2000; see also semantic variant of primary progressive aphasia, e.g., Gorno‐Tempini, Dronkers, Rankin et al., 2004; but see Simmons and Martin, 2009 for a different perspective). More recently, additional hubs have been proposed in temporal and parietal regions that are characteristically distant from primary sensory and motor cortices, but have rich connections to the modal association cortices (e.g., Binder, 2016; Binder, Desai, Graves, and Conant, 2009). There are good arguments for why hubs, that is, high-level conjunctive representations that summarize a number of semantic features into a concept, would be useful for abstract, yet flexible and context-sensitive, representation of pure semantic knowledge (Binder, 2016). 
But since the focus of this chapter is mapping semantic features onto
The neural basis of word production 541 sounds, such conjunctive representations are naturally necessary as the mediating layer, for the reasons discussed in the previous section. Some attempts have been made to separate these representations into lexical concept nodes and lemmas. For example, Kemmerer (2019) identifies the locus of the interference in cyclic blocked naming paradigms (where participants are progressively slowed down by naming a small set of semantically related pictures) as “lexical concepts,” but views pure anomia as a condition targeting “lemmas.” I do not know of any empirical evidence that clearly suggests that, as far as the “word” layer goes, these two cases involve different representations. Similarly, simulations of semantic errors using the lesioned version of the Foygel and Dell (2000) model have been ascribed to lemmas (in contrast to lexical concepts; Kemmerer, 2018), but in this, and most other versions of the two-step interactive model by Dell and his colleagues, a single “word” layer serves both to combine semantic features and link them to syntactic information, without differentiating between lexical concepts and lemmas. In short, the conjunctive representation of semantic knowledge (not to be confused with the distributed representation of semantic knowledge) and the representations that must mediate the mapping of meaning to sound (generally labeled as “words”) are similar enough, and, as suggested by the evidence below, spatially close enough, to be discussed in one place, although a hierarchy of such representations is not impossible. I will thus discuss them in one place. Note that this is not the same as assuming semantic knowledge and words are the same. They are not, as shown in Figure 26.1, and they can be damaged separately. Rather, the issue is whether post-semantic representations that unify various semantic features into a lexical concept are distinct enough from lemmas to be considered two separate sets of representations or not. I adopt the position that the latter distinction is not critical for a neural investigation. A similar approach was adopted by Binder et al. (2009). In one of the largest and most meticulously controlled meta-analyses of neuroimaging data in semantic processing, the authors analyzed data from 120 studies targeting semantic access from (spoken or written) words, and identified 1,145 foci of activation representing a distributed lexical semantic network. The thresholded activation likelihood estimate (ALE) map lateralized the effect largely to the left hemisphere (with some extensions to the right in angular gyrus and posterior cingulate cortex), and identified seven key regions in temporal, parietal, frontal, and paralimbic areas (Figure 26.2). I will briefly review the hypothesized roles of these main areas in light of the empirical evidence.
26.3.1 Temporal regions A main region implicated in the study of Binder et al. (2009) was the lateral part of temporal lobe, including the entire length of the middle temporal gyrus (MTG) and posterior parts of the inferior temporal gyrus (ITG; Figure 26.2, region 1). MTG (and less frequently posterior ITG) activation has been reliably reported during picture naming (de Zubicaray, Miozzo, Johnson, Schiller, and McMahon, 2012; Maess, Friederici, Damian, Meyer, and Levelt, 2002; Moore and Price, 1999; Murtha, Chertkow,
Figure 26.2 Regions involved in semantic-lexical processing. (1) Lateral temporal lobe, including the entire length of the middle temporal gyrus (MTG) and posterior portions of the inferior temporal gyrus (ITG); (2) a ventromedial region of the temporal lobe centered on mid-fusiform gyrus and adjacent parahippocampus; (3) angular gyrus (AG) and adjacent supramarginal gyrus (SMG); (4) ventromedial and orbital prefrontal cortex; (5) dorsomedial prefrontal cortex in the superior frontal gyrus and adjacent middle frontal gyrus; (6) posterior cingulate gyrus and adjacent ventral precuneus; (7) inferior frontal gyrus (IFG), especially the pars orbitalis. Reproduced from Binder et al. (2009).
Beauregard, and Evans, 1999). Moreover, activation of this region is sensitive to the co-activation of words that are semantically similar during picture naming (e.g., de Zubicaray, Wilson, McMahon, and Muthiah, 2001; see Nozari and Pinet, under review for a review). Left mid-MTG is also implicated in the meta-analysis of Indefrey and Levelt (2004), in which the authors contrasted regions activated during picture naming and associative word generation (both of which require lexical retrieval) with those activated by tasks like reading words and pseudowords (which they argued have the activation of phonological code as the first processing step). In the same vein, along with ATL, damage to MTG has been shown to correlate with semantic errors (e.g., saying ‘dog’ instead of ‘cat’) in picture naming tasks (Henseler, Regenbrecht, and Obrig, 2014; Mirman, Chen, Zhang et al., 2015; Schwartz, Kimberg, Walker et al., 2011, 2009; Walker, Schwartz, Kimberg et al., 2011). Importantly, MTG is also consistently implicated in tasks requiring word comprehension, and although selective damage to MTG is rare, it is often associated with language comprehension and semantic deficits (Dronkers, Wilkins, Van Valin, Redfern, and Jaeger, 2004; Hillis and Caramazza, 1991; Kertesz, Lau, and Polk, 1993). Large lesions to MTG and ITG (sometimes along with the fusiform and angular gyri) lead to transcortical sensory aphasia, a syndrome with impaired speech comprehension despite intact phonological production abilities, such as auditory repetition of words and sentences (Alexander, Hiltbrunner, and Fischer, 1989; Kertesz, Sheppard, and MacKenzie, 1982; Rapcsak and Rubens, 1994). The convergence of lexical semantic deficits in both comprehension and production after damage to anterior and lateral aspects of the temporal lobe implies that
The neural basis of word production 543 these regions most likely store conjunctive representations that are shared between production and comprehension (Ben Shalom and Poeppel, 2008; Friederici, 2002). The role of the more posterior parts of MTG (pMTG) is less clear. Some have linked this region to anomia (Antonucci, Beeson, and Rapcsak, 2004; Baldo, Arévalo, Patterson, and Dronkers, 2013; Raymer, Foundas, Maher et al., 1997), although failure to find a word even in the presence of good semantic comprehension may have different etiologies, including a failure to activate the right word from the concept, failure of inhibiting competing words (e.g., Cloutman, Gottesman, Chaudhry et al., 2009; Nozari, 2019), or failure of activating representations further downstream. Thus without a finer-grained analysis of error types and other accompanying deficits, association with anomia is not particularly telling about the function of a neural region. Others have proposed that the most posterior parts of the left middle and inferior lateral temporal cortex, that is, the temporal-occipital junction, show the greatest concentration of activation foci for processing the meaning of artifacts (as opposed to living things) in neuroimaging studies (see Binder et al. 2009, for a review of these studies). Together with the vicinity of this region to the visual motion processing and parietal praxis-coding regions, this finding has been taken to imply the region’s specialization for processing the visual attributes of actions and tools (A. Martin, Ungerleider, and Haxby, 2000). In line with this claim, Hanna Damasio and her colleagues found lesions to this region to cause a particular difficulty in naming common manipulable objects such as a fork (as opposed to proper nouns and nouns of common animals; Damasio, Grabowski, Tranel, Hichwa, and Damasio, 1996). This position follows the view that semantic knowledge is systematically organized in the temporal lobe, which is agreed upon by most researchers, although substantial disagreement exists about the nature of this organization (Binder et al., 2009; Damasio et al., 1996; Lambon Ralph et al., 2017; Martin, 2016). A third account has been proposed for the role of pMTG as a region involved in implementing cognitive control over semantic retrieval (e.g., Lambon Ralph et al., 2017). For example, several studies from the same research group have shown that applying inhibitory TMS to this region disrupts semantic processing most strongly in conditions with high cognitive control demand, such as matching words with low vs. high semantic association (salt-grain vs. salt-pepper; Lambon Ralph et al., 2017; Whitney, Kirk, O’Sullivan, Lambon Ralph, and Jefferies, 2011a, 2011b). Not far from this interpretation, Binder (2017) interprets the more prominent involvement of this region in processing complex sentences vs. simple words as evidence of a work space for integrating the meaning of multiple words while their phonological forms are held active in phonological working memory, a task for which the pMTG is well suited based on its rich connectivity to the angular gyrus, anterior and inferior temporal lobe, and inferior and superior frontal gyri. In contrast, the superior temporal gyrus (STG) does not seem to be reliably implicated in studies involving semantic processing (Binder et al., 2009; see also e.g., Price, 2000). 
It has sometimes been implicated during overt word production (de Zubicaray et al., 2001; Hocking, McMahon, and de Zubicaray, 2008), but this probably reflects a role in the auditory processing of self-produced speech, in line with the region’s confirmed role in auditory processing (see Poeppel and Sun, this volume). I will return to this when
discussing speech monitoring. Binder et al.'s (2009) meta-analysis also identified a focal region in the ventral part of the temporal cortex in fusiform and parahippocampal gyri (Figure 26.2, region 2). These regions are not often reported in studies of language processing, with the exception of reading and writing (e.g., Devlin, Jamison, Gonnerman, and Matthews, 2006). The close proximity of the mid-fusiform gyrus to the object perception areas has led to proposals that this region is involved in retrieving knowledge of the visual attributes of objects (Chao and Martin, 1999; Kan, Barsalou, Solomon, Minor, and Thompson-Schill, 2003; Thompson-Schill, Aguirre, D'Esposito, and Farah, 1999; Vandenbulcke, Peeters, Fannes, and Vandenberghe, 2006). The parahippocampal component has been suggested to link episodic memory to long-term memory by linking hippocampus to lateral cortex (Levy, Bayley, and Squire, 2004).
26.3.2 Parietal regions The angular gyrus (AG) and a part of the supramarginal gyrus (SMG) just anterior to it were parts of the inferior parietal lobule implicated in the meta-analysis of Binder et al. (2009; Figure 26.2, region 3). Although receiving little input from primary sensory areas, AG is richly connected to other association regions (Cavada and Goldman‐Rakic, 1989a, 1989b; Hyvärinen, 1982; Seltzer and Pandya, 1994). Due to this connectivity, it has been implicated as one of the best candidates for high-level integration and potentially another semantic hub in addition to the ATL (Binder and Desai, 2011; Patterson et al., 2007). In keeping with this notion, damage to AG leads to a host of deficits that reflect problems in integration of complex knowledge such as alexia and agraphia (Dejerine, 1892), anomia (Benson, 1979), and acalculia (Cipolotti, Butterworth, and Denes, 1991), among others. In neuroimaging studies of sentence comprehension, AG’s activation has been shown in late stages when all the bits of information are to be integrated into a coherent sentence (Humphries, Binder, Medler, and Liebenthal, 2007), in connected discourse vs. unrelated sentences (Fletcher, Happé, Frith et al., 1995; Homae, Hashimoto, Nakajima, Miyashita, and Sakai, 2002; Xu, Kemeny, Park, Frattali, and Braun, 2005), in response to semantically anomalous words (Friederici, Rüschemeyer, Hahne, and Fiebach, 2003; Ni, Constable, Mencl et al., 2000), and in the processing of thematic relationships, such as the comparison of ‘lake house’ to its reversed construct ‘house lake’ (right-lateralized effect; Graves, Binder, Desai, Conant, and Seidenberg, 2010). Further attempts to pinpoint the exact function of AG in linguistic operations have shown AG’s sensitivity to event-denoting verbs (Boylan, Trueswell, and Thompson-Schill, 2015), and relational compounds, for example, ‘wood stove’ (Boylan, Trueswell, and Thompson-Schill, 2017). Moreover, AG has been implicated in metaphor processing (Bambini, Gentili, Ricciardi, Bertinetto, and Pietrini, 2011), and in the production of creative metaphoric language (e.g., ‘The lamp is a supernova’) vs. literal expressions (‘The lamp is bright’ (Benedek, Beaty, Jauk et al., 2014). In studies by Lambon Ralph, Jefferies, and their colleagues, dorsal AG, along with pMTG, has been implicated in semantic tasks
demanding executive control, with patterns very similar to the left ventral prefrontal cortex discussed below (see Lambon Ralph et al., 2017 for a review). SMG is often implicated in tasks that tap into the knowledge for actions, and lesions to this region and pMTG lead to ideomotor apraxia (Buxbaum, Kyle, and Menon, 2005; Haaland, Harrington, and Knight, 2000; Jax, Buxbaum, and Moll, 2006; Tranel, Kemmerer, Adolphs, H. Damasio, and A. Damasio, 2003). Consistent with this region's role in the praxis features of object knowledge (e.g., Buxbaum, Kyle, Tang, and Detre, 2006), repetitive TMS to left SMG in a picture naming task selectively impairs the naming of manipulable artifacts (Pobric, Jefferies, and Lambon Ralph, 2010). SMG has a more extensive role in phonological encoding, which will be discussed in a later section.
26.3.3 Frontal regions The meta-analysis of Binder et al. (2009) identified three regions in the medial prefrontal cortex. The first is ventromedial prefrontal cortex (Figure 26.2, region 4), which has ties to emotion and reward processing, and is likely to be involved in processing the emotional aspects of concepts and words (A. Damasio 1994; Kuchinke, Jacobs, Grubich et al., 2005; Phillips, Drevets, Rauch, and Lane, 2003). The second is the dorsomedial prefrontal cortex (Figure 26.2, region 5) anterior to the supplementary motor area (SMA). Because these two regions share a common blood supply, isolated damage to them is rare, making functional separation difficult. Damage to this general region leads to transcortical motor aphasia, which is characterized by reduced speech output unless speech is constrained enough (e.g., counting 1–10). This profile has led to the proposal that this region is involved in self-guided retrieval of semantic information, especially to serve a communicative goal. The third region is posterior cingulate cortex (Figure 26.2, region 6), for which many functions have been proposed, the most likely of which is acting as the interface between semantic retrieval and formation of episodic memories in hippocampal areas (Binder et al., 2009). On the lateral surface of the frontal cortex, Binder et al.’s (2009) meta-analysis revealed left inferior frontal gyrus (LIFG) as an important locus of lexical semantic processing (Figure 26.2, region 7). The functions attributed to this region (sometimes referred to as the ventrolateral prefrontal cortex) are numerous and encompass a variety of semantic, phonological, and syntactic operations (see Novick, Trueswell, and Thompson-Schill, 2005, and Nozari and Thompson-Schill, 2016 for reviews). Importantly, LIFG is one of the regions most associated with task difficulty. Along with other regions such as the anterior cingulate cortex (ACC), it is often activated when selection demands are high. This could be because the correct response is weaker than a prepotent but incorrect response (e.g., naming the ink color ‘blue’ while ignoring the written word ‘red’ in incongruent Stroop trials), or because several responses are all equally probable (e.g., generating a verb for a noun that is not strongly associated with a verb, e.g., ‘cat’: Play? Purr? Meow?; Thompson-Schill, D’Esposito, Aguirre, and Farah, 1997; Thompson-Schill, Swick, Farah et al., 1998). In keeping with this, picture naming
546 Nazbanou Nozari under circumstances of increased competition between lexical semantic alternatives consistently activates this region (Kan and Thompson-Schill, 2004; Schnur, Schwartz, Kimberg et al., 2009). TMS studies suggest that this region’s function, which has often been called “conflict resolution,” is more prominent for lexical semantic decisions in the anterior part of LIFG. Some have argued for additional regions with a similar function in temporoparietal regions (pMTG and dorsal AG; Davey, Cornelissen, Thompson et al., 2015; Hoffman, Jefferies, and Ralph, 2010; Whitney et al., 2011a, 2011b; see Noonan, Jefferies, Visser, and Lambon Ralph, 2013, for a review). Posterior LIFG, on the other hand, has often been implicated in tasks that require manipulation of phonological information, such as syllabification and sequencing (Devlin, Matthews, and Rushworth, 2003; Gough, Nobre, and Devlin, 2005; see also Clos, Amunts, Laird, Fox, and Eickhoff, 2013, for a meta-analytic connectivity-based parcellation of LIFG). More fine-grained functional organizations on multiple axes have also been proposed for the lateral frontal cortex (e.g., Bahlmann, Blumenfeld, and D’Esposito, 2015). Another account of the role of LIFG is that of strengthening associations (R. Martin and Cheng, 2006; Wagner, Paré-Blagoev, Clark, and Poldrack, 2001). In keeping with this view, a TMS study over this region resulted in the disruption of performance when participants had to match items with weak associations (e.g., ‘salt’/’grain’), but performance was unaffected for items with strong associations (e.g., ‘salt’/’pepper’; Whitney et al., 2011a). In a recent eye-tracking study, we showed that a group of individuals with anterior lesions, including LIFG, were more impaired in using both semantic and phonological cues in locating a referent during sentence comprehension, compared to individuals with posterior lesions (Nozari, Mirman, and Thompson-Schill, 2016). We have discussed the results within the framework of a drift diffusion model, in which LIFG’s role has been proposed as boosting the rate of evidence accumulation for establishing an association. In the context of selection among activated alternatives, this function is equivalent to conflict resolution. Connectivity measures also support the idea of LIFG implementing control over semantic selection. For example, damage to the uncinated fasciculus, which connects the pars orbitalis of LIFG to ATL, has been shown to be correlated with semantic control deficits (Harvey, Wei, Ellmore, Hamilton, and Schnur, 2013). Harvey and Schnur (2015) also found that damage to the inferior fronto-occipital fasciculus, which connects LIFG with the posterior temporal lobe, was related to semantic interference during picture naming. To summarize, the lexical semantic network can be roughly divided into two parts: regions that store long-term knowledge, and those that are involved in initiating and controlling the retrieval of that knowledge. By most accounts, temporal regions have the former function, with potential subregions for different types of knowledge and varying degrees of integration (Binder, 2017). In contrast, superior and medial frontal structures appear to have a role in motivation and task initiation, and LIFG seems to be involved in strengthening of associations and helping with selection among competing alternatives. The role of inferior parietal regions is less agreed upon, but the evidence points to an integrative function.
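To unpack the drift diffusion framing mentioned above, the sketch below simulates noisy evidence accumulation toward one of two response boundaries; raising the drift rate, the contribution attributed to LIFG in that account, yields faster and more reliable selection of the target over a competitor. The parameter values are arbitrary and purely illustrative.

```python
import random

def diffusion_trial(drift, boundary=1.0, noise=1.0, dt=0.01, max_steps=10_000, rng=None):
    """Accumulate noisy evidence for 'target' (positive) vs. 'competitor' (negative).
    Returns (choice, decision_time). Higher drift -> faster, more accurate selection."""
    rng = rng or random.Random()
    evidence, t = 0.0, 0.0
    for _ in range(max_steps):
        evidence += drift * dt + rng.gauss(0, noise) * dt ** 0.5
        t += dt
        if evidence >= boundary:
            return "target", t
        if evidence <= -boundary:
            return "competitor", t
    return "no decision", t

rng = random.Random(0)
for drift in (0.5, 2.0):   # weak vs. boosted rate of evidence accumulation
    trials = [diffusion_trial(drift, rng=rng) for _ in range(500)]
    accuracy = sum(choice == "target" for choice, _ in trials) / len(trials)
    mean_rt = sum(t for _, t in trials) / len(trials)
    print(f"drift={drift}: accuracy={accuracy:.2f}, mean decision time={mean_rt:.2f}")
```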
26.4 Phonological encoding The term “phonological code” has sometimes been used to refer exclusively to long- term phonological representations of known words, specifically excluding pronounceable nonword strings like BILF (e.g., Indefrey and Levelt, 2004). Others, however, have included such strings when discussing neural correlates of phonological encoding (e.g., Acheson, Hamidi, Binder, and Postle, 2010; Yue, Martin, Hamilton, and Rose, 2019). Since similar neural regions have been implicated in processing of both words and pronounceable nonwords, I will not emphasize this distinction, and instead define a phonological code as a representation containing an ordered sequence of phonemes to be translated into articulatory gestures, without the representation itself being rich in phonetic details (Oppenheim and Dell, 2008). In the case of real words, such representations naturally link the word representations to sounds. Binder (2015) summarizes the findings of 14 neuroimaging studies that tap into phonological processing with controls for semantic, articulatory, and auditory processing. Manipulations include comparison of pseudoword vs. word reading, naming pictures with high-vs. low- frequency names, lexical decision with phonologically related and unrelated primes, a variety of rhyme matching tasks with words and pseudowords, and silent rehearsal of words, pseudowords, or pseudosentences (Binder, 2015; Appendix e-1). Figure 26.3A shows the activation loci from these studies. Figure 26.3B shows the result of a lesion-symptom mapping study in 40 individuals with left hemisphere stroke. The map indicates the correlation between lesioned voxels and scores in a silent visual rhyme judgment task (snow/blow/plow; Pillay, Stengel, Humphries, Book, and Binder, 2014). Finally, Figure 26.3C shows sites where direct cortical stimulation in patients undergoing brain surgery elicited phonological errors during reading without impairing comprehension (Anderson, Gilmore, Roper et al., 1999; Roux, Durand, Jucla et al., 2012). Examination of these three figures shows a striking convergence of the results of neuroimaging, lesion, and stimulation studies of phonological processing in pSTG and SMG, with pMTG also implicated in the neuroimaging and stimulation studies but not in the lesion study, suggesting that it may support phonological encoding but not be necessary for it. The reader may have noticed that the general region implicated by these studies is often referred to as Wernicke’s area. It is important to note that damage to this region does not cause Wernicke’s aphasia, but a different syndrome called conduction aphasia, which unlike Wernicke’s aphasia does not impact comprehension. Instead, phonological production abilities are clearly impaired (Axer, A. Keyserlingk, Berks, and D. Keyserlingk, 2001; H. Damasio and A. Damasio, 1980; Fridriksson, Kjartansson, Morgan et al., 2010; see Buchsbaum, Baldo, Okada et al., 2011, for a review). Similar symptoms can be evoked in neurotypical speakers by cortical stimulation of the same region (Anderson et al., 1999; Corina, Loudermilk, Detwiler et al., 2010; Hamberger, Miozzo, Schevon et al., 2016; Quigg and Fountain, 1999). Moreover, cortical degeneration of this
Figure 26.3 Neural correlates of phonological encoding. (A) A summary of 14 functional neuroimaging studies examining phonological encoding. (B) Lesion sites in 40 left-hemisphere stroke survivors with selective phonological impairment. (C) Locations where cortical stimulation in 14 patients led to phonological errors during reading. See text for descriptions and references. Reproduced from Binder (2015).
region leads to the logopenic variant of primary progressive aphasia, the hallmark of which is phonological paraphasias and impaired verbal short-term memory, anomia, and various degrees of impaired comprehension of sentences (but not single words; Croot Karen, Ballard Kirrie, Leyton Cristian E., and Hodges John R., 2012; Leyton, Ballard, Piguet, and Hodges, 2014; Rohrer, Ridgway, Crutch et al., 2010; see Henry and Gorno-Tempini, 2010, for a review). There is now wide agreement that pSTG stores phonological representations. However, researchers disagree on whether the same representations are activated during phonological working memory tasks (Acheson et al., 2010), or whether a different part of cortex acts as a “phonological buffer” (Baddeley and Hitch, 1974), that is, a region that temporarily keeps phonological representations activated while a task is being performed. In support of the latter view, lesion overlap studies have localized phonological short-term memory to the left SMG (e.g., Paulesu, Shallice, Danelli et al., 2017). Also, a recent fMRI study using multivariate pattern analysis showed that stimuli could be decoded from SMG, but not STG, during the delay period in a memory task. Furthermore, a functional connectivity analysis in the same study demonstrated that the connection between the left temporal and parietal regions became stronger as memory load increased, suggesting a greater collaboration between the storage and buffer regions in temporal and parietal cortices, respectively (Yue et al., 2019). I must note that the inferior frontal cortex has also been named as a potential region involved in phonological working memory. I will return to the role of this region and its relevance to speech production in the next section. One neural region deserves a special mention in this section. Known as area Spt (Sylvian parietal temporal) in the dual-stream models of Hickok and Poeppel (2007), this region has been identified as carrying out the sensory-motor translation of speech. Others, however, disagree with this claim, and propose that the auditory-motor interface involves a much larger portion of the pSTG (Niziolek and Guenther, 2013; Tourville, Reilly, and Guenther, 2008). This debate is difficult to settle, because area Spt
The neural basis of word production 549 is often defined functionally, as a region that exhibits both auditory and motor response properties, albeit within an anatomically constrained area. It is known to have considerable variability across individuals (despite its relative stability within individuals), making it very difficult to localize it in standardized space. Nevertheless, the findings of Guenther and colleagues, especially the bilateral nature of translational regions, may be important in expanding the anatomical constraints for the functional search for this region. To summarize, the regions most reliably implicated in the storage and active maintenance of phonological codes are pSTG and SMG, with the former’s role as a store for phonological representations, and the latter’s role as phonological buffer. Lesions to these areas often lead to phonological paraphasias and sometimes difficulty in understanding long sentences (which hinges on keeping phonological codes active until they could be mapped onto semantic representations). In contrast, impaired comprehension of words and short phrases often observed after damage to MTG and the AG, which were strongly implicated in lexical semantic processing, is prominently absent (Damasio, Tranel, Grabowski, Adolphs, and Damasio, 2004; Dronkers et al., 2004; Kertesz et al., 1993; Kertesz et al., 1982; Thothathiri, Kimberg, and Schwartz, 2011). This double dissociation between semantic and phonological impairment is showcased in transcortical sensory vs. conduction aphasia.
26.5 Articulatory-phonetic encoding and motor production Neural correlates of vocalization are better understood than the more abstract parts of language production, due to more extensive opportunities for single- unit recording and focal lesioning in nonhuman primates. These comprise both cortical and subcortical structures. Psycholinguistic models of word production are often sparse in details regarding motor speech processes. Therefore, for this part of the review, I adopt the framework of a computationally sophisticated and neurally explicit model of motor speech control, DIVA (directions into velocities of articulators; Guenther, 1994, 1995; Tourville and Guenther, 2011), and its more recent version GODIVA (gradient order DIVA; Bohland, Bullock, and Guenther, 2010). I will first review the relevant regions in the cerebral cortex, followed by the subcortical structures.
26.5.1 Cerebral cortex Figure 26.4 shows the cortical activity on the inflated cortical surface for 116 participants reading mono-and bi-syllabic utterances aloud (Kearney and Guenther, 2019; see
Figure 26.4 The cortical regions activated during reading aloud of mono-and bi-syllabic utterances compared to passive viewing of letters, plotted on inflated surfaces. N = 116. Upper and lower panels show lateral and medial surfaces, respectively. Images on the left and right show left and right hemispheres, respectively. aINS, anterior insula; aSTG, anterior superior temporal gyrus; CMA, cingulate motor area; HG, Heschl’s gyrus; IFo, inferior frontal gyrus pars opercularis; IFr, inferior frontal gyrus pars orbitalis; IFt, inferior frontal gyrus pars triangularis; ITO, inferior temporo-occipital junction; OC, occipital cortex; pMTG, posterior middle temporal gyrus; PoCG, postcentral gyrus; PrCG, precentral gyrus; preSMA, pre-supplementary motor area; pSTG, posterior superior temporal gyrus; SMA, supplementary motor area; SMG, supramarginal gyrus; SPL, superior parietal lobule. Reproduced from Kearney and Guenther (2019).
Guenther, 2016, Figure 2.14 for the same data, along with results of two meta-analyses of similar data from Brown, Ingham, Ingham, Laird, and Fox, 2005; Turkeltaub, Eden, Jones, and Zeffiro, 2002, superimposed). The first striking finding is that, unlike the higher cortical functions related to language production reviewed in previous sections, which are largely left-lateralized, articulatory processes (many of them shared between linguistic and non-linguistic vocalization) are largely bilateral. The relevant cortical areas include both medial and lateral surfaces of the frontal and prefrontal cortex, a large portion of the STG, and parts of the parietal cortex including the postcentral gyrus, and to a lesser extent the superior parietal lobule. I will present a brief overview of the function of these regions in this and the next section. For a more extensive review of these regions and the neural pathways involved in speech motor control, I refer the reader to Guenther (2016).
The neural basis of word production 551 Not surprisingly, some of the most important cortical regions for speech articulation are motor areas, including the primary motor and premotor cortices in the precentral gyrus, supplementary and pre-supplementary motor areas (SMA and preSMA), and motor cingulate cortex. Meta-analyses of somatotopic studies of speech articulators suggest the following dorsoventral ordering in the precentral gyrus: larynx, lips, jaws, tongue, and throat, although this ordering is rough, with multiple representations for each articulator and substantial overlap between the regions for different articulators (Takai, Brown, and Liotti, 2010; see also ECoG studies for additional evidence; Bouchard, Mesgarani, Johnson, and Chang, 2013; Farrell, Burbank, Lettich, and Ojemann, 2007). Unilateral damage to the precentral gyrus usually causes only minor disruptions in face and mouth movements, likely due to the largely bilateral connections in this region (Penfield and Roberts, 1959). Bilateral damage to the precentral gyrus in humans is rare, and if found is often accompanied by extensive lesions beyond this area, which makes a neuropsychological interpretation of this region’s function difficult. SMAs are also often activated during the production of even simple syllables. Unilateral damage to these areas is followed by near-total recovery of speech within months (Laplane, Talairach, Meininger, Bancaud, and Orgogozo, 1977). The hallmark of damage to these areas is transient mutism, which is often specific to self-initiated speech, while constrained or automatic speech (e.g., repeating words, reciting a learned sequence such as counting 1–10) could remain intact. This is part of a syndrome called transcortical motor aphasia (alluded to earlier when discussing the simultaneous damage of SMA and the adjacent dorsomedial prefrontal cortex), which may also entail problems such as involuntary vocalization, and anomalies in the rate, prosody, and fluency of speech may also be present (Freedman, Alexander, and Naeser, 1984). To best understand the difference between the functions of lateral and medial motor areas, recall the content-frame separation discussed at the beginning of this chapter. Generally speaking, the evidence points to a role of the lateral frontal cortex (specifically ventral premotor cortex, vPMC, and ventral primary motor cortex, vMC, both in the precentral gyrus, and left posterior inferior frontal sulcus, pIFS) in representing the content of speech, while the medial surface of the frontal cortex (SMA and preSMA) represents frames. GODIVA proposes the content-frame separation in the context of two related loops: a planning loop, the main job of which is to temporarily store (i.e., buffer) the utterance to be produced, and a motor loop, which generates the actual motor commands for production (Figure 26.5). The planning loop consists of the preSMA, which contains the abstract sequential frame and its corresponding counterpart on the lateral surface, that is, left pIFS, which buffers the phonological content. The motor loop comprises the SMA, which generates the abstract initiation map and its counterpart on the lateral surface, left vPMC, which contains the speech sound map (i.e., nodes whose activation leads to the read out of the motor programs). A combination of signals from SMA and vPMC is sent to the vMC, which contains the actual motor gestures. 
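A rough sketch of this division of labor, under the simplifying assumptions that a frame is just a list of syllable slots and that content is a list of segment tuples, is given below. The chunk inventory and the treatment of 'blue' are illustrative stand-ins rather than GODIVA's actual implementation.

```python
# Toy content-frame separation in the spirit of GODIVA: the frame specifies
# syllable slots; the content supplies segments; readout uses a chunked motor
# program when one has been learned, and assembles segments otherwise.
LEARNED_MOTOR_CHUNKS = {("b", "l", "u"): "motor_program:/blu/"}   # well-practiced syllable

def plan_utterance(syllables):
    """Planning loop: buffer an abstract frame (slot structure) and the content."""
    frame = [f"syllable_{i}:{len(seg)}_segments" for i, seg in enumerate(syllables, 1)]
    content = [tuple(seg) for seg in syllables]
    return frame, content

def execute(frame, content):
    """Motor loop: map each buffered syllable onto motor commands, slot by slot."""
    commands = []
    for slot, segments in zip(frame, content):
        assert slot.endswith(f"{len(segments)}_segments")   # frame and content agree
        if segments in LEARNED_MOTOR_CHUNKS:                 # chunked, well-practiced route
            commands.append(LEARNED_MOTOR_CHUNKS[segments])
        else:                                                # novel sequence: segment by segment
            commands.extend(f"motor_gesture:/{s}/" for s in segments)
    return commands

# 'blue' as one practiced syllable vs. a novel syllable 'bli'
frame, content = plan_utterance([["b", "l", "u"], ["b", "l", "i"]])
print(execute(frame, content))
```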
The representations shown in Figure 26.5 are for the word ‘blue’ when it has already been learned and practiced, hence the “chunked” representations such as the consonant cluster /bl/ at the syllable level in the pIFS, and the holistic /blu/ at the level of speech
sound map in the vPMC. These chunked representations do not exist earlier in development when ‘blue’ is a novel sequence. Instead, the individual segments /b/, /l/, and /u/ must be assembled at multiple levels of the system, which puts novel sequences at a disadvantage compared to well-practiced ones. In addition, representations of well-practiced sequences take advantage of subcortical projections via the basal ganglia and cerebellum, which further facilitates quick mapping of chunked representations. The reader may have noticed that the function attributed to the pIFS by GODIVA, that is, phonological buffering, is similar to what I discussed in the previous section as having been attributed by some to the SMG (e.g., Yue et al., 2019). When comparing the production of complex syllable sequences (three unique syllables) with simple syllable sequences (same syllable repeated three times), Bohland and Guenther (2006) found that, in addition to vPMC and SMA, greater activation was observed in both IFS and SMG (as well as preSMA and anterior insula), regions that have also been implicated in a large meta-analysis of over 100 neuroimaging studies of working memory (Rottschy, Langner, Dogan et al., 2012). Interestingly, in that meta-analysis the only locus that had been preferentially activated in verbal over non-verbal working memory tasks was IFS,
Figure 26.5 Schematic of the planning and motor loops in speech production. The left (blue in the colored version) and right (yellow in the colored version) panels show regions involved in frame and content processing, respectively. Chunked representations in each region are shown for a learned word (“blue”). For a novel syllable (not shown), representations in all boxes would consist of individual segments (e.g., three separate segments for /b/, /l/, and /u/). Solid arrows show cortico-cortical projections. Dotted arrows show connections via basal ganglia. Dashed arrows show connections via cerebellum. G, gestural node; I, initiation map; pIFS, posterior inferior frontal sulcus; preSMA, pre-supplementary motor area; S, syllabic structure node; SMA, supplementary motor area; vMC, ventral primary motor cortex; vPMC, ventral premotor cortex.
leading Guenther (2016) to conclude that this region may be involved in articulatory rehearsal (see Curtis and D’Esposito, 2003, for a different perspective). Thus, while SMG and IFS could both be parts of Baddeley and Hitch’s (1974) phonological loop, SMG may act as the “phonological store,” whereas IFS could implement the “articulatory process.” Together, they help maintain verbal information in working memory. While such a division of labor is speculative, it aligns well with the greater proximity of SMG to the auditory representations and IFS to motor representations. IFG is also implicated in speech production, and unlike some of the other cortical regions, its activation is left-lateralized (Ghosh, Tourville, and Guenther, 2008). As discussed in the earlier sections, anterior LIFG has been implicated in controlled semantic retrieval, while posterior LIFG has been tied to lower-level processes. The pSTG (which, as discussed in the previous section, is presumed to contain phonological representations) is connected to LIFG via the long segment of the arcuate fasciculus, and damage to this tract, or intracranial stimulation of it, results in phonemic paraphasias during word production (Berthier, Lambon Ralph, Pujol, and Green, 2012; Duffau, Moritz-Gasser, and Mandonnet, 2014; Schwartz, Faseyitan, Kim, and Coslett, 2012). Damage to LIFG itself causes apraxia of speech, a condition that impairs motor programming of speech sequences without affecting the articulators themselves (Graff-Radford, Jones, Strand et al., 2014; Hillis, Work, Barker et al., 2004; Richardson, Fillmore, Rorden, LaPointe, and Fridriksson, 2012). Speech apraxia was previously attributed to damage in the insula (Baldo, Wilkins, Ogar, Willock, and Dronkers, 2011; Dronkers, 1996), but as the insula is a part of the paralimbic system (a system that integrates the functions of the neocortex with the motivational/emotional functions of the limbic system), some have suggested that the role of the insula in speech production is better aligned with motivational factors. In keeping with this interpretation, the insula seems to be involved in a large variety of non-linguistic functions with little in common in terms of cognitive processes. A motivational role has also been proposed for motor cingulate cortex, part of the ACC, which also belongs to the paralimbic system. Bilateral damage to the ACC leads to akinetic mutism, a condition characterized by an absence of motivation to produce speech, although when speech is occasionally produced, it is intact in both meaning and form (Rosenbeck, 2004). In summary, various areas in the medial and lateral frontal and prefrontal regions contribute to converting abstract phonology into the sequences of motor gestures that make up speech. In addition to these, some temporoparietal regions have been implicated during production (Figure 26.4), but these are less likely to be directly involved in the act of production. The activity in the STG, including primary auditory cortex located in Heschl’s gyrus, as well as higher-order auditory cortical areas of anterior and posterior STG (aSTG and pSTG), is mostly due to hearing one’s own voice. However, magnetoencephalography and fMRI studies have also found activation of these regions during covert speech (Okada and Hickok, 2006), which is compatible with their involvement in the prediction of the sensory consequences of actions.
This function, along with the role of the postcentral gyrus in the parietal lobe, will be discussed in the section on monitoring.
26.5.2 Subcortical structures
Several subcortical structures have also been implicated in the motor production of speech (see Guenther, 2016, for a review). Most prominent is the combination of the basal ganglia and the thalamic nuclei that are involved in linking the content-frame representations discussed above. Empirical evidence suggests a greater involvement of the caudate nucleus in the planning loop. Electrical stimulation of this nucleus or the anterior thalamic nuclei causes speech that cannot be inhibited (see Crosson, 1992 for a review). According to GODIVA, the basal ganglia in the cortico-basal motor loop, including the putamen, in collaboration with the cortico-cerebellar loop, help develop chunked representations of practiced sequences in pIFS, preSMA, and vPMC, as explained earlier in this section. In line with this proposal, overt production of novel compared to practiced syllables leads to greater activation in the pIFS, preSMA, and vPMC as well as the anterior insula and SMG (Segawa, Tourville, Beal, and Guenther, 2015), because novel syllables have not yet formed more condensed representations (chunks) in these areas. As such, subcortical structures play a critical role in making production more efficient as a function of learning. In addition to its role in facilitating the mapping of chunked speech into gestures, the cerebellum is also involved in online monitoring of speech production using sensory information. I will discuss this briefly in the next section.
26.6 Monitoring of word production
The problem of monitoring in language production closely mirrors the problem of model scope, alluded to at the beginning of this chapter. As a model of motor speech production, GODIVA also contains detailed mechanisms for the monitoring and control of speech motor operations. But naturally, such mechanisms are restricted to the scope of the model, which addresses lower-level production processes. As noted earlier, this is only one part of the puzzle of language production. Attempts have been made to extend the general framework of GODIVA to higher processes involved in language production, but such attempts either do not extend beyond the level of phonological representations (e.g., Hickok, 2012), or when they claim to extend to higher levels, fail to meet the basic requirements of such a framework at higher levels (e.g., Pickering and Garrod, 2013; see Nozari, under review, for arguments). I will start by discussing GODIVA, as it provides the most detailed predictions regarding the neural correlates of monitoring at the motor level. I will then briefly touch upon what is still missing from the monitoring literature. According to GODIVA, production starts by activating a speech sound map in the left vPMC, which activates a stored motor program in vMC. At the same time, two more representations are activated through a forward model (i.e., a model that anticipates the
perceptual consequences of an action): an auditory target and a somatosensory target. The auditory target is part of the auditory state map, which contains a talker-normalized representation related to the formant frequencies of the speech signal, and is localized to pSTG including the planum temporale. The projections from vPMC to pSTG serve the purpose of canceling out the auditory information; thus, if the incoming auditory signal falls within the auditory target region, its excitatory effects will be suppressed. Such auditory suppression of self-produced speech is well-documented and contingent on intact cerebellar function (Christoffels, Formisano, and Schiller, 2007; Franken, Hagoort, and Acheson, 2015; Knolle, Schröger, Baess, and Kotz, 2011). If, on the other hand, the incoming auditory information is outside of the target region, a reduced degree of speech-induced suppression is observed (Heinks-Maldonado, Nagarajan, and Houde, 2006), which generates an error signal. Note that the error signal in the above model is in the auditory space. In order for it to be used to adjust motor movements, it must be transformed into motor coordinates. As discussed earlier, Hickok and colleagues propose that this transformation is carried out by the area Spt, an area at the border between the parietal operculum and planum temporale in the left hemisphere (Buchsbaum, Hickok, and Humphries, 2001; Hickok, Buchsbaum, Humphries, and Muftuler, 2003), whereas Guenther and colleagues propose that the auditory-motor interface involves a much larger portion of the pSTG bilaterally. These areas accomplish the transformation by sending projections to right vPMC and then to vMC through both cortico-cortical projections and cortico-cerebellar loops. Predictions of the model have been confirmed by studies in which the syllables produced by speakers undergo real-time perturbation of formant frequencies (e.g., Tourville et al., 2008; Niziolek and Guenther, 2013). In addition to bilateral pSTG, these studies have implicated right IFG as part of the auditory feedback loop. Importantly, adjustments to speech have been subconscious, with participants often unaware of the artificial speech modification or any attempts at correcting it. This independence from conscious awareness is critical for a monitoring mechanism that must continuously assess and regulate production without interfering with its primary processes (Nozari, under review; Nozari, Martin, and McCloskey, 2019). Activation of the speech sound map also activates a somatosensory target as part of the somatosensory state map in the ventral postcentral gyrus and SMG. The workings of this feedback loop are generally similar to the auditory feedback loop. Just as auditory feedback is suppressed during self-produced speech, tactile sensation is reduced by one’s own movement, a function that is again attributed to the cerebellum (Blakemore, Frith, and Wolpert, 2001; Blakemore, Wolpert, and Frith, 1998). Thus, by the logic explained for the auditory feedback loop, a mismatch between the predicted and actual somatosensory representations leads to an error signal, which is transmitted to right vPMC for transformation to corrective movements that are then sent to bilateral vMC. GODIVA thus elegantly explains both the developmental trajectory of learning to imitate the phonology of one’s language and the subtle adjustments made to learned speech after auditory or somatosensory perturbation.
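Reduced to its logic, the auditory feedback loop just described amounts to comparing a predicted sensory target (with a tolerance region) against incoming feedback, suppressing matches and passing mismatches on as error signals. The sketch below is a minimal, purely illustrative rendering of that logic under invented numbers and function names; it is not the model's actual control law.

```python
# Illustrative sketch only: forward-model comparison of predicted and heard
# auditory feedback. Feedback inside the target region is suppressed; feedback
# outside it yields an error signal to be mapped back into motor coordinates.

def auditory_feedback_step(predicted_formant, tolerance, heard_formant):
    """Return (suppressed, error) for one feedback comparison."""
    error = heard_formant - predicted_formant
    if abs(error) <= tolerance:
        return True, 0.0          # within the target region: speech-induced suppression
    return False, error           # outside the region: error drives corrective commands

# Unperturbed production: feedback matches the prediction and is suppressed.
print(auditory_feedback_step(500.0, 50.0, 510.0))   # (True, 0.0)
# Perturbed formant (as in real-time perturbation studies): an error signal emerges.
print(auditory_feedback_step(500.0, 50.0, 620.0))   # (False, 120.0)
```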
Note that this mechanism requires that
perceptual feedback is available, that is, the action has been produced. This requirement is reasonable for the first stages of language learning, where production is rarely on target and must be heavily modified by overt feedback. Later in life, when motor production is mastered, speakers do not rely nearly as much on overt feedback to detect problems in their speech. Evidence for this claim comes from many instances of covert error detection (i.e., detecting errors before they become overt), or of intercepting errors early in production and applying fast repairs (Hartsuiker and Kolk, 2001; Levelt, 1983). In this vein, Hickok (2012) proposed a mechanism similar to GODIVA, in which persistent activation of perceptual representations during self-produced speech generates an error signal. But unlike GODIVA, the anticipated activation of perceptual representations does not need to be compared against the actual perceptual input. Instead, it is directly suppressed through the motor program, thus eliminating the need for overt feedback. Even if such direct suppression is possible, the model’s basic mechanism still hinges on having two sets of representations (motor and perceptual) corresponding to the same utterance. While this is a reasonable assumption for representations at or below the level of phonemes, I do not know of any evidence supporting such a dichotomy at the higher levels of language production, for example at the level of lexical semantic representations. In fact, as reviewed in earlier sections, neural evidence suggests that such a duplication is unlikely to exist at those levels. This problem, which I have called the problem of duplicate representations (Nozari, under review), makes the extension of monitoring mechanisms that rely on sensory-motor comparisons to higher levels of language production infeasible. Two solutions remain: either language production is only monitored at a late stage, that is, after phonological encoding, or it is monitored at an earlier stage but with a different mechanism. The differences observed in the detection and repairs of semantic vs. phonological errors (e.g., Nozari, Dell, and Schwartz, 2011; Schuchard, Middleton, and Schwartz, 2017; see Nozari, under review, for details) make the first possibility unlikely. This has led to proposals for alternative monitoring mechanisms at the higher levels of language production. One such mechanism is conflict-based monitoring (Botvinick, Braver, Barch, Carter, and Cohen, 2001), in which the close activation of multiple representations (i.e., high conflict) signals the higher likelihood of an error (regardless of whether an error is actually committed or not), and leads to the greater recruitment of control resources to resolve this conflict. This mechanism is “layer-specific,” meaning that conflict is computed between different representations (e.g., lexical representations of ‘cat’ and ‘dog’; Hanley, Cortis, Budd, and Nozari, 2016; Nozari et al., 2011) within the same layer. This is fundamentally different from the models discussed above, in which computations depend critically on activation at different levels for the same item (e.g., motor vs. perceptual representations of ‘cat’). The conflict-based account thus eliminates the problem of duplicate representations at the higher levels of the system.
To this is added a domain-general component, which reads out the conflict from specific parts of the system and uses this information to regulate top-down control over the parts from which conflict has arisen.
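A minimal way to picture the conflict signal is as a function of the co-activation of incompatible representations within a layer; in the spirit of Botvinick et al. (2001), conflict is often quantified as the sum of pairwise products of competitors' activations. The sketch below uses invented activation values for illustration only.

```python
# Illustrative sketch only: conflict within one layer computed as the sum of
# pairwise activation products over co-active, mutually incompatible units.

def layer_conflict(activations):
    """Sum pairwise activation products over competing units in one layer."""
    units = list(activations.values())
    return sum(units[i] * units[j]
               for i in range(len(units)) for j in range(i + 1, len(units)))

# Naming a picture of a cat with a close semantic competitor active:
high_conflict = layer_conflict({"cat": 0.8, "dog": 0.6})      # 0.48
# The same target with no strong competitor:
low_conflict = layer_conflict({"cat": 0.8, "pencil": 0.05})   # 0.04
print(high_conflict, low_conflict)  # more conflict -> more top-down control recruited
```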
The layer-specificity of the conflict-based account predicts the engagement of the same regions that are involved in the primary production processes in monitoring processes. Neural correlates of the domain-general part of the conflict-based monitor are under debate, but medial prefrontal cortex (especially ACC and preSMA) and lateral prefrontal cortex (especially the LIFG) have been the main candidates. Due to the methodological difficulties involved in eliciting errors from neurotypical adult speakers, very few neuroimaging studies have to date investigated the neural correlates of error detection in natural production tasks. A tongue-twister study by Gauvin, De Baene, Brass, and Hartsuiker (2016) implicated preSMA, dorsal ACC, anterior insula, and inferior frontal gyrus. Interestingly, that study failed to find a reliable contribution of the auditory or perceptual areas to error detection. But neural correlates of the conflict-based model can be assessed in another way: note that the scope of conflict-based monitoring extends beyond error detection to the online regulation of production, potentially on every production attempt (Nozari and Hepner, 2018, 2019). The need for regulation increases with increased conflict. According to conflict-based monitoring, most errors are simply a subset of this situation. Thus one could test the predictions of the conflict-based monitor by looking at the neural correlates of word production in situations of high conflict, for example, in the presence of a semantically related competitor. In keeping with the predictions of the conflict-based monitor, middle and posterior MTG, ACC, and LIFG are among the regions most frequently implicated in such studies (e.g., de Zubicaray, McMahon, Eastburn, and Pringle, 2006; de Zubicaray, McMahon, and Howard, 2015; de Zubicaray et al., 2001; Kan and Thompson-Schill, 2004; Schnur, Schwartz, Brecher, and Hodgson, 2006; Schnur et al., 2009). To summarize, just like the primary production processes themselves, the monitoring processes for motor production are much better understood than those for higher-level production processes, but given the very different nature of representations in the higher and lower levels of the production system, the most likely possibility is that more than one monitoring mechanism exists for the regulation of the production system (Nozari, under review).
26.7 Summary and conclusion
Given that no model of language production to date covers the entire process of mapping concepts to articulatory gestures, I used theoretical insights from models with different scopes, in the hope of painting as complete a picture of the neural basis of word production as possible. Despite this heterogeneity, a neural model emerged within which the flow of information roughly tracks the information flow in cognitive models. Conceptualization begins by connecting distributed semantic features in conjunctive zones in the temporal lobe. The mapping of concepts to words and phonology is represented by the flow of information from anterior to middle and finally posterior parts of the MTG and STG. The adjacent SMG helps with the buffering of phonological
information until it is time for production. Phonological codes are then mapped onto corresponding representations in the frontal cortex. This region has a complex hierarchy of planning and execution loops, and contains both abstract frames (in SMA and preSMA) and content (in the vPMC and IFS), with the final-stage articulatory gestures stored in the vMC. This architecture is very similar to that proposed by Hickok and Poeppel (2007). This process is supplemented by several other systems whose roles are less well understood. Medial frontal structures most likely serve motivational and evaluative functions. Closely related to these is the LIFG, which seems to play a role in resolving the conflict between competing alternatives and strengthening associations. Medial temporal regions may be important for connecting episodic memory in the hippocampus to long-term memory in cortical regions, and thus play a critical role in learning. The AG plays some kind of integrative function for events with relational or complex representations, although the exact nature of this function is not well understood. And last, but not least, subcortical regions such as the basal ganglia and cerebellum seem to play a critical role in learning, both by creating more efficient representations and faster mapping, and, in the case of the cerebellum, by being involved in generating predictions about the consequences of a motor plan. I end by highlighting the fact that we have come to understand a great deal about the neural basis of word production, but there is a long way to go. Many of the functions attributed to the regions described in this chapter are speculative, or are too imprecise for us to claim that we have really understood the role of that region. This is particularly true for subcortical structures, which are hard to evaluate by many routine techniques. Moreover, certain problems such as monitoring and control are far from solved in language production, and relevant data are currently sparse. These areas provide great avenues for future research on the neural basis of word production.
Part III C
INTERFACES AND BOUNDARIES
Chapter 27
The Structure of the Lexical Item and Sentence Meaning Composition
Maria Mercedes Piñango
27.1 Introduction
27.1.1 Preliminaries: the lexical item and the mental lexicon
There is consensus that language use is better understood as lexically-driven; that is, that the processes of language comprehension and production appear as the manifestation of the identification (e.g., retrieval, activation) of lexical items in succession (e.g., Tyler, 1989; Fodor, 1995; MacDonald, Pearlmutter, and Seidenberg, 1994; Goldberg, 1996; see also Kroll, Bice, Botezatu, and Zirnstein, this volume; Mayberry and Wille, this volume). Understood in this way, linguistic composition, either for production or for comprehension purposes, is a kind of co-articulated sequencing of lexical items (akin to the co-articulated sequencing of speech gestures; Liberman, 1993). By the same token, comprehenders experience sentence meaning as the construal of a mental representation that grows in referentiality, and therefore informativity, as the sentence unfolds in real-time. The structure of the linguistic signal (intonation pattern, order of lexical item retrieval, etc.) matters to meaning construal, but it sorely underspecifies it. To fully construe an intended interpretation, the comprehender must make use of, or otherwise infer, a context that supports and completes, as it were, the sentence meaning structure. Such context includes factors as dissimilar as preceding linguistic material, knowledge by the speaker/hearer of the conventionalized
interpretation of sequences of lexical items (e.g., idiomatic expressions) and knowledge that the speaker/hearer has of their interlocutor, and even the physical and emotional environment of the person uttering or hearing the sentence as it is being uttered/understood (e.g., see Schumacher (2017) for discussion of these utterance-level contextual factors). We thus understand utterance comprehension as the result of the interaction of context, which is, by and large, implicit, with sentence meaning composition that results from the concatenation of explicit lexical meanings, that is, meanings that are associated with pronunciations. Such context-lexical meaning interaction appears principled, carefully guided by the rules of composition of the language system itself. How can this be? How can sentence meaning in a situation of apparent lexical underspecification be construed with such predictability (and speed)? By the same token, how can context, which varies significantly across speaker/hearers and across situations, be seen to participate in such a predictable manner in the meaning construal of a sentence? This chapter is about these questions. Its starting point is a particular model of lexical item structure, the full-entry model, and its potential to support intricate language-cognition interactions. Those interactions are made fully explicit through the careful exploration of one phenomenon, the iterative reading that arises from sentences like “Jo sneezed for a full minute,” and the observation that such an interpretation has no overt linguistic (i.e., morphophonological or syntactic) support, raising the question of the source of such a meaning. The chapter is organized as follows. The remainder of Section 27.1 presents the lexical item, the fundamental unit of real-time linguistic composition, the mental lexicon as the mental space that holds it, and the processing implications that are derived from it. As mentioned, this description is grounded in a full-entry approach to lexical representation (e.g., Jackendoff, 1975; Jackendoff, 1997; Piñango, 2019; Jackendoff and Audring, 2020; see also Jackendoff, this volume) according to which lexical meaning is a generalization that results from continued exposure to a conceptual structure through linguistic means. Under full-entry conditions, there is no a priori limit on what, or how much, meaning can be stored, i.e., lexicalized; instead, lexical meaning is taken as the minimal conceptual representation abstracted from those multiple contexts which enables that lexical item to support utterance construal. This is the key distinction that this approach makes, and that the case study that we present here supports. Section 27.2 focuses on the implications of the lexical infrastructure presented for our characterization of linguistic meaning composition and its relation to a full-entry structure. We capitalize on two observations: (1) that the linguistic expression resulting from lexical meanings morpho-syntactically combined appears to underspecify sentence meaning, and (2) that one way of accounting for this apparent “mismatch” between the linguistic expression and what comprehenders ultimately obtain is by having lexical meaning incorporate context through the resolution of variables encoded in its structure. We argue that a full-entry approach to lexical representation represents this view.
It takes linguistic meaning composition as determined by the semantico-conceptual demands of each lexical item participating in the sentence. This, together with meaning
underspecification, places the burden of utterance construal on the resolution of the variables contained in the lexical meanings and the context required to resolve them. Indeed, what will be argued is that for any sentence, context, at least semantic context, amounts to the set of conceptual representations that can potentially satisfy the compositional requirements of the lexical items in the sentence. Satisfaction of these requirements is what relates these conceptual representations to the sentence meaning and in so doing enables the construal of the utterance’s meaning. Altogether then, the goal of this Section is to make precise the compositional implications of the full-entry lexical structure. Section 27.3 presents one long-standing test case where these compositional implications are observed: the composition of “durative” for, as in “The rabbit jumped on the beach for an hour.” In its iterative reading, composition of the for-adverbial exerts greater computational load than the non-iterative counterpart: “The rabbit floated on the beach for an hour.” Evidence shows that the root of this observed cost can best be explained as the real-time search for a partition measure demanded by the meaning of for. Such a partition measure must be obtained from context. In other words, the specifications encoded in the lexical meaning of for determine the aspects of the context that are relevant for the interpretation of the utterance. This section is grounded in a robust, long-standing body of evidence from English and more recent results from English and Japanese supporting the fundamental observation that composition of this lexical item results in processing cost, a cost which can be isolated, and which a full-entry lexical structure naturally captures. Section 27.4 concludes the chapter with a discussion of the lessons that a view of sentence meaning composition that is built on a full-entry lexical structure provides. These are not limited to our understanding of meaning composition. They extend to our understanding of the embedding of the language system in cognition; our understanding of acquisition and of other fundamental dynamics of language such as meaning variation and change.1
1 In the late 90s I embarked on a research program with Edgar Zurif and Ray Jackendoff: the study of non-syntactically supported meaning composition, referred to at the time as “Enriched Composition.” The first results, published more than 20 years ago in Piñango, Zurif, and Jackendoff (1999), focused on the processing implementation of “Aspectual Coercion.” Those early results supported the psychological validity of a combinatorial linguistic meaning system correlated to, yet independent from, morpho-syntactic composition. Subsequently, in collaboration with Ashwini Deo and Yao-Ying Lai (Deo and Piñango, 2011; Piñango and Deo, 2016; Lai et al., 2017), we investigated the lexical basis of this early observation, a line of inquiry that set out to uncover the properties of linguistic meaning architecture and its implications for our understanding of the embedding of language in the rest of cognition. This chapter, which brings together the main insights from this search, is dedicated to these co-authors. I am extremely grateful to one anonymous reviewer, to Ray Jackendoff, and to the editors of this Handbook for most insightful suggestions for revisions of a previous version. All errors remain my own.
27.1.2 Properties of lexical items: the full-entry model
The term full-entry was first introduced in Jackendoff (1975) with the purpose of distinguishing a lexical structure that lists in full all of its properties, even those that are redundant with other lexical items. The prototypical example is the listing of decide and decision: two lexical items that have different syntactic properties, but which are nonetheless morphologically and semantically related and therefore predictable from each other. Instead of deriving one from the other, as a generative model would, and in contrast with an impoverished-entry model, which would list only the non-redundant properties of each and derive the rest, under full-entry, all such redundancy would be made explicit and connected by a lexical redundancy rule: an encoding mechanism that distinguishes, in information-content terms, lexical relations that are predictable from those that are not (Jackendoff, 1975, pp. 641–643). Crucial to the model, information designated in this way as “redundant” does not count as independent information, and therefore does not “inflate” the lexicon. Therein lies the advantage of a full-entry lexical structure: it permits the full listing of all information relevant to the composition of a lexical item, even when redundant, resulting in a lexicon that can contain within the same space lexical meanings that have minimal contextual constraints and also those whose contextual constraints are highly specified. The full-entry structure was originally proposed as a lexical model that competed with other lexical systems in terms of information measure. This focus gave rise to the possibility of thinking about a lexical representation not as the smallest structure possible, but as the smallest structure necessary to account for the patterns of lexical usage observed. This means that the constraint on the lexical meaning, or the lexical item as a whole, emerges from its usage patterns, irrespective of the kinds of information that such patterns bring together. That is, thinking about the structure of the lexical item in terms of necessity permits consideration of lexical representations that, although constrained, allow us to bring together distinct sources of information deemed independently necessary to capture the attested real-time meaning comprehension patterns. From this perspective then, the full-entry lexical structure is not only the unit of language knowledge but also crucially, for present purposes, the unit of language use. We define it as an arbitrary association of units from at least four combinatorial systems: a. a system of principally organized articulatory and acoustic configurations, that is, phonetics-phonology (e.g., Browman and Goldstein, 1992); b. a system of principally organized conceptual structures, that is, semantics-pragmatics (e.g., Jackendoff, 1975, 1997; Pustejovsky, 1991, 1995; see also Jackendoff, this volume); c. a system of word formation rules that specify among other things whether the lexical item has the status of a stem, an affix, a clitic, and so forth, that allow it to participate in further word formation processes; and finally,
d. a syntactic structure system that specifies the subcategorization constraints, which allow it to participate in sentence formation processes (e.g., Ford, Bresnan, and Kaplan, 1982; see also Anderson, 1992 and Grimshaw, 1979 for specific proposals of lexicalized word formation and sentence formation, respectively, which have directly informed the version of the full-entry architecture presented here). This notion of a lexical item is therefore a linguistic abstraction proposed to capture not only native speakers’ shared experience of knowing a “word” but also their shared experience that knowing a “word” is manifested in knowing how to use it during the linguistic act. Figure 27.1 sketches a possible lexical representation for the English begin as in “Sue and Taylor begin the novel.” The meaning structure, presented in conceptual semantics notation (Jackendoff, 1983, 1990; Pinker, 1989), follows Piñango and Deo’s (2016) characterization of the English aspectual class, which includes the meaning for begin, as well as finish, continue, start, and end (see also Piñango, 2019 for further discussion of this lexico-semantic representation). Figure 27.1 presents the minimum of information required to use the English word begin competently (as a native speaker would). It tells us how it is pronounced, its morphological status, and the possible syntactic environments in which it can occur, and for literate individuals, how it is written. In this way, a lexical “packaging” determines the boundaries of what is committed to memory and what minimally must be acquired. So, this full-entry structure of the lexical item is conservative, and constrained by general processing behavior. It is therefore a viable scaffolding for the creation and use of lexical items in a language. From this perspective, knowing a language presupposes a commitment to long-term memory of a significant subset of its lexical items and their
Figure 27.1 Lexical entry for English begin (constitutive reading).
retrieval during production and comprehension in a co-articulated fashion within a working memory space. By the same token, the mental lexicon as the long-term memory space that holds lexical items is not unstructured. Robust, long-standing psycholinguistic evidence indicates that lexical items appear sensitive to the commonalities that they share among them along syntactic (e.g., Bock, 1986; Bock and Griffin, 2000; Bencini and Goldberg, 2000), phonological (e.g., Marslen-Wilson and Warren, 1994; Lahiri and Marslen-Wilson, 1991; Swinney, 1979) and lexico-conceptual parameters (e.g., Swinney, 1979; Blumstein, Milberg, and Shrier, 1982). A full-entry structure is not only conservative given its descriptive focus, but it is also methodologically useful. It demands a precise, and therefore testable, description of what must and must not be listed, and in doing so, it makes possible the examination of language knowledge as embedded in a cognitive infrastructure. It is through this possibility that we get the first clear sense of how lexical meaning and non-lexical meaning (i.e., non-linguistic conceptual structure) may actually be part and parcel of the same system. This possibility is supported by a record going back at least 50 years indicating that lexical meanings appear to be organized not only in terms of co-occurrence of linguistic usage but also in terms of conceptual associations (e.g., Collins and Loftus, 1975; Neely, 1977; Jackendoff, 1975, 2006; Jackendoff and Audring, 2020; see also Jackendoff, this volume). Another, less apparent, way in which lexical meanings may organize themselves in the lexicon is through their formal conceptual properties in the form of semantic classes. Piñango and Deo (2016), for example, show that a specific semantic characterization, namely the requirement that a complement be construable as a structured individual, unifies a set of verbal predicates: the aspectual class including begin, start, continue, finish, and end; furthermore, it distinguishes this class from other verbal predicates, such as the psychological class including enjoy, tolerate, endure, and so forth, which may trigger similar readings but, arguably, through formally different semantic mechanisms. Notably, the linguistic arguments that motivate this semantic class distinction find direct support in their distinct psycholinguistic and neurolinguistic profiles, which would otherwise remain unexplained (Lai, Lacadie, Constable, Deo, and Piñango, 2017).
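As a rough, purely illustrative rendering of the kind of four-way bundle such an entry pairs together (cf. Figure 27.1 for begin), the sketch below uses simplified placeholder fields; it does not reproduce the figure's conceptual-semantics notation, and the specific values are invented for the example.

```python
# Illustrative sketch only: a toy record of a full-entry lexical item bundling
# phonological, morphological, syntactic, and semantic information.

from dataclasses import dataclass

@dataclass
class LexicalItem:
    phonology: str        # articulatory/acoustic form
    morphology: dict      # word-formation status (stem, affix, clitic, ...)
    syntax: dict          # subcategorization constraints
    semantics: dict       # conceptual structure, including open variables

begin = LexicalItem(
    phonology="/bɪˈgɪn/",
    morphology={"category": "V", "status": "stem"},
    syntax={"subcat": ["NP __ NP", "NP __ VP[to-inf]"]},
    semantics={"class": "aspectual",
               "requires": "complement construable as a structured individual"},
)
print(begin.syntax["subcat"])
```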
We end this section by noting that the full-entry view of the lexical item is not only relevant to the mental lexicon but also embodies a Tripartite Parallel Architecture of the language system, the cognitively-grounded framework for all models of language structure and meaning structure whereby: (1) linguistic meaning composition results not only from the combinatorial and generative mechanisms of the morphological, syntactic, and phonetic-phonological systems but also from the combinatorial and generative mechanisms of the meaning system itself; (2) the structure of the grammar, however organized, is nondirectional and constraint-based; and (3) the necessity of a mental lexicon separate from a grammar is treated as an empirical question (e.g., Jackendoff, 1987, 1997, 2007; Jackendoff and Audring, 2020; Bresnan, 2001; Bresnan, Asudeh, Toivonen, and Wechsler, 2015; Anderson, 1992; Pustejovsky, 1995; Avrutin, 1999, 2006; Wiese, 2004; Culicover and Jackendoff, 2005; Hagoort, 2014; Piñango, 2019; see also Jackendoff, this volume).
27.2 Properties of a full-entry-driven processing system
In a full-entry mental lexicon, whatever manipulates lexical items amounts to “the processor.” One fundamental component of lexical processing is retrieval, the idea that to be used for the processes of comprehension and production, lexical items must be selected from the mental lexicon, the long-term memory bank. One consequence of a full-entry lexical structure is that it greatly simplifies the linguistic processing mechanism, and specifically for our purposes, the linguistic meaning composition process. Linguistic processing, the umbrella term for linguistic production and comprehension, simply involves the concatenation of lexical items in a co-articulated fashion. Lexical co-articulation amounts to the satisfaction of the selectional requirements of each lexical item retrieved at each level (phonological, morphological, syntactic, and semantic/pragmatic) in parallel, although with arguably distinct time-course constraints (e.g., Trueswell et al., 1993; Boland, 1997; McElree and Griffith, 1995; Pustejovsky, 1995; Jackendoff, 2007; Piñango, Winnick, Ullah, and Zurif, 2006; Piñango, Zurif, and Jackendoff, 1999). An acceptable linguistic expression is one where all its lexical items’ requirements have been satisfied. In this way, lexical restrictions at all levels of information dictate the immediate “local environment” in which a given lexical item can appear, and thus represent a trigger for retrieval. Such “obligatory” information must be satisfied first even when non-obligatory resolutions are also possible (e.g., Hickok, 1993). From this it follows that lexically specified restrictions are the “substance” of processing prediction. Such substance is the yet-to-be-satisfied lexical requirements of a lexical item that has already been retrieved (e.g., Chow, Smith, Lau, and Phillips, 2016; Piñango, Finn, Lacadie, and Constable, 2016).2 Information that is shown to have predictive value must be lexically encoded. Although retrieval is involved in both production and comprehension, it manifests differently depending on the process. Those manifestations are lexical recognition and lexical recall. Lexical recognition out of the incoming stream allows comprehension to unfold. Its morphophonological serialized nature provides word order “for free,” as it were, as recognition takes place. This supports compositional processes such as prosodic structure, subcategorization satisfaction, and meaning structure building that carry over multiple lexical items at a time. Part and parcel of the recognition process are searches that resolve the selectional restrictions that are posed by lexical items as they are retrieved.3
2 See
Piñango et al. (2016) for an implementation of prediction with a full-entry approach in a lexically-driven system and in connection to composition of long-distance dependencies. 3 Classic examples of satisfaction of selectional restrictions are the determination of the participants within a sentence that can bear semantic roles as licensed by a verbal predicate (e.g., Carlson and Tanenhaus, 1989; MacDonald et al., 1994; Chow et al., 2016) and the detection of a “gap”, a syntactico-semantic object, once a wh-word has been retrieved (e.g., Frazier and Flores d’Arcais, 1989; Swinney, Ford, Frauenfelder, and Bresnan, 1988; Piñango et al., 2016).
By contrast, lexical recall underpins sentence production. Here, the time-course of structure building presumably privileges morphosyntactic and prosodic composition that make possible the linearization of lexico-semantic structure (see Levelt, 2001 and references therein for a specific proposal of a lexically-driven model of language production). As has been shown robustly in the literature, the implementation of these two tasks, recognition and recall, although stable across speakers, is expected to be facilitated and/or hindered by the speaker/hearer’s familiarity of use of the lexical items at play. Familiarity is brought about by frequency of use, a factor that is known to modulate most potential processing cost incurred by searches brought about by lexical demands (e.g., MacDonald et al., 1994; Trueswell and Tanenhaus, 1995; MacDonald, 2013). So, as we can see, with a full-entry lexical structure, processing is guided by lexically encoded restrictions, i.e., knowledge of a grammar, and supported by one fundamental task, retrieval, implemented in two ways, recognition and recall, within a working memory space.4 Now we come back to linguistic meaning composition. Earlier in the chapter, we asked two questions: (1) How can sentence meaning in a situation of apparent lexical underspecification be construed with such speed and predictability? And (2) how can context, which varies significantly across speaker/hearers and situations, be seen to participate in such a predictable manner in the meaning construal of a sentence? The answer to both questions is provided in the full-entry configuration. Following the description above, a lexical meaning amounts to an instruction for what aspects of the surrounding lexical items and other extra-sentential environmental factors are required. In other words, just like syntactic subcategorization constraints, lexical meanings describe precisely the properties of the context that may satisfy their variables. This is, in turn, what amounts to contextual interpretation. What makes a “context” for a given sentence is the set of elements, linguistic and non-linguistic, that can potentially satisfy the requirements of the lexical meanings in the sentence. This degree of specification is what makes possible the speed and precision observed in sentence composition. The burden is placed on the diversity of information that a lexical item is allowed to encode. One implication of this perspective on context is that the contextual conditions may not be as large or diverse as they would appear at first to be. The kinds of elements that can satisfy lexical variables (i.e., the contexts) are constrained, and as we will see in Section 27.3, predictable. This idea is not new. It underlies our general understanding of semantic/thematic roles, for example. Argument structure/thematic role information has an immediate impact on syntactic disambiguation during real-time sentence
4 For early evidence that connects lexical constraint satisfaction with memory resources see Zurif, Swinney, Prather, Wingfield, and Brownell, 1995 and references therein. For evidence that shows the direct dependence of sentence comprehension on timing of lexical retrieval see Burkhardt, Avrutin, Piñango, and Ruigendijck, 2008 and Love, Swinney, Walenski, and Zurif, 2008 and references therein.
comprehension. This is attributed to the lexical encoding of thematic relations that are fundamental to the structure of an eventuality and appear to guide interpretation by helping to “prune” or “inhibit” alternative semantic representations that are associated with contextually inconsistent yet syntactically possible alternatives. Crucially, semantic roles appear themselves as semantic abstractions that maintain a certain level of uniformity while applying to a diversity of referents (e.g., Clifton, Frazier, and Connine, 1984; Carlson and Tanenhaus, 1989; Boland, Tanenhaus, Carlson, and Garnsey, 1989; Trueswell, Tanenhaus, and Garnsey, 1994; Tanenhaus and Trueswell, 1995; McElree and Griffith, 1995; Boland, 1997; see also Rissman and Majid, 2019 for recent discussion about the psychological reality of thematic roles). In sum, psycholinguistic evidence is consistent with a model of sentence meaning composition that is rooted in the lexico-semantic specification of the lexical items in the sentence, and the networks in the lexicon to which those lexical meanings belong. Lexical meaning is thus a structured guide to semantic composition manifested in the form of directed searches through a variety of search-spaces (lexical, situational, affective to the speaker/hearer, etc.). In this way, context for a lexical meaning is all that content that must be found in the linguistic and non-linguistic utterance’s environment without which a semantic expression is not acceptable. The challenge for those seeking to understand the specifics of this process of variable satisfaction is to identify the variables contained in the lexical meanings, the conceptual restrictions on the satisfaction of those variables, and the processing implications of the satisfaction process as comprehension unfolds, that is, the complexity parameters for the searches, possibly as a function of search space. What follows represents an illustration of such an enterprise through the exploration of the linguistic and psycholinguistic insights that have emerged from the examination of one phenomenon: the iterative reading in sentences with for-adverbial modification as in “Sam clapped for two minutes straight” or “Lou swam for a year in the local pool.”
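The picture of composition as lexically guided search can be made concrete with a toy example: a lexical entry that encodes open variables (here, thematic-role restrictions), and a "context" that counts as such only insofar as it can supply fillers for them. Everything in the sketch below, entries and restrictions alike, is an invented placeholder rather than the chapter's formal machinery.

```python
# Illustrative sketch only: a lexical meaning as an instruction specifying what
# must be found in its environment; "context" is whatever can satisfy those
# open requirements.

LEXICON = {
    "devour": {"requires": {"agent": "animate", "theme": "edible"}},
}

def satisfy(requirements, candidates):
    """For each lexically encoded variable, search the available elements for a filler."""
    bindings = {}
    for role, restriction in requirements.items():
        fillers = [c for c, props in candidates.items() if restriction in props]
        if not fillers:
            return None                     # context cannot satisfy the lexical variable
        bindings[role] = fillers[0]
    return bindings

context = {"the child": ["animate"], "the sandwich": ["edible", "inanimate"]}
print(satisfy(LEXICON["devour"]["requires"], context))
# {'agent': 'the child', 'theme': 'the sandwich'}
```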
27.3 Full-entry-based lexical processing and the case of for+time phrase composition
Focusing on one phenomenon in depth enables us to understand the various levels of analysis that are required to capture the connection between a linguistic representation and its manifestation through sentence comprehension or production, and to determine the degree of generality that these levels of analysis potentially afford us, which will in turn shed light on the properties of the larger architecture of the system that we are trying to describe. Accordingly, this Section is about one question: the source of the iterative reading in sentences like “Sue sneezed for (almost) a minute” and the implications of a viable analysis that can capture it. I choose this question for
the following reasons: (1) a lot has been revealed about the properties of this reading, and from more than one perspective: semantic analysis, real-time comprehension behavior, and neurological behavior. This body of work has shown, in turn: (2) consensus that it targets the semantic compositional system almost exclusively; (3) that it is not an isolated case but, arguably, one clear manifestation of standard lexically-driven meaning composition; and (4) that, as we will see, its solution will hinge on the right description of the lexical meaning for for. And this illustrates the point that not only the major lexical categories like nouns and verbs, but also prepositions that can sometimes be classified as “functional” in the sense of containing minimal semantic content, can be the bearers of rich meaning structure that is capable of triggering widespread semantico-conceptual compositional processes. So, in sacrificing breadth for depth we seek to gain a more precise understanding of how sentence meaning composition could possibly be connected to the lexical meanings that are contained in it, in a cognitively viable manner. One reason that sentences like “Sue sneezed for (almost) a minute” are of interest is that the sense of iteration they saliently convey, in this case repeated sneezing, does not seem rooted in any of the individual words, or any morpho-syntactic element, in the sentence. No one lexical item in the sentence can be said to bear the iterative meaning. So, the question arises as to what gives rise to such an otherwise strongly preferred reading. This is an important question because in principle the existence of this kind of reading would appear to challenge strict compositionality: the idea that sentence meaning is the result of the lexical meanings and the way in which they are put together through syntactic composition (e.g., Partee, 1995 and references therein). The challenge emerges from the observation that under strict compositionality it falls on syntax to provide the combinatorial means that support sentence meaning composition. However, that appears not to be the case in iterativity generation. Indeed, whatever compositional mechanism is at play, it appears to be not only predictable (i.e., rule-driven) but exclusively semantic. If that were to be the case, it would suggest that not only syntax but also the meaning system can provide combinatorial mechanisms of its own. This would then support a view of sentence meaning composition whereby multiple compositional systems are acting in parallel as real-time comprehension unfolds and as the full-entry model predicts. Therein lies the relevance of this case for our understanding of the connection between lexical structure and sentence meaning composition—the focus of this chapter. We start with two questions: What is the compositional mechanism that gives rise to iteration? Where in the system is it encoded? The term aspectual coercion was originally proposed to refer to the mechanism of meaning composition that allowed iteration to be construed. The conditions were specific. They were proposed to involve a semelfactive verb (e.g., jump, sneeze, clap) and a temporal modifier (e.g., for an hour, until the morning). Crucially, the temporal modifier was aspectually incompatible with the telicity of the predicate, thus creating an aspectual mismatch that must be resolved for the sentence to be acceptable.
The term coercion then reflects the observation that whatever mechanism resolves it forces a repair of the mismatch, in this case in the form of iteration insertion (e.g., Verkuyl, 1993;
Moens and Steedman, 1987; Pustejovsky, 1991, 1995; Briscoe, Copestake, and Bougarev, 1990; Jackendoff, 1997; Piñango, Zurif, and Jackendoff, 1999). From the beginning, the consensus was that whatever the source, the sense of iteration had to be rooted in some aspect of the semantic composition of the sentence. Even when notationally different, all proposals within the coercion approach share two ideas: (1) the iterative reading emerges as the resolution of a mismatch between a telic verbal predicate and a temporal modifier, and (2) the mismatch is caused by the fact that the telic predicate is inherently temporally bounded and, through composition, has an additional, external temporal boundary imposed on it. At this point, and to save the composition, a semantic ITER operator, bearing the iteration meaning, is inserted in the semantic representation of the verb+temporal adverbial with the purpose of repairing the mismatch (see Jackendoff, 1997, pp. 52–53, for an example of such a proposal). Independently of its coverage, the operator insertion approach is useful because it embraces purely semantic combinatoriality, a situation that brings this phenomenon in line with other coercion-like cases, also claimed to result from semantic generative processes (see Deo and Piñango, 2011; Katsika, Braze, Deo, and Piñango, 2012 for an overview). For the aspectual case, semantic composition is temporarily blocked by the telic predicate+durative modifier mismatch, and a semantic-only solution, packaged as an operator, rescues it, generating in the process an additional meaning, the iterative reading. By the same token, it presents a problem for a lexicalist approach to grammatical encoding. Specifically, there is no independently established way by which this ITER operator could be lexicalized, as it does not have any kind of morphophonological association, a pronunciation, that would place it in the lexicon. To be sure, the other direction is robustly attested: “empty morphs,” for example thematic vowels in Romance languages (Anderson, 1992, Ch. 3), possess morphophonology but no meaning and in this way illustrate incomplete lexical items (see also Jackendoff, 1997; Jackendoff and Audring, 2020 for extended discussion of this possibility of lexical manifestation). But operators such as ITER do not naturally fall within the “incomplete” lexical item set; they possess only meaning and therefore have no independent mechanism to connect themselves to a lexically-driven linguistic meaning composition process.5 So, while the effects of ITER-insertion capture the possibility of morphosyntactically independent meaning expression, and even meaning creation, the absence of a lexical characterization weakens its viability. It keeps it outside of the lexical system and thus not readily available for examination from a comprehension perspective. This concern notwithstanding, the ITER operator approach afforded, at the time, deeper neurocognitive examination. Starting with Piñango, Zurif, and Jackendoff (1999), two questions drove that exploration that skirted the encoding issue: (1) Is
5 Jackendoff (1975) discusses this issue extensively. At the time, this concern was built into the full-entry approach through the assumption that semantic redundancy rules (S-rules) were not only separate from morphological (M-)rules, but were also only warranted on the basis of a morphological relation: “Hence we must require a morphological relationship before semantic redundancy can be considered” (Jackendoff, 1975, pp. 650–655).
ITER-insertion isolable in real-time comprehension? If so, (2) is its time-course consistent with other manifestations of semantic composition? These questions got at the real-time visibility of the semantic compositional process. The answer to both questions is a robust “yes.” Table 27.1 contains a representative sample of the experimental record as originally reported in Deo and Piñango (2011). It includes the various linguistic means by which the coercion effect has been isolated, the experimental task used, the specific observation reported, and the converging ways in which the reported effect was interpreted. Overall, the observations from this experimental record are the following: (1) Regardless of linguistic manipulation, there is a measurable effect of composition of telic predicate and durative phrase modifier, generating an iterative reading; (2) The effect is interpretable as processing cost; (3) The effect is observable as delayed cost relative to the point in the sentence where it is triggered (Piñango et al., 1999, 2006; Downey, 2006; Brennan and Pylkkänen, 2008), suggesting a slow-to-develop compositional process. (4) The processing effect also has a neurolinguistic manifestation involving two specific brain regions: left posterior superior temporal cortex (Wernicke’s area) and ventromedial frontal cortex (Piñango and Zurif, 2001 and Brennan and Pylkkänen, 2008, respectively), consistent with localization of other semantic compositional processes. This said, (5) the effect is observable only when the durative phrase is headed by the preposition for, and not when it is headed by the preposition until (Downey, 2006). This latter observation directly questioned the coercion generalization. Nevertheless, these observations, while not fully converging, were seen as supportive of a compositional process beyond the semantics of the verbal predicate, and not necessarily inconsistent with an ITER-insertion approach.
27.3.1 Challenges to the operator-based approach
It is within this context that Deo and Piñango (2011) bring up two linguistic observations that fundamentally challenge the traditional characterization of aspectual coercion as resulting from an aspectual mismatch and of ITER operator insertion as a possible rescue operation: (1) an iterative reading similar to that observed in aspectual coercion may arise in the absence of a verb-modifier mismatch, and (2) an iterative reading need not always arise in the presence of a telic predicate + durative phrase (Deo and Piñango, 2011, pp. 306–307). Observation (1) is illustrated by sentences such as “Mary sang a lied/ran a mile/swam two miles for two months” or “Sam walked/drove to the university for a year.” In these sentences, the verbal predicates in question are telic, and are modified by a durative phrase (for two months, for a year), a situation that predictably induces iteration. These cases are problematic for the ITER analysis because they elicit iteration, yet there is no mismatch. To account for this, iteration in these cases has been attributed to extra-sentential pragmatic inference. It works in this way: the knowledge of a typical event of “singing a lied”/“running a mile” or “swimming two miles” leads to the inference that if
Table 27.1 Experimental record on Aspectual Coercion comprehension, 1999–2008 (Deo and Piñango, 2011). Each entry lists the study, task, and minimal pair tested, followed by the observation and the interpretation of the effect.

1. Piñango et al. (1999, 2006). Cross-Modal Lexical Decision (CMLD).
(a) The man examined the little bundle of fur for a long time to see if it was alive. (b) The man kicked the little bundle of fur for a long time to see if it was alive.
Observation: Increased reaction time for (1b) 300 ms after the adverb, but no difference right at the adverb.
Interpretation: Processing cost induced by “iterative meaning without morphosyntactic support.”

2. Piñango and Zurif (2001). Focal Brain Lesion (Aphasia) Question–Answer Task.
(a) The horse jumped over the fence yesterday. Once or many times? (b) The horse jumped for an hour yesterday. Once or many times?
Observation: Wernicke’s aphasics performed at chance (guessed) for (2b) and above chance for (2a); Broca’s aphasics performed above chance for both (2a) and (2b).
Interpretation: “Implementation of ITER recruits Wernicke’s area (left posterior superior temporal and posterior lower parietal cortex).”

3. Todorova, Straub, Badecker, and Frank (2000). Self-Paced Reading.
(a) Even though Howard sent a large check to his daughter for many years, she refused to accept his money. (b) Even though Howard sent large checks to his daughter for many years, she refused to accept his money.
Observation: Increased reading time at the temporal adverb in (3a) vs. (3b).
Interpretation: Cost attributed to (a) “telic commitment” or (b) “repair via ITER-insertion.”

4. Downey (2006). Event-Related Potential (ERP).
(a) The girl dove into the pool for a penny. (b) The girl dove into the pool for an hour.
Observation: Sustained centro-parietal positivity starting at 300 ms after the adverb; present for for-phrases but not for their until counterparts.
Interpretation: Increased activity results from “semantic generation of iterative interpretation.”

5. Brennan and Pylkkänen (2008). Self-Paced Reading and Magnetoencephalography (MEG).
(a) For forty-five seconds, the computer beeped in the busy lab. (b) After forty-five seconds, the computer beeped in the busy lab.
Observation: Increased reading time at the verb in (5a); increased brain activity for (5a) in the VMF around the verb (437–452 ms).
Interpretation: Cost attributed to (a) “repair via ITER” or (b) “pragmatic adjustment.”
the denoted events occurred “for two months/a year” they had to have occurred in an iterated fashion. Observation (2) is illustrated by sentences such as “Jo read a book/built a sand castle/baked a cake for an hour.” Sentences like these also have a telic predicate and are modified by a durative phrase. So they are predicted to have an iterative reading, contrary to fact. Their interpretation is durative. These cases therefore show that telicity of the modified predicate is not a necessary component in the generation of an iterative reading. In sum, observation (1) tells us that an ITER-insertion approach does not exhaust the possible approaches to capture iteration generation. Observation (2) tells us that the telicity properties of the verbal predicate do not by themselves predict the presence of an iteration generation effect. Deo and Piñango (2011) address these issues by proposing that the locus of the iterative generation observed lies not in the aspectual properties of the predicate but in the semantics of the for-adverbial.6 Their account states that the lexico-semantic structure of for introduces a regular partition of intervals (i.e., a set of collectively exhaustive, non-overlapping, equimeasured subsets). Regardless of predicate, for-adverbial modification provides the potential to generate an iterative reading. That is so because both durative and iterative readings involve the setting of a partition measure. The actual measure of such intervals, the set of subsets, is a variable. This variable is central to the composition of the for+time phrase because its value determines the extent to which iteration will be interpreted. And herein lies the compositional effort. For the for-phrase to be interpretable, and by extension the sentence within which it is contained, a partition measure must be searched for. The search space involves the other lexical meanings in the sentence and the larger discourse of the utterance. This process is triggered by the lexical requirements of the for-adverbial phrase alone, irrespective of the telicity properties of the verbal predicate or the specific temporal constraints of the time phrase. So, in this way, this solution not only has better coverage than the ITER-insertion approach; it also has a greater degree of generality, since it predicts that any sentence that contains a for+time phrase composition will demand the saturation of a variable, the partition measure. Finally, this analysis provides an answer to our initial questions: what is the mechanism of iteration generation? The mechanism is simply the standard process of composition of the lexical meaning of for and the time phrase with the rest of the sentence’s meaning. Where in the system is the mechanism encoded? It is part and parcel of the lexical representation of for and it is triggered by lexical retrieval. Figure 27.2 sketches a lexical representation for durative for formulated in model-theoretic terms, on the basis of the semantic representation proposed by Deo and Piñango (2011). This meaning representation states the following. The interval associated with
6 As mentioned, circumscribing the phenomenon to for-adverbials was motivated in large part by processing results reported by Downey (2006) showing that the extra cost was only observable with for-adverbials, in contrast to until-adverbials, and also by the observation that, semantically, these two adverbials exhibit very different properties (see Deo and Piñango, 2011 for some discussion).
Figure 27.2 Lexical entry for English durative for-time adverbial.
the DP following “for” is partitioned into subintervals in context. The length of each cell of the partition is a free variable whose value is contextually determined. Accordingly, a phrase of the form “for x-time (P)” is true at an interval i only if the duration of the interval is x-time and every member j of the contextually determined regular partition of the interval coincides with the meaning of the predicate. The predicate P coincides with the interval i if the predicate P is instantiated within the interval i or at a superinterval of i. A regular partition of an interval i is a set of non-empty, collectively exhaustive, mutually exclusive, and equimeasured subsets of i. Each of its subsets has the same partition measure (Deo and Piñango, 2011).7 One advantage of this universal quantifier analysis of for-adverbials is that it builds on the independently motivated context-determined temporal partitions (Deo, 2009) and on the idea of intermediate distributivity, also independently motivated (Schwarzschild, 1996; Champollion, 2010a, 2010b, 2017). It neither places idiosyncratic restrictions on the arguments of for, nor does it appeal to reading-specific operator insertion in accounting for the readings associated with them. It assumes the most conservative processing system by attributing the comprehension cost to a generalized search process involved in the retrieval of a partition measure without which the linguistic expression cannot be felicitously interpreted. A key objective of this analysis is accounting for the observation that “Punctual iteration,” such as “jumped for an hour,” is only one kind of iteration. The sentence “Sue swam for a year,” with a durative verb, also engenders an iterative reading with multiple swimming events. Such iterativity is triggered by typicality considerations for what a normal partition measure for human swimming could be.8 This is the “Durative iteration”
7 One could think of this reading as potentially subsumable under the habitual reading of imperfective aspect. Imperfective aspect is standardly conceived as denoting a property of a situation that holds over some interval of time. Specifically, it implies that a predicate under its scope presents the Subinterval Property (Bennett and Partee, 1972): if a predicate P is true at some interval I, it follows that the predicate P is true at all (relevant) subintervals of I. The Imperfective domain is usually considered to encompass two meanings: the imperfective and the progressive. In the case of “Jo sneezes for a minute,” the sentence radical holds at all relevant subintervals of the interval under consideration, akin to the habitual reading effect. In this way, so-called lexical aspect of this sort could begin to be understood under the larger grammatical aspect system as a case of habitual (see Fuchs and Piñango, in press; Fuchs, Piñango, and Deo, in press; and Fuchs, 2020 for further discussion of the lexico-conceptual system underlying linguistic aspect).
8 And these typicality considerations are expected to emerge in the standard manner: as the result of the comprehender’s lifetime experience with the given conceptual construct.
reading. What unifies the durative and punctual iterative readings is the implication that in both cases an interval-based reading must obtain: When the verb is semelfactive (punctual), jump in “jumped for an hour” expresses intervals of jumping. When the verb is durative, swim in “swam for a year” expresses intervals of typical sessions of swimming. Both kinds of iteration involve a non-infinitesimal partition measure that makes reference to distinct events overlapping with each cell of a regular partition. A non-infinitesimal partition is thus the kind of partition measure that underpins an iterative reading. By contrast, when the partition measure is set to infinitesimal (i.e., there is no gap between the partitions), as in “swim for an hour,” the sentence gives rise to a durative reading. This involves reference to a single event overlapping with all cells of the interval and amounts to universal quantification over moments, such that any moment that overlaps with an interval is contained in it. Deo and Piñango (2011) reason that the infinitesimal partition measure represents a kind of default measure most likely because it is the least compositionally costly as it comes from the immediately preceding verbal predicate, a local and therefore most salient element in the context. Subintervals in the infinitesimal partition are formally the same as those in the non-infinitesimal partition, except that infinitesimal subintervals are cognitively experienced as less salient.9 As mentioned, psycholinguistically, this solution translates into a search throughout the sentential environment, a process that (a) takes time, and consequently (b) can produce measurable processing cost. So, the consequences of this solution can be tested in the following way. Sentence meaning with a for+time adverbial results from the following step-by-step process for the English sentences discussed so far. Step 1: lexical meanings corresponding to the subject and verb are composed with each other. Step 2: the for+time adverbial is retrieved, which introduces an interval that requires specification of a partition measure. Step 3: If the preferred partition measure is the infinitesimal reading (durative reading), the search is over and the for-adverbial’s partition measure requirement has been met. Step 4: If the infinitesimal partition results in a pragmatically odd/implausible interpretation, the processor proceeds to construe a plausible partition measure informed in addition by the typicality of the event in question. This may require a more extended search through the available search-space, a process that results in measurable processing cost. Notably, this is the process that is entailed by the meaning of for as Deo and Piñango (2011) formulate it. And this point is key to the mutually constraining approach of this research program. It is telling us that the structure of the meaning constrains the processing mechanisms that come into play, including the determination of cost.
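Because Figure 27.2 itself is not reproduced here, it may help to sketch the denotation in the spirit of Deo and Piñango’s (2011) proposal. The formula below is a reconstruction from the prose description above, not a verbatim reproduction of the figure; the symbols (the duration function dur, the contextual partition R_c, and the coincidence relation coin) are chosen purely for exposition:

\[
[\![\,\text{for } x\text{-time}\,]\!] \;=\; \lambda P\,\lambda i.\;\; \mathrm{dur}(i) = x \;\wedge\; \forall j\,\bigl[\, j \in \mathcal{R}_{c}(i) \;\rightarrow\; \mathrm{coin}(P, j) \,\bigr]
\]

where \(\mathcal{R}_{c}(i)\) is the contextually determined regular partition of \(i\) (non-empty, collectively exhaustive, mutually exclusive, equimeasured cells, each of measure \(c\)), and \(\mathrm{coin}(P, j)\) holds just in case \(P\) is instantiated at \(j\) or at a superinterval of \(j\). On this sketch, an iterative reading corresponds to a non-infinitesimal value of \(c\), and the plain durative reading corresponds to the infinitesimal (default) value.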
9 An anonymous reviewer points out that this analysis is only one way of accounting for these iterative readings. That is correct. However, I do not know of any other analysis that accounts for these readings in a unified way that roots the solution in standard lexicalization mechanisms, and makes testable processing predictions. Hence the focus on this analysis.
Summing up, the rise of iterative readings with for-adverbials does not depend on one factor only, such as the telicity of the verb. Instead, the construal of an interval’s internal structure emerges from the interaction between typicality of event duration, the length of the measuring interval (associated with DPtime), and the possibility of construing a plausible partition measure that accommodates both. When the length of the measuring interval appears large in comparison to the duration of a “typical” event associated with the predicate, for example, a piano playing practice session, or a drive to the local market, the partition measure construed is also large. When the length of the measuring interval is short in comparison to the duration of a “typical” event in the predicate, for example, the duration of a skip or a sneeze, the partition measure is also short. In both, the source of the iteration is the same, but the outcomes differ according to the larger conceptual meaning associated with the events in question. We take this precise coordination of the multiple constraints to represent a key contribution of the analysis. It allows for the “negotiation,” as it were, of lexico-semantic and (non-linguistic) conceptual-pragmatic knowledge to take place at the level of lexical composition—at the very core of linguistic composition. This not only questions the need for a view of pragmatics as a separate, extra-sentential, level of meaning composition, as is traditional, but it also explains why the resulting interpretation for these cases is iteration. The reading is simply an optimization of the constraints placed by the requirements of for, and those placed by event typicality. So, the very constraint-satisfaction mechanisms assumed for co-articulated lexical composition appear also to be at play when dealing with the integration of semantic and pragmatic meaning (see Trueswell and Gleitman, 2007 for a converging view from acquisition; see also Gleitman and Trueswell, this volume). On this analysis, then, the interpretation of Punctual iteration and Durative iteration alike entails no break in interpretation. Instead, such interpretation requires the satisfaction of the requirements of the lexical items in the sentence. One of these requirements, encoded as a variable in the lexical item of for, includes the retrieval of a partition measure. Nothing extra or exceptional from a compositional perspective has taken place. All necessary complexity has been localized to the lexical meaning of for.
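To make the comprehension dynamics concrete, the step-by-step process and the typicality-driven construal just described can be summarized in procedural form. The sketch below is purely illustrative: the function name, the duration units, and the threshold are invented for exposition and are not part of Deo and Piñango’s (2011) analysis; it simply encodes the logic that an implausible infinitesimal (default) partition measure triggers a further, costlier search informed by event typicality.

    def interpret_for_adverbial(typical_event_duration, interval_length,
                                punctual_threshold=5.0):
        """Expository sketch of partition-measure construal for 'V for <time>'.
        Durations are in seconds; all names and the threshold are illustrative only."""
        # Steps 1-2: subject and verb compose; the for+time adverbial introduces
        # an interval whose partition measure must be specified.
        # Step 3: try the default, infinitesimal partition measure first: a single
        # event overlaps every cell of the interval -> plain durative reading.
        if typical_event_duration >= interval_length:
            return "durative reading (infinitesimal partition, no iteration)"
        # Step 4: the default is implausible (a typical event cannot fill the
        # interval), so a non-infinitesimal partition measure is construed from
        # event typicality; this extra search is the locus of the processing cost.
        if typical_event_duration <= punctual_threshold:
            return "punctual iteration (e.g., 'jumped for an hour')"
        return "durative iteration over typical sessions (e.g., 'swam for a year')"

    # Hypothetical durations, in seconds:
    print(interpret_for_adverbial(1, 3600))       # a jump vs. an hour -> punctual iteration
    print(interpret_for_adverbial(3600, 1800))    # a swim session vs. half an hour -> durative
    print(interpret_for_adverbial(3600, 3.15e7))  # a swim session vs. a year -> durative iteration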
27.3.2 New evidence: English for-adverbials and Japanese kan-adverbials composition
The partition-measure analysis makes a prediction: comprehension of “durative” iteration readings, for example, “Taylor swam in the local pool for a year,” will elicit the same comprehension cost as “Taylor jumped on the trampoline for an hour,” as compared to the non-iterative “Taylor swam in the local pool for an hour.” Two recent experimental reports speak to this prediction and extend it to novel but related cases. We turn to those below.
Table 27.2 Conditions tested in kan-adverbial self-paced reading comprehension (Lai et al., under review)
27.3.2.1 English for-adverbials
Piñango, Lai, Foster-Hanson, Lacadie, and Deo (2016, in prep.) test the predictions of the partition-measure retrieval account. Using self-paced reading, they compare the processing of Punctual Iteration with that of Durative Iteration (e.g., “Anna jumped for an hour” vs. “Anna jogged for a year”) vs. No Iteration (e.g., “Anna jogged for ten minutes”). Planned comparisons show a significant difference between No Iteration and Punctual Iteration at the adverbial window, replicating previous work. Crucially, they also show a significant difference between No Iteration and Durative Iteration at the adverbial window. Finally, they report no difference between Punctual Iteration and Durative Iteration (Piñango et al., 2016, in prep.).
27.3.2.2 Japanese kan-adverbials
Lai, Piñango, and Sakai (under review) examine the behavior of Japanese -kan, translated into English as for, in sentences like “The athlete jumped for 20 minutes,” crossing two verb types (Punctual: jump/Durative: jog) and the length of the intervals denoted by for-adverbials (Short: ten minutes/Long: a year). Table 27.2 shows a sample sentence set. This work seeks to replicate the observations from English for. If successful, the replication would suggest that the compositional mechanisms of -kan are similar to those of English for and, consequently, that their lexicalized meanings are also formally similar. Moreover, Lai and colleagues extend the iterative paradigm to include “gapped” iteration. This reading emerges from combining a punctual/semelfactive predicate and a long temporal interval, as in the sentence “The athlete jumped for two months,” which in Japanese provides the reading of iterated jumping sessions for the period of two months.
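For orientation, the crossing of verb type and interval length can be summarized schematically. Since Table 27.2 itself is not reproduced here, the glosses below come from the examples quoted in the text, and the expected readings follow the partition-measure analysis; the Python mapping is only an expository reconstruction, not the actual stimulus set:

    # Hypothetical reconstruction of the 2 x 2 design (verb type x interval length).
    kan_conditions = {
        ("punctual verb: 'jump'", "short interval: 'ten/20 minutes'"): "Punctual iteration",
        ("punctual verb: 'jump'", "long interval: 'a year' (cf. 'for two months' in the text)"): "Gapped iteration (iterated sessions)",
        ("durative verb: 'jog'", "short interval: 'ten minutes'"): "No iteration (plain durative)",
        ("durative verb: 'jog'", "long interval: 'a year'"): "Durative iteration (typical sessions)",
    }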
From an initial questionnaire, Lai et al. (under review) report that although all sentences regardless of condition are rated “natural,” the naturalness decreases as a function of meaning uncertainty. Specifically, the degree of naturalness within this range differs depending on the nature of the iteration. No iteration sentences are deemed more natural than Durative and Gapped iteration counterparts. Punctual iteration sentences and Durative iteration sentences are deemed more natural than Gapped iteration sentences. No difference in naturalness is found between No iteration and Punctual iteration sentences or between Punctual iteration and Durative iteration counterparts. It would appear, then, that overall interval length leads to a decrease in perceived naturalness. The authors reason that the source of the decrease in naturalness could result from the effort required to construe a precise partition measure (see Lai and Piñango, 2019 for a similar argument in connection to other phenomena). This effort may be induced by the lack of familiarity with the specific type of partition measure, which would increase the search space. It is not clear, for example, whether people are generally familiar with an athlete’s standard “session of running.” Importantly, any explanation based on felicity conditions would affect the search conditions for the partition measure. This would therefore be consistent with the key implication of the lexical meaning proposed for -kan: the notion that compositional effort is best connected to availability of the meanings that satisfy its meaning variables. Subsequently, Lai and colleagues test those same sentences using self-paced reading (moving window). Here the pattern across iterative conditions unifies: a similar processing profile across conditions up to the verb, with the iterative conditions remaining significantly costlier right after the kan-phrase + verb segment, as compared to the non-iterative counterpart, before tailing off altogether. This represents, therefore, a replication of the standard for-adverbial comprehension effect and a key extension to include gapped iteration, predicted by the analysis to reveal a converging pattern but never tested before.10 Altogether, the processing evidence from English and Japanese is consistent with the partition-measure retrieval analysis. This analysis, together with standard assumptions about lexical retrieval and sentence comprehension, captures the attested readings and the processing profile reported—a profile that can now be understood simply as resulting from a process of searching through the local conceptual structure/pragmatic space in a lexically-guided manner. This analytical approach to meaning composition, whereby the underspecified content is codified as a variable to be satisfied
10 This work also reports results from a comparison that probes the interaction of -kan comprehension with context sensitivity as measured by the Autism Quotient questionnaire. Study participants are categorized as either high or low autism quotient (AQ), independently hypothesized to manifest low vs. high context sensitivity, respectively. Results show that the decrease in reading time tails off only for high AQ participants, suggesting that those participants’ ability to determine a plausible partition measure for the iterative conditions and thus achieve a successful semantic representation is different—it is slower.
This would suggest, in turn, a greater degree of interaction than traditionally thought between sentence meaning compositional processes and their non-linguistic cognitive embedding. See Lai et al. (under review) for further discussion.
through context, bridges, in an organic manner, the local lexical process with the harder-to-measure “context,” including plausibility and world-knowledge considerations. A requirement of partition-measure retrieval formally links the lexical semantics of for and the larger communicative situation, a natural consequence of a full-entry lexical structure (see Piñango, 2019 for further discussion). At the same time, the resulting analysis highlights the importance of the psychological commitment that constrains this whole enterprise: whereas there may be more than one way to describe the meaning of for that predicts all the iteration cases discussed, only a subset of those ways will be (a) lexicalizable, (b) able to predict the processing and neurological profile reported, and (c) able to exhibit the contextual effects attested that connect its resolution to the individual’s non-linguistic knowledge system. And this is one of the key points of this whole discussion: to demonstrate the necessary, mutually constraining connection between the structure of the analysis, the compositional effects (i.e., possible readings), the processing consequences, and how examining all these elements together can represent a fruitful path of inquiry into the larger questions in the neurocognition of language, such as the architecture of the language system, its cognitive embedding, and its relation to the mental lexicon.
27.4 Concluding thoughts: General implications of a full-entry approach to lexical structure
27.4.1 A lexically-driven view of linguistic meaning composition
The exploration presented here has been motivated by the search for a metric for meaning composition grounded in two questions: what are the units of linguistic meaning and how do they operate such that we see the compositional effects that we do? The full-entry model that we have capitalized on here provides us with the answer that linguistic meanings—whatever they are—appear not to “live” in isolation as discrete entities; rather, they present themselves as segments of conceptual structure that are made available for linguistic use through an association with a pronunciation, and with morphological and syntactic specifications that determine their local linguistic environment during production and comprehension. In other words, the units and their means of composition are all of a piece. As we have shown here, the psychological commitment constrains the kinds of structures that we can posit as valid, and forces us to consider each subsystem (semantic, phonological, morphological, syntactic) in connection to the others but as independently combinatorial. That is, in essence, what the notion of lexical items as abstractions represents: dynamic building blocks of a system of multiple levels
THE STRUCTURE OF THE LEXICAL ITEM 581 of information acting in tandem, whose units have the very structure of the system that contains them. The psychological validity of this system is further supported by the necessity to invoke a memory basis for both the mental lexicon (long-term memory) and production and comprehension processes (working memory), the bridges to cognition. The lexical item is thus a unit for storage and for composition. In this respect, both production (recall) and comprehension (recognition) are making use of the same mental lexicon through the same mechanism, lexical retrieval, to different effects. Indeed, evidence suggests that recall and recognition are not disconnected, and that in fact some of the comprehension effects normally attributed to the behavior of syntactic processing can be accounted for by the lexical association stemming from production behavior of the lexical items involved (see MacDonald, 2013 for a specific proposal of how this connection can be modeled).
27.4.2 The lexicon and meaning dynamics
A full-entry approach allows us to preserve the connection between the conceptual segment that represents the lexical meaning and the larger conceptual structure from which it was obtained in the first place. It is this connection that allows it to remain subject to shifts in context and shifts in usage across speakers. This view of the lexical item and the mental lexicon solves the problem of the necessity of a grammar that is separate from the production/comprehension processes that make it visible in the first place. How do we decide what lives in the grammar and what does not? Systematic association with a pronunciation. The meaning that enters the mental lexicon does so because it has been linked to morphophonological material. Conversely, any meaning structure that is no longer associated with a specific pronunciation either dies (it is forgotten) or becomes systematically associated with another meaning itself associated with a pronunciation (see Goldberg, 1996; Bencini and Goldberg, 2000; Bresnan and Ford, 2010 for a constructionist implementation of this idea). If the mental lexicon has all the information for how we use lexical items, what is the grammar? The answer that follows from a full-entry view is that “the grammar” is a disembodied construct, useful to characterize the patterns of universal linguistic knowledge structure and of language-specific generalizations, but without psychological basis as an actual isolable neurocognitive structure. Instead, what amounts to the grammar for any given language is the knowledge structure emerging from a collective of lexical items that can be used during comprehension and production by a speech community in a mutually intelligible manner. This means that linguistic proficiency amounts to the size of the lexical collective that has been committed to memory and how robust this memory commitment is, for any given speaker. Such commitment is in turn mediated by the degree and quality of exposure of the speaker to the language: how much, and in how many contexts, the collective of lexical items is used during comprehension and production. It is this individualized instantiation that makes possible meaning variation
across speakers, whereas what propels change is the possibility of transmission of variation across lexical items themselves organized as semantic classes (see Deo, 2009, 2015, and references therein for extensive discussion). This view also provides an approach to acquisition. In a full-entry system, the lexical item scheme—that is, the knowledge that for linguistic input to qualify as such it must encode associations of phono-morpho-syntactic and meaning information—represents the search guide that children use to “extract” information from the data they encounter. These data are themselves “mixed” in granularity—they come in the form of complete sentences, words, phrases, and morpho-syntactically incomplete segments. Interestingly, this does not represent an impediment to acquisition (e.g., Pinker, 1989; see also Grigoroglou and Papafragou, this volume). A full-entry-based lexicon predicts this to be the case since the structure of the lexical item itself represents the constraining filter. Once a potential lexical structure has been identified, obtaining the right level of resolution (i.e., morphological, phrasal, etc.) is only a matter of further exposure; all lexical items pose the same challenge to children, which is to extract their combinatorial information. (See Trueswell and Gleitman, 2007 and references therein for a specific proposal of such a lexically-driven acquisition system; see also Gleitman and Trueswell, this volume.) In sum, the full-entry-based approach to linguistic meaning composition presented here places the structure of the grammar inside the internal structure of the mental lexicon, the holder of the grammar, and inside the lexical item, the unit of the grammar. It makes the “grammar of a language” the body of knowledge distributed across the lexical items of that language. In this way, the lexical item built in a full-entry fashion distributes the combinatorial and generative burden across all subsystems in a constraint-satisfaction dynamic. For the semantic subsystem, it creates an expectation of a precise and highly structured psychologically-viable lexico-semantic representation. The excursion through English for is intended precisely as a demonstration of what such an expectation entails.
Chapter 28
On the Dynamics of Lexical Access in Two or More Languages
Judith F. Kroll, Kinsey Bice, Mona Roxana Botezatu, and Megan Zirnstein
28.1 Introduction
The lexicon has played a critical role in the history of research on bilingualism. Early models considered whether the two languages might be represented independently or as an integrated system (see Kroll and Tokowicz, 2005, for a review). A surge of research on visual and spoken word recognition and word production then demonstrated compelling evidence for the co-activation of the language not in use (e.g., Dijkstra, 2005; Marian and Spivey, 2003; Kroll, Bobb, and Wodniecka, 2006). Bilinguals are not monolingual-like in the way that they understand and speak words in each language. The evidence on bilingual word recognition has been used to model accounts of language nonselectivity, with interactions hypothesized to cross freely between languages across semantics, phonology, and orthography (Dijkstra and Van Heuven, 1998, 2002). Debates about how a newly learned second language (L2) might come to be connected to the existing native language (L1) in adult learners have focused not only on transfer at the level of the grammar but also on the lexicon, to understand if and when the L2 might come to be functionally independent of the L1 (e.g., Brysbaert and Duyck, 2010; Kroll and Stewart, 1994). Critically, the lexical interactions across the two languages appear to characterize bilingual experience in many different forms. They are seen for individuals at the earliest stages of L2 learning (e.g., Bice and Kroll, 2015), for learners who acquired the two languages in childhood or beyond, and for those who are highly proficient in both languages but for whom the two languages do not share the same written script (e.g., Hoshino and Kroll, 2008). In what may be the most dramatic case, cross-language
interactions are also observed for hearing and deaf individuals who use one signed and one spoken or written language (e.g., Emmorey, Borinstein, Thompson, and Gollan, 2008; Morford, Wilkinson, Villwock, Piñar, and Kroll, 2011; and see Mayberry and Wille, this volume). The robustness of the co-activation of the bilingual’s two languages suggests that the two languages come to rely on shared networks of neural activation and control that themselves do not depend on structural features of each of the languages. Indeed, a series of neuroimaging studies has shown that the same neural tissue supports the processing of both languages (e.g., Abutalebi and Green, 2008; Perani and Abutalebi, 2005; and see Nozari, this volume on the neural processes that support lexical production). At the same time, the parallel activation of the two languages creates a potential problem in that typically only one language must be selected as the target language. In the sections that follow, we consider a range of phenomena that illustrate how the dynamics of cross-language activation give rise to the engagement of control mechanisms that reflect the regulation of the bilingual’s two languages and the involvement of domain general cognitive resources. We demonstrate that dynamic changes in both languages can be seen in early stages of L2 learning, that they affect the L1 as dramatically as or even more dramatically than the L2, and that they are also seen in the most proficient bilinguals. Critically, these changes depend not only on the state of the learner or bilingual but on the interactional contexts in which they find themselves immersed (e.g., Green and Abutalebi, 2013). The resulting status and availability of the two languages will ultimately reflect both contributions. Although our focus in this review is on the lexicon, we will argue that the plasticity that is illustrated at the lexical level reflects the bilingual’s language system more generally.
28.2 Language immersion
Although the phenomenology associated with language experience needs to be interpreted with caution, individuals frequently report the feeling that they are losing their native language when immersed in a second language during travel or study abroad. In this instance, the research on language immersion largely supports the phenomenology, with evidence for reduced access to the L1 while immersed in the L2. Linck, Kroll, and Sunderman (2009) compared the performance of two groups of L2 learners on a set of lexical comprehension and production tasks. The learners were all native English speakers at intermediate levels of learning Spanish as the L2. One group was studying in a classroom setting in the US while the other was in Spain, studying Spanish while traveling abroad. Although both groups were highly dominant in English as the native and much more proficient language, the L2 immersed group produced less English than the classroom group, suggesting that the context had the consequence of reducing access to the more dominant language. Performance on a lexical comprehension task for the same learners revealed a pattern of cross-language interaction consistent with the observation that the immersed learners had acquired greater proficiency
On The Dynamics of Lexical Access In Two or More Languages 585 than the classroom learners in that they were less sensitive to interference from L1 lexical distractors. Linck et al. also showed that the reduction in native language production recovered once learners returned to the English dominant environment, demonstrating that these changes reflect the ordinary dynamics of multiple language use, not language attrition. Other studies have reported similar findings (e.g., Baus, Costa, and Carreiras, 2013; and see Kroll, Dussias, and Bajo, 2018, for a review of the findings on language immersion more generally). The findings on native language lexical change under conditions of L2 immersion have a number of important implications. One is that L2 learning may rely on adjustments to the native language that enable the L2 to become part of an integrated language system. Traditional accounts of L2 learning have focused on the ways that the L1 is recruited to enable transfer to the L2 (e.g., MacWhinney, 2005). What we have learned in the recent studies is that cross-language interactions are also bidirectional, with the L2 affecting the L1, even during the earliest stages of L2 learning (e.g., Bice and Kroll, 2015). In the sections that follow, we discuss cross-language lexical interactions and the mechanisms that may come to control or regulate the L1. In some cases, exposure to, and use of the L2, affects the processing of the L1. In other cases, the L1 itself requires control to enable the proficient use of the L2 and potential inhibition of the L1, typically more dominant than the L2. As we discuss, a central question in this research is to understand how the mechanisms of language regulation are related to domain general cognitive control. Critically, the observed changes to the L1 that are seen during immersion in the L2 may not be restricted only to early stages of L2 learning. Instead, they appear to characterize the ordinary variation in lexical processes that accompany multiple language use across different interactional contexts for even highly proficient speakers. Although the past research has tended to categorize language immersion into native vs. non-native contexts, the emerging studies show that varied language environments, both within the L1 and L2 contexts, are likely to differ in the demands that they place on speakers and in the resulting control processes that become engaged (e.g., Beatty-Martínez, Navarro-Torres, Dussias, Bajo, Guzzardo-Tamargo, and Kroll, 2020; Pot, Keijzer and De Bot, 2018). Notably, modulation of these processes can also be induced in the laboratory. When the L1 is spoken following the L2, there is evidence both at the level of lexical language switching, from one word to the other (e.g., Meuter and Allport, 1999), and at the level of contextual switching to L1 after extended lexical production in L2 in a longer sequence, that shows that the L1 is inhibited for some period of time (e.g., Misra, Guo, Bobb, and Kroll, 2012; Van Assche, Duyck, and Gollan, 2013). Although there is discussion and debate concerning the different levels at which these inhibitory control processes might operate (e.g., Guo, Liu, Misra, and Kroll, 2011; Kleinman and Gollan, 2018), the overall finding is clear. Bilinguals inhibit the L1 to enable the L2 to be produced. 
Whether the inhibition of the L1 under these circumstances provides a precise model of the documented suppression of the L1 when learners and bilinguals are actually immersed in an L2 environment will remain to be determined as more is understood about the neural networks that support the engagement of cognitive resources during language processing (e.g., Hsu, Jaeggi, and Novick, 2017).
586 J. F. KROLL, K. BICE, M. R. Botezatu, AND M. ZIRNSTEIN In addition to identifying the levels at which inhibitory control processes function more generally, there are also differences among bilingual speakers that may reflect distinct choices in patterns of acculturation more than cognitive abilities and/or resources alone. Some bilinguals immersed in a predominantly L2 context will succumb to the power of the environment and eventually become more dominant in the L2 such that the L2 replaces the L1 as the dominant language. Other bilinguals will retain dominance in the native L1, despite the power of the external environment. While differences in the likelihood of switching language dominance may be partly a function of inhibitory control ability (e.g., see Beatty-Martínez et al., 2020), they are also likely to reflect choices made by communities of bilingual speakers to preserve their native language and culture. Zhang, Kang, Wu, Ma, and Guo (2015) reported a training study in which Chinese-English bilinguals in Beijing first performed the AX-CPT, a measure of inhibitory control, followed by ten days of language switching training on a picture naming task, and then a final session of AX-CPT, to determine whether inhibitory control was affected by training. The AX- CPT provides an index of both proactive and reactive control processes. Zhang et al. found an increase in proactive control for bilinguals following language switching training, coupled with inhibition of the L1 in the picture naming task, such that the time to name in Chinese, the L1, was slower than the time to name in English, the L2. When a similar study was conducted in the US, with Chinese-English bilinguals immersed in an English dominant environment, there was no improvement in proactive control but a significantly larger inhibitory effect for the L1 in picture naming (Zhang, Diaz, and Kroll, in preparation). Why? The Chinese-English bilinguals in the US remained highly dominant in Chinese as the L1. Because the context in this study was one that was largely made up of monolingual English speakers, the authors hypothesized that the cognitive resources required to maintain Chinese dominance, when most others around them did not speak Chinese, had effectively trained their inhibitory control to a level that was then insensitive to the experimental training. A comparison across the two studies on the initial measure of proactive control supported that conjecture, with higher initial levels of proactive control for the Chinese-English bilinguals in the US than those in China. The accompanying larger inhibitory effect for L1 in picture naming was consistent with that interpretation. The implication is that being a proficient bilingual itself does not determine the manner in which the two languages are regulated. The context of language use is crucial. In the sections that follow, we consider the role that these control dynamics play in determining language processing and new language learning and how they may be modulated by individual differences. We further examine the scope of L1 changes during L2 learning. Our review suggests that language experience needs to be conceptualized as a continuum, in ways that include bilingual and monolingual speakers alike, and that revisit lexical phenomena in both comprehension and production that have traditionally been taken to be stable features of native language lexical processing, but that may change as the interactional context changes.
28.3 New learning and the effects of linguistic diversity
The dynamic nature of the learning process modulates lexical access at different stages of proficiency, and not always linearly with proficiency. As discussed, the very initial stages of learning a new language in adulthood have been shown to perturb L1 processing, including L1 access (Bice and Kroll, 2015; Bogulski, Bice, and Kroll, 2019; Bramer, Ren, Henley et al., 2017). Nevertheless, the L1 remains dominant during the earliest stages of learning. As proficiency is gained, the less dominant L2 becomes easier to access at the same time that the learner becomes more practiced and efficient at inhibiting the L1. By the time the L2 begins to be sufficiently automatic and proficient to actually compete with the L1 contenders, the learner has already become somewhat skilled at the intricate engagement and disengagement of inhibitory control. It is important to note that the nature of L1 inhibition throughout the course of L2 learning is not necessarily linear; in some ways, increasing L2 proficiency may reduce the need to inhibit the L1, making the L1 more accessible, but at the same time, increasing L2 proficiency should increase the competition produced by the L2, making the L1 less accessible. These specific dynamics—increasing proficiency in the L2, increasing efficiency in language control (especially the L1), and re-accessing the L1 after considerable inhibition—are subject to variation for each individual learner, depending on various factors such as the context of learning or use (Elston-Güttler, Paulmann, and Kotz, 2005; Kreiner and Degani, 2015). Recent work has shifted to focus on how different language experiences impact the nature of cognitive control in bilingualism and during language learning (e.g., Bonfieni, Branigan, Pickering, and Sorace, 2020; DeLuca, Rothman, Bialystok, and Pliatsikas, 2019; Green and Abutalebi, 2013). Some of these studies have demonstrated that previous language experience impacts new language learning, even for monolinguals who have relatively little “language experience” in terms of dual language use. Bice and Kroll (2019) reported that monolinguals living in linguistically diverse communities demonstrated an initial advantage when attempting to learn a new language that was not part of the linguistic landscape of the community. Compared to monolinguals living in a homogeneously unilingual community, monolinguals living in a community in which numerous languages were spoken and overheard in daily public discourse were faster to become sensitive to the unique sounds of the language and the ways in which those sounds work together to form patterns within words (see Creel, this volume). The study was unable to determine the exact source of the early advantage for the monolinguals in their ability to implicitly extract low-level features of the new language—whether it was the experience of overhearing foreign speech sounds, interacting with non-native speakers, understanding that languages can work very differently or have different structures, and/or other factors associated with living in linguistically diverse communities. Therefore, mere exposure to and interaction with linguistic diversity may increase access to the sounds and words in unknown languages without the learner’s
awareness. Whether these mechanisms in adult learners resemble the shaping of the speech system early in life remains to be seen (see Swingley, this volume). Living in linguistically diverse contexts may also alter how the underlying cognitive architecture supports learning a new language. In another study, Bice, Weekes, and Kroll (in preparation) compared monolinguals living in a unilingual environment (central Pennsylvania, USA) to those living in a diverse metropolis (Hong Kong SAR, China) in their ability to learn vocabulary in the primary language spoken in Hong Kong, Cantonese. Unlike the previous study on linguistic diversity, the focus of this study was to ask how the level of learning impacted lexical access in their only known language (English). Both groups were asked to name a set of pictures in English after attempting to learn the words for the same pictures in Cantonese. The logic was that successful vocabulary learning depends on inhibiting the L1 translation equivalent, so longer latencies in English for the formerly “learned” items should be related to better learning. Preliminary results revealed an interesting modulation of L1 access and engagement of cognitive control. Slower latencies in English for the pictures that had been presented for learning in Cantonese were related to higher accuracy on the vocabulary post-test for monolinguals in both contexts. That is, the hypothesis was confirmed that successful learning depended upon effective inhibition of the L1 (English) translation equivalent, making it more difficult to access those translation equivalents when subsequently required to name them. Even in the earliest stages of learning, reduced L1 access appears to be related to better L2 learning. But the context in which learning occurred was also important, by modulating the way that cognitive control was engaged during learning. Only monolinguals living in Hong Kong revealed a relationship between a measure of cognitive control (performance on the AX-CPT) and the degree of item-specific L1 slowing; monolinguals living in Pennsylvania did not show any evidence of coordination between their learning and cognitive control mechanisms. The pattern of results suggests that successful early learning is related to inhibition of the L1, but also that the interactional context itself may facilitate the coordination of cognitive control mechanisms during language and learning processes (see Beatty-Martínez et al., 2020, for a related discussion). If successful adult language learning depends on the ability of learners to inhibit the L1, and bilinguals have already become adept at L1 inhibition as a function of dual language use, then one might expect that bilinguals could capitalize on this already-acquired skill to be better learners of a new language (L3). Indeed, many studies have reported that bilinguals outperform monolinguals in new language learning (e.g., Bradley, King, and Hernandez, 2013; Kaushanskaya and Marian, 2009a, 2009b; Kaushanskaya, 2012). While several different accounts have been proposed to explain the bilingual learning advantage, a recent study reported strong evidence for the hypothesis that bilinguals learn to regulate their L1. Bogulski et al. (2019) compared monolingual English speakers with three groups of bilinguals in new vocabulary learning.
One group of bilinguals (English-Spanish) learned new vocabulary through their L1, creating a context in which they could rely on their developed L1 inhibitory skills during new learning. The other
two groups (Spanish-English and Chinese-English bilinguals) learned new vocabulary through their L2, English, a language that they had relatively little need to inhibit or regulate. The result was that only the English-Spanish bilinguals learning via their L1 had better vocabulary learning than the monolinguals. A curious finding was that these bilinguals, learning via their L1 and hypothesized to be regulating the L1 under these conditions, were also exceptionally slow to name the English translations of novel vocabulary words. In a separate test of lexical fluency in English, these English-Spanish bilinguals were as fast to name pictures in English as the monolingual English speakers. But somehow during learning, they adopted a strategy that produced very long English naming latencies. Bogulski et al. interpreted this slowing as a desirable difficulty, a concept borrowed from the memory and learning literature to indicate that conditions of learning that encourage elaboration and feedback from errors may produce costs at the time of initial encoding that later map onto benefits in performance (Bjork, Bjork, and McDaniel, 2011; Bjork, Dunlosky, and Kornell, 2013; Bjork and Kroll, 2015). In a post-hoc analysis, Bogulski et al. (2019) exploited the fact that many Spanish-English bilinguals immersed in English in the US switch language dominance to become more dominant in the L2. The Spanish-English bilinguals in this study varied in whether Spanish or English was the dominant language, so it was possible to re-analyze the data in this way. Although the n was small for the reanalysis, there was a marginally significant effect showing that those bilinguals who had become English dominant produced the same pattern of long study latencies and enhanced vocabulary learning shown by the English-Spanish bilinguals. The result suggests that it is the dominant rather than the native language that is important, with an advantage in learning when it is possible to exploit the inhibitory skill associated with the more dominant language (and see section below on language regulation in prediction).
28.4 Individual differences and the language-cognitive control interface
New language learners, whether at the earliest moments of acquisition or during later stages of gaining skill in a second language (L2), have the seemingly insurmountable task of overcoming the strong dominance of their native L1. As discussed previously, inhibitory control can occur at many levels of processing, although most research on this topic has focused on the lexical level, and on the related orthographic, phonological, and/or semantic competition that can easily occur when attempting to acquire new L2 vocabulary. The strong pressures that L1 activation imposes on L2 learning have often been characterized as necessitating suppression of the L1 in order to serve L2 learning. This theoretical description has been heavily influenced by how psychologists and linguists have studied the effects of language switching during naming, and the role that inhibition plays in facilitating naming in more cognitively demanding language contexts
(e.g., speaking the L1 following the L2). The strong focus on L1 inhibition has also been influenced by the assumption that most L2 learning occurs later in life, once the L1 has become heavily entrenched. While more diverse learning and acquisition circumstances are clearly of interest (e.g., simultaneous, early, and even bimodal acquisition), the study of these extreme situations (when L2 and L1 acquisition are highly temporally distinct) has clearly shed light on how bilingual cognition can be impacted by the additional constraints these situations impose. In doing so, they model the cognitive and linguistic processes that are likely to be experienced across diverse circumstances of L2 attainment. In addition to the effect that it can have on late L2 acquisition, L1 dominance can also strongly influence the use of the L2 when the highest levels of L2 proficiency have been reached—and it can do so not only when single words are being processed in isolation but also during more naturalistic and resource-demanding tasks like reading for comprehension. In a study by Zirnstein, Van Hell, and Kroll (2018, Experiment 2), for example, Mandarin-English bilinguals were asked to read sentences in English, their less dominant, but still highly proficient L2, while answering simple true or false statements about what they had read. During this task, participants’ EEG was also recorded and time-locked to specific words in the sentences. Of particular importance was a contrast between the event-related potentials (ERPs) that were elicited in response to words that were highly plausible continuations of a sentence but differed in whether they were semantically expected or unexpected. This distinction was achieved by manipulating the semantic constraint of the sentence context (i.e., words prior to the target word) and whether the word following met the expectations laid out in the context or was an unexpected but plausible option. The goal of the study was to isolate instances of mis-prediction or prediction error by examining the ERPs elicited by unexpected words. Previous research (e.g., Martin, Thierry, Kuipers et al., 2013) had suggested that bilinguals reading in the less dominant L2 did not have sufficient resources to support the generation of a semantic prediction or expectations during online sentence processing. The assumption was that L2 comprehension was inherently non-native, and therefore prediction generation would either not occur at all (i.e., passive integration strategies would be employed in their stead) or would occur too slowly to impact processing in a native-like fashion (i.e., similar to the processing of monolingual comprehenders). Contrary to the earlier research by Martin and colleagues (but see Foucart, Martin, Moreno, and Costa, 2014), Zirnstein et al. (2018) discovered that a select subset of the bilingual participants in the study showed the same ERP profile to prediction error violations as monolingual readers. Specifically, all readers, whether monolingual or bilingual, who were capable of generating predictions and therefore experiencing prediction error brain responses when those predictions were not fully manifested, showed evidence for a late, frontally-distributed positivity in their brain waves.
In addition, for all speakers who showed this brain response, better domain general inhibitory control ability (as indexed by the interference version of the AX-CPT; Braver, Barch, Keys et al., 2001) attenuated the response. Importantly, there was one distinguishing characteristic
between the bilinguals who showed this sensitivity to prediction error in the L2 and those who did not—their L1 dominance and its relationship to the participants' immersion context. In addition to administering reading comprehension and cognitive control tasks, Zirnstein et al. (2018) also assessed the verbal fluency of the bilingual participants using a semantic category production task. Both Mandarin and English were tested separately, and participants were asked to produce as many exemplars as possible within a 30-second time frame for multiple semantic categories in each language (e.g., tools, animals, etc.). All of the bilingual participants were tested and had been immersed for at least one year in a strongly L2-dominant language environment in the US, providing significant support for use of and access to the L2. However, the student community of Mandarin-English speakers in this environment also maintained daily use of their L1, often by forming communities on and off campus. These immersion characteristics and the students' community-based support of their native L1 both had an important impact not only on their verbal fluency performance but also ultimately on what constraints were imposed on their L2 comprehension. For all participants, L2 fluency performance remained relatively stable, as expected from a strongly L2-supportive immersion context. In addition, L1 fluency performance was always higher than for the L2, indicating L1 dominance at the group level. Zirnstein and colleagues asked whether L2 fluency or L1 fluency would be a better predictor of sensitivity to prediction error in the L2. Classic views of bilingualism and L2 processing would suggest that L2 fluency should be the strongest predictor, but that was not the case. Instead, L1 fluency predicted whether the frontal positivity was observed in response to unexpected words, indicating that prediction generation and prediction error updating had occurred. While this finding may appear counterintuitive, previous research with older adult monolinguals had already demonstrated that native language fluency could be used as an individual difference predictor of which comprehenders were more likely to attempt predictive reading strategies, even in the face of the cognitive constraints present during healthy aging (Federmeier, Kutas, and Schul, 2010). Older adult monolinguals, who, by definition, were only reading in their native language and therefore had no immersion-related cognitive pressures to overcome, were more likely to generate a frontal positivity for prediction errors if they had also performed well on a semantic category fluency task. Thus, across these two extremes—healthy aging and bilingual immersion—we can observe patterns where constraints on L1 access are present. As a result, when those pressures are overcome, we see evidence in language processing for the suppression or inhibition of a prior semantic expectation and for the mental model updating required to learn from that experience. Only when individuals are capable of flexibly using the dominant language, despite circumstances that would otherwise have made use of the L1 more difficult, do we see clear evidence for reading comprehension behaviors that rely on language skill and cognitive control ability. A remaining question, following the work reported by Zirnstein et al.
(2018), is how we can characterize the influence of L1 fluency in the strongly L2-supportive immersion context that the bilingual participants found themselves in daily. In previous sections of
this chapter, we framed how necessary it is for language learners to inhibit the dominant L1 in order to support L2 acquisition, but how long is this inhibition process maintained? When do language learners shift from being learners to being bilinguals with some level of proficiency, and how does this change the influence that L1 activation has on L2 use? If L1 inhibition were necessary for successful L2 prediction to occur, then the relationship between L1 fluency and L2 prediction error sensitivity should have been the reverse in the study we have described. Instead, greater L1 fluency facilitated L2 comprehension. How can these two patterns of L1 influence on the L2 be reconciled? We propose that, rather than viewing the influence of the L1 on the L2 as being purely one of interference (and therefore requiring some form of L1 inhibition), bilinguals must learn to flexibly deploy not only inhibition but also disinhibition of the more dominant language in order to use the less dominant language. First termed bilingual "language regulation" by Zirnstein et al. (2018), this skill relies on a bilingual individual's ability to employ a host of executive functions in order to shift the activation state of the L1 and L2 to more appropriately meet the demands of their environment. Environment, in this case, refers to anything from the macro-level (e.g., immersion, language status) to the micro-level (e.g., individual-level proficiency, conversational partner, or context). Instead of merely inhibiting the native or dominant L1, the hypothesis is that multiple activation states may be achieved by bilinguals in order to more appropriately accomplish their linguistic goals. One possibility that stems from this proposal is that "immersion" may be experimentally induced. In a recent study, Zirnstein, Jaranilla, Fricke, and colleagues (2018) presented Mandarin-English bilinguals with sentences like those in the study by Zirnstein et al. (2018) while their eye movements were recorded. As a key distinction between this and the prior ERP study, the bilingual participants were asked to read in their L2 while immersed in Beijing, China, where Mandarin was highly supported and therefore more accessible. While participants completed the reading task, they were also exposed to three background speech conditions: one with no speech, another with English L2 speech, and a third with speech from a language unknown to the participants (i.e., French). The speech in each case resembled ambient overheard speech, such as one might encounter in a café. Eye-tracking results indicated that the introduction of an unknown language was helpful to most participants (reduced reading times for target words) and may have served as an important cue that regulation of the dominant L1 (Mandarin) was required. Intriguingly, the L2 speech condition proved to be either harmful (longer reading times) or helpful (reduced reading times) depending on the level of L2 proficiency of the reader. For those with higher L2 proficiency, hearing the L2 in the background interfered with processing (mimicking informational masking effects found in the bilingual listening comprehension literature; Van Engen, 2010). In contrast, participants with moderate levels of L2 proficiency benefited from hearing the L2 background in a manner similar to hearing an unknown language.
These findings indicate that speaker-related environmental cues, such as hearing a particular language even when no individual speaker can be distinguished, can influence the extent to which bilinguals regulate the
dominant language to support L2 processing. The extent to which these cues support or hinder processing may change as proficiency in the L2 is gained. Bilingual language regulation appears to influence L2 processing across multiple immersion contexts, whether natural or experimentally induced. What remains to be seen is whether these patterns extend to other language modalities and immersion contexts, and to bilinguals with significant language switching, code-switching, or translation experience.
28.5 The consequences of L2 exposure for L1 stability
A theme in the preceding sections is that lexical access is more dynamic than previously understood, with changes in both the native and second language as individuals acquire and use two languages. The native or dominant L1 changes as the L2 is used and, as we have argued above, requires regulation to enable proficiency in the L2. In the section on new learning, we suggested that one manifestation of L1 regulation can be observed in the strategies that bilingual learners adopt to acquire an L3. In the section on individual differences, we suggested that these dynamics can be seen in the interplay of L1 regulation and domain general cognitive control during language processing. But in all of this discussion, we have said little about the locus of regulation itself. How far does it go? If the L1 is regulated to be more or less available to the learner or bilingual, is it an effect that can be seen at a global level, affecting the speed, accuracy, and efficiency of L1 processing? Or can we see evidence that the processing of the L1 itself changes in some fundamental ways? At the level of the grammar, there is evidence that parsing preferences in the native language can change following exposure to an L2 (e.g., Dussias and Sagarra, 2007). In this final section of the chapter, we address the level of native language change during L2 processing as observed in lexical phenomena that have been taken in the literature on monolingual language processing as being stable features of the L1. The studies we review are quite recent but provide exciting support for the idea that at least two features of lexical processing in the native language, the computation of spelling-to-sound mappings during reading and the effects of phonological neighborhood density during the comprehension of spoken words, may be disrupted, at least momentarily, during L2 learning. A well-replicated finding about word reading is that mapping text to sound is facilitated when spellings have one-to-one grapheme-phoneme correspondences, a property known as regularity (e.g., Coltheart, Rastle, Perry, Langdon, and Ziegler, 2001), and when spellings map to a single phonological representation, a property known as consistency (e.g., Jared, 2002). Violations of regularity and consistency generate competing activations of correct and incorrect pronunciations that take time to resolve, causing slower and less accurate naming of words with irregular-inconsistent spelling-sound
mappings. These effects are largest in younger (unskilled) readers (see Treiman and Kessler, this volume), who rely on an effortful decoding strategy (e.g., Schmalz, Marinus, and Castles, 2013; Waters, Seidenberg, and Bruck, 1984), but have been reported to be highly stable in adulthood (e.g., Waters et al., 1984). However, recent evidence suggests that new language learning can impact the competition dynamics involved in the computation of phonology from text in the native language of skilled adult readers. Botezatu, Misra, and Kroll (under review) considered how L2 proficiency might modulate word naming performance in the L1 for a group of native English speakers learning Spanish as the L2. For native English speakers, the L1 has a deep orthography, with many exception words. In contrast, the L2, Spanish, has a shallow orthography, with greater regularity in spelling-to-sound mappings. Botezatu et al. found that the magnitude of the regularity-consistency effect in English word naming was reduced at low levels of L2 proficiency but re-emerged at high levels of L2 proficiency. This counterintuitive finding suggests that during early stages of L2 learning, native speakers of English adjust their reading strategy in English to accommodate the ambiguity that the emergent L2 may add to L1 grapheme-phoneme correspondences. The observation that the regularity-consistency effect re-emerges in high-proficiency L2 learners suggests that the temporary reduction of the effect may capture a period of instability in the native language. A question that emerges is whether the disruption of this fundamental L1 process is particular to readers of two alphabetic orthographies (e.g., Nosarti, Mechelli, Green, and Price, 2010), or whether the effect might be observed in readers of different writing systems for the L1 and L2. A study by Botezatu, Kroll, Trachsel, and Guo (in press, a) examined this question by asking whether the same changes to English observed for native English speakers learning Spanish would also be observed with native English speakers learning Chinese. They found that short-term immersion in a Chinese-speaking environment did not impact the magnitude of the regularity-consistency effect in English word naming for English learners of L2 Chinese living in Beijing, China. The result suggests that the consequences of L2 proficiency for lexical processing in the L1 may depend on the degree of cross-language overlap between the L1 and L2. But Botezatu et al. reported a second finding that qualifies the generality of that conclusion. The same native English learners of Chinese showed reduced L1 access on measures of spoken lexical production and comprehension compared to native English learners acquiring Spanish in a classroom setting. These findings suggest that there may be a range of different consequences of new language learning for the native language that depend not only on cross-language similarity but also on modality. In another set of studies, Botezatu and colleagues have exploited the presence of the phonological neighborhood density effect (e.g., Luce, 1986) in spoken word recognition to examine the consequences of language experience on lexical processing (e.g., Botezatu, Landrigan, Chen, and Mirman, 2015; Botezatu and Mirman, 2019).
Spoken word recognition requires listeners to map an ambiguous and transient speech signal to a unique lexical form (for reviews, see Magnuson and Crinnion, this volume; Magnuson, Mirman, and Myers, 2013; Mirman, 2016). This mapping involves parallel activation of lexical candidates (i.e., lexical neighbors) that match the input. As the unfolding speech moves toward a point of uniqueness, listeners engage in the resolution of the
lexical competition thus created. Spoken words with many phonological neighbors are recognized more slowly and less accurately than words with few phonological neighbors (e.g., Luce and Large, 2001; Luce and Pisoni, 1998; Magnuson, Dixon, Tanenhaus, and Aslin, 2007). As speakers gain proficiency in an L2, they are tasked with resolving lexical competition not only from within-language competitors but also from cross-language competitors that may be temporarily co-activated (e.g., Bradlow and Pisoni, 1999; Spivey and Marian, 1999). Botezatu et al. (in press, a) reported an effect of L2 experience on the phonological neighborhood density effect in the L1 that was modulated by a set of factors related to L2 proficiency. In this study, native English speakers were learning Chinese as the L2 while immersed in China. The study used a spoken-to-written word matching task in which participants identified spoken English words presented in noise. The words varied in phonological neighborhood density and were selected such that they would not overlap phonologically with Chinese. Non-native language proficiency was assessed using a range of production measures (i.e., discourse fluency, picture naming, and semantic and phonemic fluency). Botezatu et al. reported reduced L1 access during immersion in Chinese that was associated not only with a slower English discourse speech rate and slower and less accurate English picture naming performance but also with a reduction in the magnitude of the phonological neighborhood density effect in English spoken word recognition. Botezatu, Kroll, Trachsel, and Guo (in press, b) further reported that more fluent L1 and L2 discourse production in native English speakers is associated with more efficient resolution of competition among phonological neighbors in English. The results show that variation in language experience affects the efficiency of lexical retrieval. To go a step further, we can ask at a finer grain whether L2 experience differentially impacts the processing of onset (cohort) neighbors and offset (rhyme) neighbors in L1 spoken word recognition. Past research has shown that the sequential nature of spoken word recognition causes onset neighbors (e.g., CALM, CARD) to be activated more strongly than offset neighbors (e.g., JAR, FAR; Allopenna, Magnuson, and Tanenhaus, 1998; Magnuson et al., 2007). Chen and Mirman (2012) have shown that strongly active neighbors (i.e., cohort neighbors) have a net inhibitory effect, while weakly active competitors (i.e., rhyme neighbors) have a net facilitative effect. Botezatu, Kroll, Wong, Garcia, and Cheung (in preparation) evaluated the relationship among phonological neighborhood density, competitor type (cohort versus rhyme), and Spanish proficiency in bilinguals who were English-dominant heritage speakers of Spanish. These are individuals who were first exposed to Spanish as the home language and then became dominant in English as the language of the majority community. The results showed that English words from denser cohort neighborhoods are more inhibitory with higher levels of Spanish proficiency, whereas English words from denser rhyme neighborhoods are more facilitatory with higher levels of Spanish proficiency. This pattern of performance was not observed in English monolinguals, who showed inhibitory effects for words in both cohort and rhyme conditions.
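To make these constructs concrete, the sketch below (not drawn from any of the studies cited) shows one common way of operationalizing phonological neighborhood density and the cohort/rhyme distinction: neighbors are lexicon entries that differ from a target by a single phoneme substitution, addition, or deletion; cohort neighbors share the target's onset, and rhyme neighbors share its ending. The mini-lexicon and phoneme transcriptions are hypothetical toy data used for illustration only.

```python
# Toy illustration (not from the studies cited): computing phonological
# neighborhood density and classifying neighbors as cohort vs. rhyme.
# Words are tuples of phoneme symbols; the mini-lexicon below is hypothetical.

LEXICON = {
    "car":  ("k", "ɑ", "r"),
    "card": ("k", "ɑ", "r", "d"),
    "calm": ("k", "ɑ", "m"),
    "jar":  ("dʒ", "ɑ", "r"),
    "far":  ("f", "ɑ", "r"),
    "key":  ("k", "i"),
}

def edit_distance_one(a, b):
    """True if b differs from a by one substitution, addition, or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def neighbors(word):
    """All lexicon entries one phoneme away from `word` (Luce-style neighbors)."""
    target = LEXICON[word]
    return [w for w, p in LEXICON.items() if w != word and edit_distance_one(target, p)]

def classify(word, neighbor):
    """Cohort neighbors share the onset; rhyme neighbors share the final portion."""
    t, n = LEXICON[word], LEXICON[neighbor]
    if t[0] == n[0]:
        return "cohort"
    if t[-2:] == n[-2:]:
        return "rhyme"
    return "other"

for w in LEXICON:
    ns = neighbors(w)
    print(f"{w}: density={len(ns)}, neighbors={{n: classify(w, n) for n in ns}}".format())
    # e.g. "car" has density 4: calm/card as cohort neighbors, jar/far as rhyme neighbors
```

Run over a real phonological lexicon rather than this toy set, the same logic yields the density counts and cohort/rhyme codings that studies of this kind manipulate or covary.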
The studies we have reviewed in this section are recent, with many implications that have yet to be examined. Taken together, they demonstrate that the act of learning and
using an L2 comes to have consequences for lexical processing in the native language that are specific to the form of L2 experience, to the modality of lexical processing, and to variation in L2 proficiency. Counter to the claim that the L2 simply derails the L1 globally, the findings we have reported in this section illustrate a range of phenomena that reflect the dynamics of lexical processing for visual and auditory word recognition in learners at early stages of acquiring an L2 and in more proficient speakers navigating the competition imposed by the presence of the two languages.
28.6 Conclusions
In this chapter, we have reviewed new findings about bilingual lexical processing. Contrary to the traditional view that the goal of L2 learning was to acquire stable representations for the L2 that would enable performance comparable to the L1, the emerging evidence suggests that there is a dynamic interplay between the bilingual's two languages that requires regulation and control. Critically, the data from a range of studies on language processing and on new language learning show that it is the L1 that is regulated, determining the cognitive resources that are recruited to enable a high degree of proficiency in the L2. In the past, we might have hypothesized that these mechanisms would come online exclusively for learners early in the process of acquiring the L2, for whom there is a massive asymmetry in their knowledge of and ability to use the two languages. The new research shows that the processes that are revealed early in learning are, in fact, present for highly proficient bilinguals as well, with fluctuations in the relative access to the two languages depending on whether the context of language use is strongly biased to one or the other language. This work reveals the plasticity of the language system in coping with a changing linguistic and cognitive landscape. Features of lexical processing that were historically considered stable in adulthood, and that were based on models of monolingual or native speaker attainment, have given way to a more fluid account, in which bilinguals and monolinguals alike may change as a function of the cognitive resources that are available and the degree to which the environment supports them. Learning and using a second language is as much about the L1 as the L2. On this account, second language learners and bilinguals become a lens for observing interactions between language processing and cognition, and the neural mechanisms that support them, in ways that are opaque in speakers of one language alone.
Acknowledgments
The writing of this chapter was supported in part by NIH Grant HD082796 and NSF Grants BCS-1535124 and OISE-1545900 to J. F. Kroll, a Washington Research Foundation
Innovation Postdoctoral Fellowship in Neuroengineering to K. Bice, and by a Catalyst Award and a Richard Wallace Faculty Incentive Grant from the University of Missouri, and an Advancing Academic-Research Careers Award from the American Speech-Language-Hearing Association, to M. R. Botezatu.
Chapter 29
Lexical representation and access in sign languages
Rachel I. Mayberry and Beatrijs Wille
29.1 Introduction
The words of spoken languages have been studied for centuries. By contrast, the words of sign languages have been studied for less than one. In the 4th century BCE, signed words, known as signs, were not thought of as words, but rather as the "natural gestures made by deaf people" (Plato, 1892). At the beginning of the 20th century, Long (1910) published one of the earliest collections of signs, with photographs and pantomimic descriptions of how to produce them, such as, "Smell—Move the palm up before the tip of the nose, as if presenting something to be smelled." (p. 30). Another half century would pass before Stokoe, Casterline, and Croneberg (1976/1965) analyzed the dynamic patterning of signs in an effort to create a writing system for a dictionary of American Sign Language (ASL). Their linguistic analyses provided the first evidence that signs are not like pantomime or pictures because they are assembled from articulatory units. This discovery was groundbreaking because it meant that the architecture of signed words is like that of spoken words, characterized by two levels of patterning: articulatory-perceptual units and meaning. The insight broadened our understanding of words, namely that lexemes are characterized by the same architecture independent of the sensory-motor channel through which they are transmitted and received. Thus, when we ask what lexical representation and access look like in sign languages in relation to spoken languages, we need to frame the comparison in terms of lexical structure. Lexical representation and access are guided by the mapping of dynamic articulatory and perceptual patterning onto linguistic structure (Kroll, Bice, Botezatu, and Zirnstein, this volume; Marslen-Wilson, 1989). For this reason, we begin our exploration of lexical representation and access in sign languages with a brief description of the lexical
architecture of signs. Doing so allows us to distinguish sign language lexemes from gestures, co-verbal or otherwise, and sketch what lexical representation and access look like in the visual-manual modality.
29.2 Sign sublexical structure
Every sign consists of a handshape (and palm orientation), movement, and place of articulation on the body, referred to as sign parameters. A key property of sign parameters is that their associated features and phonotactic rules vary across sign languages. Each sign language uses a subset of the universe of possible sign parameters. Take handshape, for example. Handshape refers to the fingers that are used to form the configuration of the hand, that is, which fingers are selected to be extended or closed in a curved or straight fashion, among other features (Brentari, 1998). Some handshapes are common across sign languages, especially those that are easily produced and acquired early by infants (i.e., unmarked handshapes), such as the closed fist or an open hand with extended fingers (Ann, 1996; Boyes-Braem, 1990; Rozele, 2003). Sign movement can be readily described by noting the joints responsible for the movement. Joints more proximal to the torso, such as the shoulder and elbow, produce larger and more visually sonorous movements than joints more distal from the torso, such as the wrist or knuckles (Perlmutter, 1992). Sign movement can be made with a single joint, as in extending the elbow, or with a combination of joint flexions, such as forearm extension with finger closing. Sign handshapes are articulated with varying movements in various locations on the body, primarily the head, torso, and arms, and in neutral signing space in front of the torso. These locations, also referred to as "place of articulation," are modulated by phonotactic constraints arising from the morpho-syntactic context in which the sign participates (Lucas, Bayley, Ross, and Wulf, 2002). Whether a particular parametric feature is contrastive or not varies across sign languages.1
29.2.1 Phonological systems
Evidence that the features of sign parameters form a phonological system comes from the phenomenon of minimal lexical pairs. In minimal pairs, the two lexemes differ from one another by one articulatory feature. For example, the ASL sign FATHER is made with an open hand with the thumb contacting the forehead, while the same handshape contacting the lower side of the face, on the chin, is the sign MOTHER.2 The ASL sign
1 Palm orientation was not originally considered a sign parameter, but subsequent work has found it to be contrastive in some phonemic contexts in ASL.
2 English glosses for ASL signs are given in all caps, as is the convention in sign language research. Many on-line resources are available for various sign languages. The ASL signs noted here can be searched by their English glosses on www.handspeak.com.
RED is made on the chin with a bending index finger, while the sign CUTE is made at the same place with the same movement but with a different handshape, namely the index and middle fingers (Klima and Bellugi, 1979). The analysis of minimal pairs is a useful tool for analyzing the phonological structure of a sign language (Morgan, 2017). Computational analyses suggest that the ratio of minimal pairs to the size of the lexicon may be similar for signed and spoken languages (Kaplan and Morgan, 2017). As is the case for spoken phonology, some combinations of phonological features are permissible in a given sign language but semantically empty, that is, nonce signs, while other combinations are ungrammatical because they violate the phonological structure of the given sign language.
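As a toy illustration of the kind of computational analysis mentioned above, the sketch below represents signs as bundles of the major parameters and identifies minimal pairs as lexemes that differ on exactly one parameter. The parameter labels and the four-sign mini-lexicon are simplified, hypothetical descriptions for illustration, not a phonological analysis of ASL.

```python
# Toy sketch: signs as bundles of major parameters, with minimal pairs
# defined as lexemes differing on exactly one parameter.

from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Sign:
    gloss: str          # English gloss, in caps by convention
    handshape: str
    location: str
    movement: str
    orientation: str

PARAMETERS = ("handshape", "location", "movement", "orientation")

LEXICON = [
    Sign("FATHER", "open-5", "forehead", "contact", "palm-left"),
    Sign("MOTHER", "open-5", "chin",     "contact", "palm-left"),
    Sign("RED",    "1",      "chin",     "bend",    "palm-in"),
    Sign("CUTE",   "U",      "chin",     "bend",    "palm-in"),
]

def differing_parameters(a, b):
    """Parameters on which two signs differ."""
    return [p for p in PARAMETERS if getattr(a, p) != getattr(b, p)]

minimal_pairs = [
    (a.gloss, b.gloss, differing_parameters(a, b)[0])
    for a, b in combinations(LEXICON, 2)
    if len(differing_parameters(a, b)) == 1
]

print(minimal_pairs)
# e.g. [('FATHER', 'MOTHER', 'location'), ('RED', 'CUTE', 'handshape')]

# A crude analogue of the lexicon-level ratio reported in computational analyses:
print(f"minimal pairs per lexical entry: {len(minimal_pairs) / len(LEXICON):.2f}")
```

With a full, featurally coded sign lexicon in place of this toy set, the same pair-counting logic is what underlies comparisons of minimal-pair ratios across signed and spoken lexicons.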
29.2.2 Symmetry and dominance
Unlike the articulation of words with one vocal tract, signs could, in principle, be produced with two separate articulators, the two hands and arms. Although a majority of lexemes in ASL are articulated with one hand, the dominant one, many other signs are made with two hands. Early work on ASL discovered that the two hands and arms work together in a rule-governed fashion as a single articulator. Phonological rules specify how the two arms and hands and ten fingers interact to articulate single lexemes. For example, if both hands move, then their movement is symmetrical or alternating and their handshapes are mirror images. If only one hand moves, then the non-dominant one takes the location parameter and its handshape is limited to a small set of alternatives (Battison, 1978). The symmetry and dominance principle of sign articulation has been observed across a number of sign languages, including newly emerged ones (Morgan and Mayberry, 2012). The phenomenon clearly demonstrates how fine motor and perceptual patterning evolve to form the articulatory contrasts that underlie a phonological system (Frishberg, 1975).
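The rule-governed character of two-handed signs can likewise be sketched as a simple well-formedness check. The fragment below loosely encodes the symmetry and dominance conditions just described; the feature representation and the particular set of "unmarked" handshapes are illustrative assumptions, not a full phonological model.

```python
# Toy well-formedness check for two-handed signs, loosely following the
# symmetry and dominance conditions described above (after Battison, 1978).
# The feature representation and the "unmarked" handshape set are simplified
# assumptions for illustration.

UNMARKED_HANDSHAPES = {"B", "A", "S", "C", "O", "1", "5"}  # illustrative set only

def well_formed_two_handed(dominant, nondominant):
    """Each hand is a dict with a 'handshape' label and a 'moves' flag.

    Symmetry condition: if both hands move, their handshapes must match
    (the symmetrical-versus-alternating movement detail is not modeled here).
    Dominance condition: if only the dominant hand moves, the non-dominant
    hand must either match its handshape or bear an unmarked handshape.
    """
    if dominant["moves"] and nondominant["moves"]:
        return dominant["handshape"] == nondominant["handshape"]
    if dominant["moves"] and not nondominant["moves"]:
        return (nondominant["handshape"] in UNMARKED_HANDSHAPES
                or nondominant["handshape"] == dominant["handshape"])
    return False  # forms in which only the non-dominant hand moves are disallowed

# Both hands move with matching handshapes: permitted by the symmetry condition.
print(well_formed_two_handed({"handshape": "5", "moves": True},
                             {"handshape": "5", "moves": True}))   # True

# Only the dominant hand moves over a marked, mismatching non-dominant handshape:
# ruled out by the dominance condition.
print(well_formed_two_handed({"handshape": "V", "moves": True},
                             {"handshape": "R", "moves": False}))  # False
```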
29.2.3 Sign segments and prosody
The sublexical structure of signs, specifically the parameters of handshape, movement, and location, was originally thought to be produced simultaneously, unlike the sequential articulation of spoken phonemes. However, later work showed that the beginning and ending locations of sign movement are articulated sequentially, one after the other, and importantly form segments that can be contrastive (Liddell, 1984). Just as the acoustic energy of spoken phonemes ebbs and flows, with vowels being more sonorous than consonants, so too the parameters of signs vary in sonorous weight, with movement being significantly more sonorous than either the location or handshape parameters. Vowels form the nucleus of spoken syllables (Goldsmith, 2011) and movement forms the nucleus of sign segments (Brentari, 1998; Sandler, 1986). Secondary movements, such as those made with the fingers and wrist, co-occur with primary movement segments in
highly rule-governed ways (Perlmutter, 1992). Importantly, the timing of sign segments appears to be similar to that of spoken syllables and consistent across languages (Berent, Bat-El, Brentari, Dupuis, and Vaknin-Nusbaum, 2016; Wilbur and Nolen, 1986). Evidence of a timing constraint on the articulation of lexical structure suggests that cognitive and neural factors play a shaping role in the evolution of linguistic structure (Klima and Bellugi, 1979).
29.2.4 Morphology
The structure of ASL, and of other sign languages studied to date, is characterized by a rich and complex morphological system. Sign morphemes more often appear as meaning units within a lexeme rather than appended to it. Derivational and inflectional morphologies are marked by changes in single or multiple features of sublexical structure. For example, nouns are derived from some verbs in ASL by a doubling of the movement feature (Supalla and Newport, 1978). Some verbs indicate subject and object by contrasts in beginning or ending locations, palm-orientation features, or both (Padden, 1988). Changes in the handshape feature mark pronoun case, while changes in pronoun movement mark number (Lillo-Martin and Meier, 2011). Signs with related meanings often form clusters of signs, or lexical families, with overlapping sublexical features. For example, the ASL signs FAMILY, CLASS, and GROUP share the same location and movement but vary in handshape. As another example, the signs DREAM, IMAGINE, and WONDER share the same location (Fernald and Napoli, 2001). An example of a highly productive morphological process observed in many sign languages is numeral incorporation. The handshapes of number signs become the handshape of base unit signs, such as those for time, calendric terms, currency, or age, among others (Sagara and Zeshan, 2013). However, the way in which number signs are combined into the structure of the base sign is highly rule-governed and varies across sign languages. Not all number signs can be incorporated into all base signs, and not all base signs can take numeral signs. These phonological rules vary as a function of the phonological system of the given sign language. For example, number signs in Russian Sign Language (RSL) are two-handed but are one-handed in ASL. Despite being formed with two hands, number signs in RSL are merged with base unit signs even when the base unit is also made with two hands. The result is highly complex phono-morphological forms (Semushina and Mayberry, 2019).
29.2.5 Signs versus gestures
Although signs and gestures are expressed and understood using the same sensory-motor modality, they can be readily distinguished by the presence or absence of sublexical structure. Gestures are mostly free to vary in motoric form, created de novo, and primarily accompany speech (Goldin-Meadow, 2003). By contrast, the structure of
signs is unrelated to speech and, crucially, varies in linguistically rule-governed ways. This is so despite the fact that sign languages emerge from gesture when it becomes the primary means of interpersonal communication (Goldin-Meadow, 2005; Goldin-Meadow, Brentari, Coppola, Horton, and Senghas, 2015; Senghas and Coppola, 2001; Senghas, Kita, and Özyürek, 2004).
29.3 The sign lexicon
In ASL and British Sign Language (BSL), signs with the phonological structure described above are referred to as the core lexicon (Sutton-Spence and Woll, 1999). The ASL lexicon also consists of additional lexical forms, namely fingerspelling and classifier constructions (Brentari and Padden, 2001). Fingerspelling is a manual representation of orthography, in which each letter of the alphabet is represented by a single handshape in ASL or French Sign Language (LSF); it has been documented as a means of educating deaf children as early as the 1500s in Spain (Eriksson, 1998). ASL uses a one-handed alphabet derived from the one-handed alphabet used in old LSF, which played a significant role in the emergence of ASL (Van Cleve and Crouch, 2002). BSL, which is unrelated to ASL, uses a two-handed alphabet, reflecting the different historical roots of the two sign languages. The articulatory form of fingerspelled words also differs radically from that of signs. Multiple handshapes are produced sequentially to create a single word, one handshape for each letter of the alphabet, as the word would be written. This structure clearly violates the phonological structure of signs, which can take only one to two handshapes. Not all sign languages use fingerspelling, however.3 The frequency with which fingerspelled words are used varies widely across sign languages for reasons that are only partially understood. Unlike fingerspelling, classifier constructions are poly-morphemic and represent semantic categories, such as people or how objects are handled (Supalla, 1986; Zwitserlood, 2012). The lexicons of sign languages, like those of spoken languages, are dynamic, with new lexemes appearing regularly. Both fingerspelling and classifier constructions are rich, but not the sole, sources of new additions to the core lexicon of ASL. Compounding, for example, is another means of new sign creation. Fingerspelled loan words are core signs that began as fingerspelled words and have undergone a lexicalization process in ASL, such as JOB or DOG. Their original fingerspelled forms became submerged as articulatory constraints pressured manual motor patterns to conform to the canonical phonological structure of ASL signs (Battison, 1978). Frequently used classifier constructions can also undergo a lexicalization process; their morphemes become 'frozen' or 'fossilized' as they become lexicalized and mono-morphemic, such as the ASL signs
3 For example, Kata Kolok, a village sign language of Bali.
SUITCASE or IRON (Hwang, Tomita, Morgan, Ergin, Ilkbasaran, Seegers, Lepic, and Padden, 2016; McDonald, 1982).
29.4 The psychological reality of sign structure
Most of what we know about lexical representation and access comes from studies investigating signs from the core lexicon. Early research showed that phonological structure is an integral part of lexical representation and access in ASL. Signers make mistakes during spontaneous production known as "slips of the hand," which are re-arrangements of the articulatory features of adjacent signs (Klima and Bellugi, 1979). Signers exhibit a "tip of the finger" phenomenon when retrieving only part of a sign's phonological structure in near recall (Thompson, Emmorey, and Gollan, 2005). Signers make lexical errors in working memory based in phonological structure rather than meaning, showing that phonological structure mediates working memory for signed words, just as it does for spoken words. This discovery was made when signers were asked to write in another language, English, the signs they remembered from a list of videotaped ASL signs (Bellugi, Klima, and Siple, 1975). The organizational function of the phonological structure of ASL signs has been observed during direct cortical stimulation of the brain language areas in an epileptic signer, in the paraphasic errors of aphasic signers, and in the expressive errors of signers with Parkinson's disease (Brentari, Poizner, and Kegl, 1995; Corina, McBurney, Dodrill, Hinshaw, Brinkley, and Ojemann, 1999; Poizner, Klima, and Bellugi, 1990). These phenomena provide evidence that the phonological structure of signs mediates visual-motor patterning and lexical meaning in the mental lexicon. Psycholinguistic evidence that the architecture of sign structure is characterized by two separable levels of structure is further provided by studies of children's early lexical development and studies of lexical processing in adults.
29.5 Sublexical structure and lexical acquisition
29.5.1 Phonological perception and development
Infants can discriminate prosodic structure in Japanese Sign Language (JSL) (Masataka, 1992). Child-directed JSL is characterized by slower and exaggerated modulation of sign movement, akin to the exaggerated pitch modulations of child-directed speech, corroborating the hypothesis that the movement parameter carries the sonorous weight
of sign segments. Infants show a strong preference for child-directed as compared with adult-directed JSL before they can express any words (Masataka, 1996). ASL-exposed infants can discriminate phonological structure in ASL signs (Baker, Idsardi, Golinkoff, and Petitto, 2005) and babble manually with features of sign phonological structure (Petitto, Holowka, Sergio, Levy, and Ostry, 2004; Petitto and Marentette, 1991). Moreover, the phenomenon of perceptual attunement observed in hearing infants, that is, the narrowing of the ability to discriminate the articulatory features of all languages over the first year of life to just those features of the ambient language (Kuhl, Williams, Lacerda, Stevens, and Lindblom, 1992; Swingley, this volume; Werker and Tees, 1984), has also been observed in sign language. For example, hearing infants show the ability to discriminate handshape at 4 months of age, but this ability disappears by 14 months because they have not been exposed to sign language. By contrast, ASL-exposed infants are able to discriminate handshape at 14 months of age (Palmer, Fais, Golinkoff, and Werker, 2012). Although they show variability in the onset, rate, and size of vocabulary acquisition as a function of age, infants exposed to ASL or BSL follow a trajectory of vocabulary development comparable in timing and sequence to that observed for English-speaking children (Anderson and Reilly, 2002; Fenson, Dale, Reznick, Bates, Thal, Pethick, Tomasello, Mervis, and Stiles, 1994; Woolfe, Herman, Roy, and Woll, 2010). Just like children acquiring many spoken languages, young signers across sign languages initially learn the signs for objects and people. Later, they acquire signs for descriptors and actions. The ratio of nouns to verbs in young signing children's vocabulary development decreases as they acquire grammatical morphology (Anderson and Reilly, 2002; Hoiting, 2006; Rinaldi, Caselli, DiRenzo, Gulli, and Volterra, 2014). Young children's early sign productions are characterized by phonological errors explained in part by their developing motor control (Meier, Mauk, Mirus, and Conlin, 1998). Motor control develops beginning with the torso, arms, and whole hand and continuing to the fingers, and this early control of the torso and arms may explain why young children appear to articulate the location parameter more accurately early in language development relative to their later developing accurate articulation of sign movement and handshape (Boyes-Braem, 1990; Cheek, Cormier, Rathmann, Repp, and Meier, 1989; Conlin, Mirus, Mauk, and Meier, 2000; Marentette and Mayberry, 2000; Morgan, Barrett-Jones, and Stoneham, 2007). This shows that children's development of sign vocabulary interacts with their phonological development in addition to their conceptual and morpho-syntactic development (Anderson and Reilly, 1997; Devescovi, Caselli, Marchione, Pasqualetti, Reilly, and Bates, 2005; Reilly and Bellugi, 1996; Reilly, McIntire, and Bellugi, 1990; Schick, De Villiers, De Villiers, and Hoffmeister, 2007). Phonological development in sign language, measured by how accurately children can imitate pseudo-signs of varying phonological complexity, correlates with their language skills in BSL (Mann, Marshall, Mason, and Morgan, 2010). The semantic organization of school-aged children's vocabulary in BSL is similar to their hearing peers' semantic organization of spoken English vocabulary (Marshall, Rowley, Mason, Herman, and Morgan, 2013).
29.5.2 Modality effects on lexical representation and access
Despite the numerous similarities between the lexical architecture and acquisition of signed and spoken words, two sources of possible differences have been proposed. One potential difference arising from the visual-manual channel is the hypothesis that iconicity facilitates the access of signs in adults and vocabulary learning in children. The other potential source of difference between the lexical representation and access of signed versus spoken words arises from the child's insufficient access to language, in either a visual or auditory form, during early life.
29.5.2.1 Iconicity and lexical access
Iconicity refers to a motivated relation between form and meaning and is often defined in opposition to an arbitrary relationship. In semiotic theory, the relation between form and meaning is characterized as being motivated for icons but arbitrary for symbols (Peirce, 1955). Under this definition, the relation between form and meaning is arbitrary for words because they are symbols, and this arbitrariness has been a tenet of linguistics since the early 1900s (Saussure, 1916/1966). Another early observation, however, was that people tend to associate nonce words containing high vowels with pictures of spiky things and low vowels with pictures of roundish things (Sapir, 1929). This demonstrates that the sound patterns of words can evoke specific and similar meanings across individuals. The phenomenon has been replicated often in adults and even in children as young as two and a half (Maurer, Pathman, and Mondloch, 2006; Westbury, 2005) and has been proposed to be a possible mechanism of language evolution (Ramachandran and Hubbard, 2001) by providing a link between perceptual experience and sound patterning. Contrary to the long-standing tradition that the forms of words and their meanings must be arbitrary, there are numerous examples of spoken words where this proposal does not hold. In onomatopoeia, the word-form is related to meaning, albeit within the phonological structure of the given language. For example, the forms of English words used by adults with children for the sounds animals make, such as meow, moo, woof-woof, or cock-a-doodle-doo, might be perceived as being similar to the animal sound itself. Many spoken languages in Africa, South India, and Australia, as well as Korean, Vietnamese, and Japanese, among others, have large inventories of sound symbolic words, or ideophones, where the phonological form of the word iconically represents perceptual aspects of events (Nuckolls, 1999). These examples show that iconicity can and does characterize the form-meaning relationship of some spoken words. Words in the visual-manual modality have been described as being more iconic than those in the auditory-vocal modality. For example, the ASL sign EAT is made at the mouth with a hand with closed fingers. However, as Klima and Bellugi (1979) argued in detail, iconicity is not an all or none phenomenon. Instead, there are gradations of iconicity depending upon how it is defined. For instance, can the sign meaning be
guessed accurately by looking at the unknown sign without context? This is what Klima and Bellugi called "transparency," and it characterizes only a low proportion of ASL vocabulary. Another form of iconicity is the perception of a motivated relation between the phonological form of a sign and its meaning, but on a post-hoc basis, only after both the form and meaning of the sign are already known—what Klima and Bellugi called "translucency." Experimentally controlling for the multitude of factors well known to influence lexical representation and access while investigating the possible effects of iconicity is extremely tricky business, because metrics for lexical frequency, age of lexical acquisition, concreteness and abstractness, phonological complexity, neighborhood density, lexical category and type, and so forth, are not readily available for sign languages. Given this state of affairs, it is unsurprising that studies asking whether iconicity affects lexical representation and access in sign languages have produced contradictory results. In Lingua Italiana dei Segni (LIS), phonological similarity between signs facilitated picture naming, as did iconic signs, when iconicity was operationalized by subjective judgments by hearing speakers of Italian (Navarrete, Peressotti, Lerose, and Miozzo, 2017). In BSL, sign iconicity facilitated response time on picture-sign matching and phonological decision tasks, along with production time in picture naming, but primarily for later acquired signs (Vinson, Thompson, Skinner, and Vigliocco, 2015). However, when adult signers of ASL performed a lexical decision task, sign meaning rather than iconicity facilitated recognition time (Bosworth and Emmorey, 2010). Iconicity does not show effects on the neural processing of signs (Emmorey, Grabowski, McCullough, Damasio, Ponto, Hichwa, and Bellugi, 2004). The disparate reported effects of iconicity on lexical access and production across sign languages are likely due to the fact that iconicity is not an objective feature of signs, like phonological complexity or frequency, which are enumerable, but rather a perceived relationship between form and meaning. And perception of lexical form and meaning is a product of prior experience, especially linguistic experience. This is illustrated by cross-linguistic comparisons of subjective iconicity ratings. For example, deaf signers rated signs from their own language, Deutsche Gebärdensprache (DGS), as being significantly more iconic compared with those from a foreign sign language they did not know, ASL. Likewise, deaf ASL signers rated signs from their own language as being significantly more iconic than those from DGS, a sign language foreign to them (Occhino, Anible, Wilkinson, and Morford, 2017). Studies asking if iconicity affects children's vocabulary development also yield somewhat contradictory results that can be explained by Klima and Bellugi's iconicity continuum and linguistic experience. A longitudinal study of deaf children's ASL vocabulary development uncovered no effects of iconicity (Orlansky and Bonvillian, 1985). By contrast, the subjective iconicity ratings of adult deaf signers correlated with sign vocabulary development in children learning BSL, but primarily when they were older, not younger (Thompson, Vinson, Woll, and Vigliocco, 2012). Here, sign vocabulary was measured with the BSL version of the MacArthur-Bates Communicative Development Inventory, the CDI (Woolfe et al., 2010).
The fact that children with greater language and cognitive abilities show effects of sign “translucency” whereas younger children do not,
demonstrates that iconicity effects on the lexical learning of signs arise from linguistic experience and a developing meta-linguistic awareness of the dual structure of signs, form versus meaning. Consistent with hearing children's acquisition of spoken words, a larger study of children using the ASL version of the CDI (Anderson and Reilly, 2002) found that neighborhood density and lexical frequency were also important factors in children's acquisition of signs (Caselli and Pyers, 2017). Young signers of Turkish Sign Language, TİD, have been observed to use more verbal than nominal classifier constructions compared with adult signers, which has been interpreted as showing the effects of iconicity on TİD vocabulary learning (Ortega, Sümer, and Özyürek, 2017). However, a study of the ASL development of similarly aged children suggests that this pattern may arise from the later acquisition of nominal versus verbal morphology (Brentari, Coppola, Jung, and Goldin-Meadow, 2013). Thus, iconicity, or translucency, is evident in the lexicon of sign languages but is not a fixed characteristic of signs; rather, it is a perceived relationship between form and meaning that is heavily grounded in the linguistic experience of the perceiver.
29.5.2.2 Visual learning and language input
Children with normal hearing learning a spoken language do not have to look at the speaker to hear what they are saying. By contrast, young children learning a sign language must look at the signer or risk missing the message entirely. Thus, hearing children have no need to control visual attention in order to learn spoken words, but deaf children must do so in order to learn signed words. Very young children whose parents sign to them from birth solve the visual engineering problem of using one sensory modality to perceive both language and the visual scene being talked about with the help of their parents. Signing parents have been observed to guide their child's visual attention alternately between the visual scene and the signed utterance to create joint attention (Swisher, 1992; Wille, Van Lierde, and Van Herreweghe, 2019), a central phenomenon of early lexical development (Tomasello and Farrar, 1986). In ASL, for example, 12- to 24-month-old deaf children shift their gaze as often as 13 times/minute between their mother's signing and the visual scene she is talking about. Gaze shifting nearly doubles to 22 times/minute when these sign-exposed children are 42 months of age (Lieberman, Hatrak, and Mayberry, 2014). This early parental scaffolding of joint attention and language within the visual modality produces precocious gaze following in deaf signing infants between the ages of 8 and 20 months compared with age-matched, non-signing hearing infants (Brooks, Singleton, and Meltzoff, 2020). Between the ages of 26 and 53 months, deaf and hearing signing children require progressively less phonological information to recognize signs. As their ASL vocabulary grows, they more quickly shift their gaze away from the sign to the target picture it represents, even before the sign is fully articulated, demonstrating a developing predictive ability to perceive and comprehend signed words with only partial phonological information (MacDonald, LaMarr, Corina, Marchman, and Fernald, 2018). In addition to perceiving both the visual environment and visual language through a single modality, another source of difference between spoken and signed lexical
acquisition is how sensory and motor modalities are employed for the purposes of lexical production and perception. Speakers hear themselves as they speak, meaning that they use the auditory modality both to perceive spoken language and to monitor their production of it, although sensorimotor feedback is clearly involved as well. By contrast, signers use vision to perceive lexical items but not to monitor their production of them, relying instead on sensorimotor and proprioceptive feedback. We know this because signers do not look at their hands when signing, but rather visually focus on the hands and face of the sign interlocutor. This explains why reduced visual feedback marginally affects the signer's articulation but does not render it unintelligible, both experimentally and in the case of Usher's Syndrome, in which the visual field of congenitally deaf individuals shrinks to a pinpoint (Emmorey, Bosworth, and Kraljic, 2009). Whether this sensory-motor split between how signs are perceived in comprehension and monitored during production affects lexical access and its development is currently unknown. The research available to date suggests that the peripheral, sensory-motor channel through which words are expressed and comprehended alters neither the basic architecture of lexical representation nor the processes of lexical access and acquisition. There is, however, another important factor that must be taken into account. The inaccessibility of language from birth can radically reduce the quantity and quality of infants' language development. Most deaf children are born into families who neither know nor use any sign language (Mitchell and Karchmer, 2004). The linguistic environment of deaf infants can be rich and complex (Anderson and Reilly, 2002), significantly reduced (Lu, Jones, and Morgan, 2016), or grossly impoverished (Cheng and Mayberry, 2019; Ferjan Ramirez, Lieberman, and Mayberry, 2013). Children born deaf can acquire sign language as a native first language from their family, as a near-native language from their deaf peers, as a non-native late L2 after having acquired a significant amount of another sign language or spoken language in infancy, or as a very late L1 after having experienced little language in early childhood (Boudreault and Mayberry, 2006; Mayberry and Kluender, 2018; Mayberry, Lock, and Kazmi, 2002). Studies of children's sign language acquisition, like those discussed above, have primarily focused on native deaf learners because their early language environment most closely resembles that of hearing children in speaking families. Studies of lexical processing in adults, by contrast, often investigate the lexical processing of signs in signers who have had varying childhood language experiences.
29.6 Psycholinguistic studies of lexical representation and access
29.6.1 Control factors
The experimental paradigms used to study lexical representation and access in sign language are largely similar to those used to study lexical processing in spoken language.
Adapting these paradigms to sign language is not always straightforward, however. Some studies lack crucial information about the linguistic characteristics of the sign stimuli or the language background of the participants. Currently, large databases of signs do not exist, so that the factor of lexical frequency is difficult to control. Limited data for ASL and BSL do exist, however (Caselli, Sehyr, Cohen-Goldberg, and Emmorey, 2017). Some lexical controls come from subjective ratings (Vinson, Cormier, Denmark, Schembri, and Vigliocco, 2008), which have been found to be comparable to objective frequency ratings from large lexical databases in spoken languages (Mayberry, Hall, and Zvaigzne, 2014). Another important control factor to note in comparing studies of lexical access in sign versus speech is that most studies of the latter use written rather than spoken lexical stimuli, which clearly engages the visual system. The extent to which lexical access in speech is altered when words are experimentally presented in a written form is an important topic beyond the scope of the present chapter. Early studies of lexical access in sign language asked whether phonological structure mediates sign access and, if so, whether the sublexical parameters of signs contribute equally to the process.
29.6.2 Parameter recognition and identification studies
To the naked eye, the production of signs appears to be simultaneous, that is, the phonological structure of signs looks like it is perceptually available in one fell swoop. However, linguistic analyses discovered that sign articulation actually unfolds over time (Liddell and Johnson, 1989; Sandler, 1989). Psycholinguistic studies support this analysis. In gating studies, signers are asked to guess the identity of a stimulus sign presented in brief snippets containing incomplete phonetic information, with the duration of the stimulus sign increasing over successive experimental trials. Deaf native signers recognize the location and handshape parameters of ASL target signs simultaneously and sign movement last (Clark and Grosjean, 1982; Grosjean, 1981; Morford and Carlson, 2011). This finding holds when the target signs are extracted from ASL sentence contexts, but not when the target signs are filmed in a list-like fashion of individual signs, each sign beginning from and ending in the lap. In this case, native deaf signers recognize the location parameter before the handshape parameter, with movement being recognized last (Emmorey and Corina, 1990). This is likely due to the fact that the location parameter of adjacent signs is heavily influenced by the phonotactics of the preceding and following lexical context, in contrast to when there is no linguistic context at all. Sign location is highly constrained by phonotactic rules, such as metathesis, while handshape is more auto-segmental in nature (Sandler, 1986). Supporting this interpretation is the finding that the location parameter is radically compressed in "whispered" sign, where signs are produced in an altered frame of body reference (Wilbur and Schick, 1987). The developmental timing of language experience appears to alter the relative weight, or attentional focus, given to sign parameters during lexical access. On a parameter monitoring task, native deaf and hearing L2 signers recognized location faster than handshape. By contrast, non-native deaf signers, who experienced reduced language in
childhood, patterned differently. They recognized handshape faster than location. They also tended to give gestures in response to a target sign with incomplete phonetic information rather than an actual sign from the ASL lexicon (Morford and Carlson, 2011). This remarkable finding indicates that language experience postponed until after early childhood significantly alters the organization of lexical representation. This finding has been corroborated by an eye-tracking study: native deaf signers used the phonological structure of signs to recognize sign meaning, whereas non-native deaf signers did not (Lieberman, Borovsky, Hatrak, and Mayberry, 2015). Clearly, some underlying aspect of phonological structure is organized via infant language experience and learning that, importantly, transcends the sensory-motor modality of the original linguistic signal (Petitto, Langdon, Stone, Andriola, Kartheiser, and Cochran, 2016).
29.6.3 Parameter similarity judgment studies Similarity judgments have been used to decipher the relative weight signers allocate to the parameters of sign structure during lexical access. When asked to select the two most similar-looking items from four simultaneously presented ASL nonce signs, native deaf signers judged those sharing movement to be more alike than those sharing either handshape or location. Non-native deaf signers weighted sign parameters differently, judging sign pairs sharing handshape as being more alike than signs sharing either movement or location (Hildebrandt and Corina, 2002). Delving more deeply into the movement parameter, native deaf signers discriminated the movement parameter of ASL signs with clusters of features that were orthogonal in spatial dimension to the movement discrimination features used by non-native deaf signers (Poizner, 1981).4 Across studies, a consistent and notable finding is that native deaf signers pattern with hearing L2 signers in sign phonological similarity judgments, whereas non-native deaf signers show unique patterns. In addition to giving prominence to the handshape of signs during lexical access, non-native deaf signers also tend to focus more on the semantic than phonological characteristics of lexical items compared to either native deaf or hearing L2 signers (Hall, Ferreira, and Mayberry, 2012). Comprehension of ASL narratives has been found to improve when non-native L2 learners of ASL are instructed to focus on the face, which allows peripheral visual processes to do more of the perceptual work in lexical access (that is, movement perception), suggesting that knowing exactly where to visually focus on the sign signal for optimal perceptual processing and lexical access may be learned in early life but amenable to instruction at older ages (Bosworth, Stone, and Hwang, 2020). The most straightforward explanation of these findings is that infant language experience has a greater impact on the development of lexical representation than the
sensory-motor modality within which lexemes are sent and received. An equally apparent implication of these findings is that lexical representation, at least at the level of phonology, is supramodal in nature. Models of sign recognition invoking spreading activation along a network of sign phonological features are consistent with this assumption (Caselli and Cohen-Goldberg, 2014). These findings also indicate that the duality of patterning characteristic of lexical architecture, while common across modalities, arises from infant language experience and is not robust to a paucity of language experience during early life. This generalization is supported by perceptual studies of signs.

4 Filming signs in the dark with point-lights on the joints captures sign movement while masking handshape and location.
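The spreading-activation idea mentioned above can be made concrete with a toy fragment: each sign is represented as a bundle of parameter values, and an incoming token activates every entry in proportion to the parameters they share. This is only a schematic sketch under invented assumptions (three hypothetical entries, one value per parameter, equal weights); it is not the Caselli and Cohen-Goldberg (2014) model itself.

    # Each entry is a bundle of (hypothetical) phonological parameter values.
    LEXICON = {
        "MOTHER": {"handshape": "5", "location": "chin", "movement": "contact"},
        "FATHER": {"handshape": "5", "location": "forehead", "movement": "contact"},
        "PLEASE": {"handshape": "B", "location": "chest", "movement": "circular"},
    }

    def activation(token, lexicon, weights=None):
        # Activation of each entry = weighted count of parameters it shares with the token.
        weights = weights or {"handshape": 1.0, "location": 1.0, "movement": 1.0}
        return {sign: sum(w for param, w in weights.items() if feats[param] == token[param])
                for sign, feats in lexicon.items()}

    # A token identical to MOTHER strongly activates MOTHER, partially activates
    # FATHER (shared handshape and movement), and barely activates PLEASE.
    print(activation(LEXICON["MOTHER"], LEXICON))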
29.6.4 Categorical perception studies

Some research has asked whether the phenomenon of categorical perception governs phonemic boundaries in a visual-manual language, as it does for spoken language. Studies find that ASL native deaf signers show more category-like discrimination compared with non-native deaf and hearing L2 signers. As might be predicted, non-native signers showed over-discrimination within the end-points of a location discrimination task (chin vs. chest), or within the category boundaries for a handshape discrimination task, namely finger extension (Morford, Grieve-Smith, MacFarlane, Staley, and Waters, 2008). Likewise, on a handshape feature discrimination task, closed vs. open fingers, [U] vs. [V], non-native deaf signers showed better within-boundary discrimination compared with native deaf signers and hearing L2 signers (Best, Mathur, Miranda, and Lillo-Martin, 2010). These findings suggest that sign exposure produces a more category-like organization of phonemic representation when language is experienced in infancy than when it is first experienced later in life, and thus they suggest one possible mechanism for the atypical patterns of language processing shown by late L1 learners.
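The logic behind these comparisons can be illustrated with a small calculation: the steeper the identification function along a continuum, the more discrimination is predicted to peak at the category boundary and to flatten within categories, whereas a shallower function predicts relatively better within-category discrimination. The continuum steps, slope values, and two-category prediction rule below are illustrative assumptions rather than values from the cited studies.

    import math

    def p_category_a(step, boundary=4.5, slope=2.0):
        # Probability of labeling a continuum step (1-8) as category A (logistic).
        return 1.0 / (1.0 + math.exp(slope * (step - boundary)))

    def predicted_discrimination(step, slope):
        # Adjacent steps are discriminable to the extent that they receive different labels.
        pa = p_category_a(step, slope=slope)
        pb = p_category_a(step + 1, slope=slope)
        return pa * (1 - pb) + (1 - pa) * pb

    for label, slope in [("steep identification (more categorical)", 3.0),
                         ("shallow identification (less categorical)", 0.8)]:
        profile = [round(predicted_discrimination(s, slope), 2) for s in range(1, 8)]
        print(label, profile)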
29.6.5 Lexical decision and identification studies Given the repeated finding that lexical representation in sign is organized by sublexical, or phonological, structure, the results of lexical decision and identification studies would be expected to find corroborating evidence, but this is not entirely the case. For example, in a lexical decision study of BSL, native deaf signers were faster to recognize signs than non-native deaf signers. However, the phonological priming results were ambiguous. Native deaf signers were somewhat faster to recognize signs that shared movement and location, but not when the prime was a nonce sign. By contrast, non-native deaf signers were primed by shared movement regardless of the lexical status of either the prime or target (Dye and Shih, 2004). These findings are consistent with the above-described findings showing non-native deaf signers to have weaker lexical boundaries and show greater discrimination within phonemic boundaries relative to native deaf signers (Best et al., 2010; Morford and Carlson, 2011; Morford et al., 2008).
612 Rachel I. Mayberry and Beatrijs Wille In an experimental paradigm called “sign spotting,” BSL signers were asked to identify and reproduce real signs in the context of a preceding stimulus sign that was either a sign or a nonce sign, the latter being either a possible or an impossible configuration of BSL sign parameters.5 Lexical recognition was faster in the context of sign phonology showing, again, that it mediates lexical access in BSL. Native and non-native deaf signers patterned differently with respect to their reproduction errors, however. BSL signers with the oldest ages of acquisition (> 12 years) tended to make more errors on handshape relative to those who learned BSL at younger ages consistent with the finding that non- native signers give more perceptual weight to the handshape parameter (Orfanidou, Adam, McQueen, and Morgan, 2009).6 As explained above, research into the nature of lexical representation and access in sign languages is hampered by the lack of statistical characterizations of any given sign language’s lexicon. Native deaf signers recognized signs faster than the non-native deaf signers in two experiments that crossed the factors of subjective familiarity ratings of a sign (as a lexical frequency measure) and neighborhood density (by counting the number of signs in dictionaries of Spanish Sign Language (LSE) that shared a given handshape or location). The performance of native and non-native signers diverged as a function of these lexical factors, also consistent with the lexical processing results from studies of other sign languages (Carreiras, Gutiérrez-Sigut, Baquero, and Corina, 2008).
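The dictionary-based neighborhood measure described above amounts to a simple tally, sketched below. The four-entry mini-lexicon and its handshape and location codes are included only for illustration; an actual analysis would run over a full sign dictionary such as the LSE materials used by Carreiras and colleagues.

    # Invented mini-lexicon: sign -> (handshape code, major location).
    LEXICON = {
        "APPLE": ("X", "cheek"),
        "ONION": ("X", "eye"),
        "CANDY": ("1", "cheek"),
        "DRY":   ("X", "chin"),
    }

    def neighborhood_density(target, lexicon):
        # Count other entries sharing the target's handshape, and those sharing its location.
        handshape, location = lexicon[target]
        others = [feats for sign, feats in lexicon.items() if sign != target]
        return {
            "handshape neighbors": sum(1 for h, _ in others if h == handshape),
            "location neighbors": sum(1 for _, l in others if l == location),
        }

    for sign in LEXICON:
        print(sign, neighborhood_density(sign, LEXICON))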
29.6.6 Picture naming and picture-sign matching studies

Both picture naming and picture-sign matching have been used to investigate the degree to which lexical access in sign language is mediated by phonology. When asked to sign the name of pictures overlaid with static drawings of signs that were semantically related, phonologically related, or unrelated to the name of the picture in Catalan Sign Language (LSC), semantically related distractors slowed naming time for all signers, native and non-native alike. Distractors sharing location also tended to slow naming time (Baus, Gutiérrez-Sigut, Quer, and Carreiras, 2008). Neuroimaging studies using a picture-sign matching priming paradigm have discovered that native deaf and hearing L2 signers show neural activation patterns for lexical processing that are remarkably similar to those of native English speakers listening to words (Leonard, Ferjan Ramirez, Torres, Hatrak, Mayberry, and Halgren, 2013). By contrast, individuals who had sparse language experience throughout childhood showed highly atypical neural activation patterns for signs, consistent with the numerous psycholinguistic findings described above of altered lexical representations when language is not experienced in early life (Ferjan Ramirez, Leonard, Torres, Hatrak, Halgren, and Mayberry, 2014; Mayberry, Davenport, Roth, and Halgren, 2018).

5 The authors report creating the non-sign stimuli by changing "multiple" parameters, rather than a single parameter.

6 Unlike other studies where a native signer is defined as someone having had sign language exposure beginning from birth, this study defined native signers as having an age of sign exposure ranging from birth to age 5.
29.7 Bilingual lexical representation and access Finally, confirmatory evidence that the phonological structure of signed words organizes lexical representation and mediates lexical access in the mental lexicon comes from studies of lexical processing in individuals who are bilingual in a signed and written language. Proficient ASL signers who read at or above the 8th grade level recognized printed English word pairs more quickly when the ASL translations of the words were semantically related in ASL. However, the presence of a phonological relationship between the ASL translations, that is, minimal phonological pairs, slowed recognition time for the English words (Kroll et al., this volume; Morford, Wilkinson, Villwock, Pinar, and Kroll, 2011). Clearly, the lexical representation of ASL signs mediates not only lexical access in a signed language, but across the mental lexicon even for multiple languages in one mind. Another study replicated these findings with proficient signers of DGS when reading pairs of printed German words, and demonstrates the universality of the phenomenon (Kubus, Villwock, Morford, and Rathmann, 2014). Similar results have been obtained with hearing ASL-English bilinguals in an eye-tracking study. When engaged in on-line comprehension of spoken English words, these bilinguals activated lexical representations in ASL (Shook and Marian, 2012). In summary, enormous strides have been made during the recent past in describing and understanding the lexical representation and access of signed words. This line of research has clarified the role of sensory-motor modality in these fundamental cognitive processes. Linguistic and psycholinguistic studies of lexical representation and access across sign languages have amply demonstrated that duality of patterning is a hallmark of lexical structure. These studies also indicate that the architecture of lexical representation is a fundamental property of words and unaltered by the sensory and motor channel through which they are sent and received. The dual architecture of lexical representation, phonological and semantic structure, mediates both the comprehension and production of signed words. In addition to deepening our understanding of lexical representation and access vis à vis sensory-motor modality, recent research also shows that the organization of the mental lexicon arises from infant language development, independent of its sensory and motor form. Clearly much more research is required to more fully understand lexical representation and access in sign languages. The future looks promising, however. As we accrue increasingly sophisticated linguistic understanding of the structure of sign languages and as increasingly sophisticated experimental paradigms and tools become available for sign language research, correspondingly
sophisticated advances in our understanding of the nature of the mental lexicon should follow.
Acknowledgments

The preparation of this chapter was supported by NIH grant DC012797 to R. Mayberry and by a postdoctoral fellowship at the University of California San Diego funded by the Belgian American Educational Foundation, BAEF, to B. Wille. We thank Marla Hatrak and an anonymous reviewer for helpful comments on previous versions of this chapter.
Chapter 30
Disorders of Lexical Access and Production

Daniel Mirman and Erica L. Middleton
30.1 Introduction The hallmark of lexical access deficits is that lexical representations are intact but access to those representations is ineffective, inefficient, or inconsistent. The strongest form is observed in individuals with aphasia following left hemisphere stroke, though some degree of lexical access difficulty can occur in a wide variety of neurogenic disorders (traumatic brain injury, brain tumor resection, epilepsy), developmental language disorders (e.g., Nation, 2014; Messer and Dockrell, 2006), and typical aging (e.g., Barresi, Nicholaas, Connor, Obler, and Marting, 2000; Burke and Shafto, 2011). In this chapter we focus on the aphasic presentation of lexical access deficits because it is the most distinctive, well-studied, and clinically relevant. Inconsistent lexical access is most evident when a stimulus is tested multiple times. Across a variety of word comprehension and production tasks, participants with access deficits will sometimes respond correctly and sometimes incorrectly. In addition to this overall inconsistent access, there are several factors that reliably improve performance of individuals with access deficits by facilitating lexical access. These include presentation rate (better performance when there are longer gaps between trials), cueing (better performance when given a phonemic cue), and number and strength of competitors (better performance in the presence of few and unrelated distractors than in the presence of many and semantically related distractors). All of these phenomena stand in stark contrast to “storage deficits” in which the lexical-semantic store itself is impaired. For example, individuals with semantic dementia (SD; also known as the semantic variant of primary progressive aphasia, svPPA) exhibit highly consistent performance for a given stimulus, with little to no influence of presentation rate, cueing, or number and strength of competitors (for a comprehensive review of how access deficits differ from storage deficits see Mirman and Britt, 2014). In the domain of storage deficits, there has been substantial convergence between behavioral studies, neuroimaging studies, and computational implementations of
616 Daniel Mirman and Erica L. Middleton theories, leading to elegant and comprehensive neurocomputational models of how lexical-semantic representations are stored (e.g., Chen, Lambon Ralph, and Rogers, 2017; Lambon Ralph, Jefferies, Patterson, and Rogers, 2017). In contrast, lexical “access” can refer to any combination of activation, retrieval, and selection processes, which implicates three broad sets of cognitive processes: processes specific to the language system, cognitive control processes, and memory processes. Since lexical representations are inherently language-specific, lexical activation is generally considered a language- specific process. Cognitive control processes include inhibition of lexical competitors and selection among lexical competitors, and there is some disagreement about whether these are domain-general cognitive control or language-specific control processes (see below). In the context of access deficits, the key memory process is retrieval and it is particularly important for word production where retrieval failure produces anomia—one of the most common complaints and treatment targets in aphasia. In access deficits, the representation itself is thought to be intact, but individuals seem to have impaired retrieval or activation of the correct representation and/or impaired management of the co-activation of various related representations. A key challenge is that theories and models of language, cognitive control, and memory have developed largely independently, which makes it difficult to explain phenomena that occur at the intersection of these systems. With that in mind, the next two sections discuss these intersections, how lexical access disorders shed light on them, and how theories or models can influence treatment efforts. Section 30.2 considers how the cognitive control related to lexical competition—inhibition of competitors and selection among competitors—may explain some lexical access deficit phenomena in word comprehension and production, as well as some key gaps in these accounts (for discussion of competition in the context of other lexical access processes, for word comprehension see Magnuson and Crinnion, this volume, and Rodd, this volume; and for word production processes see Kilbourn-Ceron and Goldrick, this volume, and Nozari, this volume). Section 30.3 considers learning and retrieval processes in lexical access, focusing primarily on word production for two reasons: (1) this has been the primary context of basic research on the link between retrieval and lexical access deficits; (2) properties of learning and retrieval have particular relevance to aphasia treatment, which more frequently focuses on production than comprehension, and provides a unique opportunity for both basic research and translational application.
30.2 Cognitive control and lexical access

30.2.1 Inhibition, competition, and selection

Parallel activation is a core aspect of lexical processing: related or similar lexical representations are co-activated during processing. In single word recognition tasks,
Disorders of Lexical Access And Production 617 phonological and orthographic similarity are the primary forces driving activation; in word production tasks, semantic or visual similarity is the primary driving force. As activation cascades through the system, these (and other) levels of processing interact to determine which lexical representations will be active and the time course of that activity (e.g., Chen and Mirman, 2015). This co-activation is generally thought to be resolved through some form of competition or inhibition among the activated representations. That is, the input activates a set of lexical candidates that then compete for recognition or production. This broad perspective is consistent with the long-standing interactive activation and competition (IAC) framework (e.g., McClelland and Rumelhart, 1981; McClelland, Mirman, Bolger, and Khaitan, 2014) and the large body of computational modeling work built on that framework. The IAC framework and related specific computational models can account for a wide variety of phenomena related to lexical access, including effects of lexical neighborhood density (e.g., Chen and Mirman, 2012) and the precise time course of lexical competition during spoken word recognition (e.g., Allopenna, Magnuson, and Tanenhaus, 1998). The notion of abnormal inhibition, competition, or selection dynamics offers an intuitively compelling account of such access deficit phenomena as effects of presentation rate and number of distractors. There have been some computational implementations that have examined effects of presented distractors (e.g., Mirman, Yee, Blumstein, and Magnuson, 2011) as well as of activation and inhibition dynamics more generally (e.g., Mirman, Britt, and Chen, 2013). Nozari (2020) reported a comparative case study showing a double dissociation between impaired semantically driven lexical activation (patient XR) and impaired inhibition of lexical competitors (patient QD) in word production. However, a satisfactory account of access deficits based on the IAC framework remains elusive. First, lexical competition and inhibition are typically framed (explicitly or implicitly) as occurring between localist representations. That is, the (sometimes unstated) assumption is that there is some discrete representation of lexical item A and another, categorically distinct representation of related lexical item B. Given an input that is partially consistent with both, lexical items A and B are both activated and compete or inhibit one another. This kind of discrete localist representation is a useful abstraction that is sufficient for investigating a wide range of phenomena, but it is not biologically plausible. Neural representations are generally distributed across functional systems, graded, and context-sensitive. Co-activation and competition are emergent properties of distributed representations: activation is initially partially consistent with multiple representations (co-activation); as the activation pattern approaches one specific representation, it necessarily moves away from other patterns/representations, which can be considered a form of competition (e.g., Gaskell and Marslen-Wilson, 1997). Damage to distributed lexical-semantic representations tends to produce storage-type deficits (e.g., Rogers, Lambon Ralph, Garrard et al., 2004). 
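The competitive dynamics just described can be sketched as a minimal localist toy model in which candidate words receive bottom-up input, inhibit one another, and take longer to settle on a winner as competitors become more numerous or stronger, echoing the sensitivity of access deficits to the number and strength of distractors. The parameters and update rule below are arbitrary illustrations, not a specific published implementation.

    def settle(inputs, inhibition=0.2, decay=0.1, rate=0.1, threshold=0.9, max_steps=200):
        # Localist units: each unit's activation rises with its input and is pushed
        # down by the summed activation of the other units (lateral inhibition).
        acts = {word: 0.0 for word in inputs}
        for step in range(1, max_steps + 1):
            new_acts = {}
            for word, act in acts.items():
                lateral = sum(a for other, a in acts.items() if other != word)
                net = inputs[word] - inhibition * lateral - decay * act
                new_acts[word] = min(1.0, max(0.0, act + rate * net))
            acts = new_acts
            winner = max(acts, key=acts.get)
            if acts[winner] >= threshold:
                return winner, step          # word recognized, and how long it took
        return None, max_steps               # failed to settle within the limit

    # One weak competitor vs. several strong ones (input strengths are arbitrary):
    print(settle({"horse": 1.0, "house": 0.3}))
    print(settle({"horse": 1.0, "zebra": 0.8, "donkey": 0.8, "cow": 0.8}))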
Gotts and Plaut (2002) showed that disruption to how activation and inhibition build up (input gain and synaptic depression) can produce some lexical access deficit phenomena, particularly phenomena that have to do with patterns over multiple trials (e.g., effects of presentation rate). However, their
model does not account for effects that occur within individual trials, such as sensitivity to strength and number of competitors (for a detailed review of models of lexical access deficits, see Mirman and Britt, 2014). Therefore, the nature of impaired inhibition, competition, and selection in the context of distributed lexical-semantic representations remains an open question. Second, the standard IAC approach features inhibition within each level of processing, suggesting distinct level-specific (or domain-specific) inhibition systems rather than a more general top-down inhibitory control system. This is not a fundamental assumption of the IAC framework; it just happens to be the typical implementation. Based on the existing evidence, it is not clear to what extent lexical inhibition is related to domain-general inhibitory control, which is part of executive function. If more general inhibitory control is involved, then, at a minimum, access deficits should correlate across levels of language processing. Consistent with this claim, there are reports of individuals with access deficits that span lexical-semantic comprehension and production tasks (Warrington and Cipolotti, 1996; Jefferies and Lambon Ralph, 2006), and extend to syntactic (Novick, Trueswell, and Thompson-Schill, 2005) and non-language inhibitory control tasks (Novick, Kan, Trueswell, and Thompson-Schill, 2009). A role for domain-general inhibitory control is also reflected in an association between resolving lexical-semantic competition during word recognition (as indexed by size of neighborhood density effect) and fluency of narrative language production (Botezatu and Mirman, 2019). There is even evidence that higher levels of anxiety are associated with difficulty resolving lexical-semantic competition, as predicted by a computational model with a domain-general mechanism for resolving competition (Snyder, Hutchison, Nyhus et al., 2010). Lexical access deficits appear to correlate with deficits in non-linguistic executive function tasks such as Raven's Matrices and the Wisconsin Card Sorting Task (Baldo, Dronkers, Wilkins et al., 2005; Jefferies and Lambon Ralph, 2006; Corbett, Jefferies, and Lambon Ralph, 2009a; Corbett, Jefferies, Ehsan, and Lambon Ralph, 2009b). A domain-general disruption of inhibitory control could account for the association between lexical access deficits and (non-linguistic) executive function deficits.1 The degree to which people with aphasia exhibit domain-general inhibitory control deficits, and the possible role of these deficits in rehabilitation, are active areas of research. However, there is also evidence of a distinct (or partially distinct) neural basis for inhibitory control across tasks and lack of cross-task transfer of control (see discussion in Nozari, 2020). Further, the positive evidence of an association between lexical access and domain-general cognitive control must be interpreted with caution. First, the counter-evidence may be largely invisible: it is unknown how many studies found no association between lexical access and domain-general cognitive control because such non-associations tend not to be published. This comes about especially when, as is typical for patient-based research, the sample is small and the lack of association could easily be a result of low power. (A false positive association can also be a result of low power, but publication bias favors false positives over false negatives; see also Gelman, 2019, and Timm, 2019.) Second, the reported association is mostly based on small group studies with participants selected to have broad lexical-semantic deficits after stroke. Larger-scale case series with broader selection criteria tend to find that semantic deficits and executive function deficits dissociate rather than associate (e.g., Schumacher, Halai, and Lambon Ralph, 2019; Mirman and Thye, 2018). It may be useful to distinguish different forms of domain-generality (Nozari and Novick, 2017). The cognitive control involved in lexical access may be "domain-general" in the sense that it relies on a shared set of computational principles (e.g., inhibition/competition, conflict monitoring, and prediction using simulations from a forward model), even if those computations are implemented in separate domain-specific systems. A stronger form of domain-generality would be if the actual implementation of those computations is shared across tasks. Alternatively, a very different form of domain-general cognitive control is conflict adaptation (Duthoo and Notebaert, 2012): reduced interference effects when a high-interference trial is preceded by another high-interference trial rather than a low-interference trial. For example, it is possible that cross-task correlations in competition effects are due to individual differences in ability to leverage available cognitive/neural resources in order to deal with challenging tasks, even if the specific resources and computations are very different across tasks.

1 Though it is also important to consider the opposite causal direction: that lexical access deficits could impair performance on non-linguistic tasks. This is plausible because object naming plays a strategic role in many tasks, including ostensibly non-verbal tasks that do not require naming (e.g., Lupyan, 2012; Lupyan and Mirman, 2013; Lupyan and Clark, 2015).
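Conflict adaptation, as just defined, is typically quantified by computing the interference effect separately according to the type of the preceding trial; adaptation appears as a smaller interference effect following high-interference trials. The sketch below shows that computation on a handful of fabricated reaction times.

    from statistics import mean

    # Fabricated (previous trial type, current trial type, reaction time in ms) triples.
    trials = [
        ("congruent", "congruent", 610), ("congruent", "incongruent", 720),
        ("incongruent", "congruent", 630), ("incongruent", "incongruent", 680),
        ("congruent", "incongruent", 735), ("incongruent", "incongruent", 690),
        ("congruent", "congruent", 605), ("incongruent", "congruent", 640),
    ]

    def interference(after):
        # Incongruent minus congruent mean RT, restricted to trials preceded by `after`.
        rts = {"congruent": [], "incongruent": []}
        for prev, current, rt in trials:
            if prev == after:
                rts[current].append(rt)
        return mean(rts["incongruent"]) - mean(rts["congruent"])

    print("interference after congruent trials:  ", round(interference("congruent")), "ms")
    print("interference after incongruent trials:", round(interference("incongruent")), "ms")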
30.2.2 Neural correlates

There is extensive research on the neural basis of inhibitory control with evidence pointing to inferior frontal regions (inferior frontal gyrus [IFG] or ventrolateral prefrontal cortex [VLPFC]) being particularly important for lexical-semantic selection (e.g., Thompson-Schill, D'Esposito, Aguirre, and Farah, 1997; Thompson-Schill, Swick, Farah et al., 1998; Schnur, Schwartz, Kimberg et al., 2009; Mirman and Graziano, 2013). The computational model developed by Snyder et al. (2010) implemented this by separating lexical-semantic activation (thought to be supported by temporal and parietal cortical regions) from selection (supported by VLPFC), consistent with a broad theory of inhibitory control in which PFC is the source of domain-general inhibitory control modulation of domain-specific representations in temporal and parietal cortices (Munakata, Herd, Chatham et al., 2011). Unfortunately, the lexical access deficit data are not entirely consistent with this elegant framework. Naming pictures that have strong competitors (e.g., gift-present, turtle-tortoise) is slower and less accurate than naming pictures without such strong competitors (Britt, Ferrara, and Mirman, 2016), but that competition effect was no different between participants with IFG damage and those with
damage elsewhere in the left hemisphere. Individuals with lexical access deficits tend to have large lesions encompassing both prefrontal and temporal-parietal regions (e.g., Thompson, Robson, Lambon Ralph, and Jefferies, 2015), and difficult lexical-semantic access is associated with activation in both prefrontal and temporal-parietal regions (e.g., Noonan, Jefferies, Eshan, Garrard, and Lambon Ralph, 2013). In a large sample of neurologically intact adults, the "multiple demand" network that is critical for executive functions did not seem to be critical for sentence comprehension (Diachek, Blank, Siegelman, Affourtit, and Fedorenko, 2020). Therefore, the precise neural mechanisms of lexical access remain unclear.
30.3 Learning and lexical access 30.3.1 Semantic context effects in lexical access Across various domains within psycholinguistic research, there is increased attention to learning mechanisms that interface with the language system. Evidence is amassing that even in adult speakers, the use of language engages learning mechanisms that impact language performance (i.e., use-dependent learning; for discussion, see Dell and Chang, 2014). The notion of use-dependent learning applied to language production means that every speech act confers long-lasting changes to the processes or representations that were engaged to perform the speech act. In word retrieval for speech production, use-dependent learning has most extensively been examined in studies on semantic context effects, which refer to findings of persisting, decreased accessibility of words (i.e., competitors; e.g., cow, zebra, donkey) following retrieval of a semantically-related word (i.e., target; e.g., horse). To illustrate, in studies on blocked cyclic naming (e.g., Damian, Vigliocco, and Levelt, 2001), a set of pictures is repeatedly named in successive cycles, where the set contains items from the same category (i.e., homogeneous condition such as bear, duck, mouse, horse, skunk, cheetah) or from different categories (i.e., mixed condition such as bear, radio, ear, glove, cheese, crib). Across cycles, the homogeneous condition is typically associated with slower naming latencies and/ or enhanced naming error compared to the mixed condition (i.e., semantic context effect; e.g., Rahman and Melinger, 2007; Belke and Meyer, 2005; Damian et al., 2001; Harvey and Schnur, 2015; Schnur, Schwartz, Brecher, and Hodgson, 2006). It is generally accepted that the decreased accessibility of competitors involves learning as opposed to arising from short-lived, interfering activation from prior naming trials of related words. Evidence for this includes observations that the interleaving of filler trials between target trials does not eliminate or diminish the semantic context effect (Damian and Als, 2005; Navarette, Del Prato, and Mahon, 2012), and that the magnitude of the semantic context effect grows across cycles in some circumstances (for review and discussion, see Belke and Stielow, 2013).
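A use-dependent learning account of this pattern can be sketched in a few lines: each act of naming strengthens the named word's mapping from its semantic category and weakens its same-category rivals, so a homogeneous block stays difficult across cycles while a mixed block benefits from repetition, and the difference between them grows. The toy below is a deliberate simplification in the spirit of the learning-based accounts discussed in this section, with arbitrary parameters and an ad hoc interference index; it does not reimplement any specific published model.

    CATEGORY = {"bear": "animal", "duck": "animal", "mouse": "animal",
                "radio": "object", "glove": "object", "crib": "object"}

    def run_block(naming_order, cycles=4, boost=0.10, penalty=0.05):
        # Connection strength from the semantic system to each word; naming a target
        # strengthens its own mapping and weakens same-category rivals.
        strength = {word: 1.0 for word in naming_order}
        interference_by_cycle = []
        for _ in range(cycles):
            scores = []
            for target in naming_order:
                rivals = [w for w in naming_order
                          if w != target and CATEGORY[w] == CATEGORY[target]]
                # Ad hoc difficulty index: rival strength minus target strength.
                scores.append(sum(strength[r] for r in rivals) - strength[target])
                strength[target] += boost
                for r in rivals:
                    strength[r] = max(0.0, strength[r] - penalty)
            interference_by_cycle.append(round(sum(scores) / len(scores), 2))
        return interference_by_cycle

    print("homogeneous block:", run_block(["bear", "duck", "mouse"]))
    print("mixed block:      ", run_block(["bear", "radio", "glove"]))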
Disorders of Lexical Access And Production 621 Semantic context effects in word retrieval have also been examined in the continuous naming paradigm (e.g., Howard, Nickels, Coltheart, Cole-Virtue, 2006). In continuous naming, several members from each of several semantic categories are presented in a pseudorandom, intermixed fashion in a long list of naming trials and specific items do not repeat. In this design, the semantic context effect manifests as a cumulative increase in naming latency (typically by a fixed amount, e.g., 20 ms) with each additional same- category exemplar (e.g., Howard et al., 2006; Navarrete, Mahon, and Caramazza, 2010; Schnur, 2014). Generally, the amount by which naming latency increases per category member is not affected by the number of intervening other-category trials when the number is small (e.g., from 2 to 8; Schnur, 2014), which favors a learning explanation rather than one that appeals to residual, interfering activation from prior related trials. In Schnur (2014), cumulative interference even persisted across larger numbers of unrelated intervening trials (from 8 to 14) if short lag (2) trials were included, suggesting a role for attention or some other biasing mechanism that enhances learning in contexts where semantic relationships are salient. Various learning mechanisms that have been proposed and formalized in computational models to explain semantic context effects in word retrieval include weakened connections between competitor words and semantic representations following target retrieval (Oppenheim, Dell, and Schwartz, 2010), increased competition from previously strengthened targets (Howard et al., 2006), and conceptual bias against competitor concepts (Roelofs, 2018). Though most blocked cyclic naming and continuous naming studies have involved neurotypical speakers, semantic context effects have also been reported in people with aphasia (PWA) using these paradigms (e.g., Harvey, Traut, and Middleton, 2019; Hsiao, Schwartz, Schnur, and Dell, 2009; McCarthy and Kartsounis, 2000; Ries, Karzmark, Navarrete, Knight, and Dronkers, 2015; Scott and Wilshire, 2010; Wilshire and McCarthy, 2002; Schnur et al., 2006). Studies with PWA have shown that the magnitude of semantic context effects in blocked cyclic naming is insensitive to manipulations of the time between naming trials (e.g., Schnur et al., 2006; Hsiao et al., 2009), pointing to a role for learning as with neurotypical speakers. Some studies have examined semantic context effects in PWA using the continuous naming paradigm and have reported cumulative interference effects with successive same-category naming in latencies (Ries et al., 2015) and errors (Harvey et al., 2019). Harvey et al. found that lexical-stage errors, specifically semantic errors (e.g., zebra for horse) but not other types of naming errors (e.g., phonological errors), accumulated with successive same- category naming, providing direct evidence of a use-dependent learning system that uniquely impacts the mapping between semantic representations and words. The literature on semantic context effects has predominantly focused on the deleterious consequences of target retrieval on subsequent retrieval of a semantically related word. However, blocked cyclic naming affords examination of a complementary effect, specifically, word retrieval facilitation. 
In blocked cyclic naming, naming is increasingly facilitated (e.g., faster) over cycles because specific items are repeatedly named, an effect generally more apparent in the mixed compared to the homogeneous condition (e.g., Schnur et al., 2006). Such facilitation is a form of repetition priming, or a
benefit that accrues to a word's accessibility from prior presentations or retrievals of that word. Repetition priming is understood to reflect learning rather than transient boosts in retrievability because of observations of long-lasting repetition priming effects (e.g., Mitchell and Brown, 1988; Wheeldon and Monsell, 1992). As we review in the next section, the manner in which a word is processed can have radically different effects on its future retrievability.
30.3.2 Use-dependent learning in the clinic: retrieval practice and errorless learning naming treatment in aphasia As mentioned at the start of this chapter, improved naming when provided with a cue is a standard indicator of impaired lexical access because cues facilitate access rather than representation. A voluminous literature has examined the use of cues to facilitate immediate word retrieval as well as to enhance future word retrieval success in aphasia. Investigations have examined cues of different types including form-based cueing, such as word repetition or phonological onset cueing (e.g., Best, Herbert, Hickin, Osborne, and Howard, 2002; Lorenz and Nickels, 2007; Lorenz and Ziegler, 2009; Meteyard and Bose, 2018; Miceli, Amitrano, Capasso, and Caramazza, 1996), semantic feature cueing (e.g., Boyle, 2004; Boyle and Coelho, 1995; Howard, Patterson, Franklin, Orchard-Lisle, and Morton, 1985; Lorenz and Ziegler, 2009; Patterson, Purell, and Morton, 1983), sentence prompt cueing (e.g., Pease and Goodglass, 1978; Rochford and Williams, 1962), or graphemic cueing, such as the initial letter (e.g., Best et al., 2002; Lorenz and Nickels, 2007). Word retrieval is typically facilitated by cueing to some degree in the PWA studied, but who will benefit from what kind of cue and how different types of cues relate to future word retrieval remains unclear (for relevant findings and discussion, see e.g., Best et al., 2002; Cheneval, Bonnans, and Laganaro, 2018; Heath, McMahon, Nickels et al., 2012; Lorenz and Ziegler, 2009; Wambaugh, Linebaugh, Doyle et al., 2001). Perhaps an understanding of how one might leverage cues effectively to treat word retrieval deficits in aphasia can be advanced by appeal to use-dependent learning principles. Imagine a use-dependent learning mechanism that operates at each of the two general stages of lexical access: mapping from semantics to a lexical representation (i.e., word; this is “Stage 1”) followed by mapping from the word to its phonology (“Stage 2”). In use-dependent learning models, the act of mapping between representations also drives persistent changes in that mapping (e.g., Oppenheim et al., 2010; Howard et al., 2006). Applied here, whether and to what degree production of a word involves Stage 1 and/or Stage 2 should dictate whether and to what degree each stage is strengthened. For example, when the full word is presented aurally and the patient repeats the name (i.e., word repetition task) for a depicted object, production involves activation of the word representation followed by largely the same phonological output processes (Stage 2) as if the word were produced via naming (for relevant evidence, see Nozari, Kittredge, Dell,
Disorders of Lexical Access And Production 623 and Schwartz, 2010). Because word repetition circumvents or only weakly activates semantically driven retrieval (Stage 1), use-dependent learning will be largely limited to Stage 2, with very little effect on Stage 1. In contrast, in naming, word production requires, and thus can refine, the mappings at both Stage 1 and Stage 2. The efficacy of word repetition compared to naming is closely related to the application of “errorless learning” principles to naming treatment (for reviews, see Fillingham, Hodgson, Sage, and Lambon Ralph, 2003; Middleton and Schwartz, 2012). In the naming treatment literature, the typical treatment experience in errorless learning involves word repetition, where the patient hears/sees the object name at picture onset, then orally repeats the name in the presence of the picture. Interest in errorless learning naming treatment derived from two core features. First, because the target word is provided, the patient is given the best chance for oral production of the target word in the presence of the picture on every trial, a procedure that may capitalize on benefits from Hebbian learning (e.g., Hebb, 1949; McCandliss, Fiez, Protopapas, Conway, and McClelland, 2002). Second, presentation of the target word at picture onset preempts a naming attempt, which was expected to decrease word retrieval errors during treatment and prevent PWA from learning their errors. Early studies in aphasia compared errorless learning to so-called errorful treatments, where naming attempts (and hence errors) are permitted (Fillingham, Sage, and Lambon Ralph, 2005a, 2005b, 2006). The general approach in this body of work was to assess performance on items given errorless or errorful naming treatment, with the effects of the treatments typically measured soon after treatment (e.g., one week) and at treatment follow-up after a delay (e.g., one month). In a series of single-subject controlled comparisons, both errorless and errorful treatment (naming with a phonological cue, e.g., first letter/sound) improved naming for most participants (Fillingham et al., 2005a, 2005b, 2006). However, group- level analyses were not reported in these studies, rendering relative efficacy of the two methods unknown. Available group studies showed either a trend (Conroy, Sage, and Lambon Ralph, 2009) or a reliable advantage (McKissock and Ward, 2007) for errorless over errorful methods. However, important differences between errorless and errorful methods in these experiments may have undermined the efficacy of errorful treatment. For example, in the errorless condition in Conroy et al., correct target names were produced as many as five times more often per item as in the errorful condition (i.e., 20 versus 100 productions per item over the course of treatment). In McKissock and Ward (2007), experience producing target information was better equated. However, error production in the errorful condition was emphasized, which may have deleteriously affected its efficacy—no cueing was provided to assist successful naming; and guesses were strongly encouraged on each errorful trial, forcing participants to produce errors when otherwise they may have refrained from responding. Engaging patients in training conditions under which word retrieval generally fails flies in the face of prescriptions derived from the vast psychological literature on retrieval practice effects. Hundreds of studies on retrieval practice (a.k.a. 
testing effects or test-enhanced learning) have shown that the act of retrieving information from long-term memory powerfully bolsters the learning and retention of that information (for reviews, see
624 Daniel Mirman and Erica L. Middleton Rawson and Dunlosky, 2011; Roediger and Butler, 2011; Roediger and Karpicke, 2006; Roediger, Putnam, and Smith, 2011). Empirical demonstrations of retrieval practice effects typically start with an initial period of study, allowing participants to become familiar with to- be- learned (i.e., target) information. Participants are then given opportunities to attempt to retrieve the information from long-term memory (“retrieval practice condition”), or they are given the information for more study (“restudy condition”). A retrieval practice effect is demonstrated when the retrieval practice condition outperforms the restudy condition on subsequent measures of performance. Retrieval practice effects typically become more pronounced at longer retention intervals and when target information is presented for additional study after retrieval practice (i.e., feedback; Rowland, 2014). It is generally accepted that the act of retrieval itself confers learning and that the learning is potent because retrieval practice effects are still observed when feedback is not provided after retrieval (for a meta-analytic summary, see Rowland, 2014) and when the retrieval practice condition is at a marked disadvantage in terms of amount of exposure to target material, compared to restudy (e.g., Carpenter and DeLosh, 2005; Carrier and Pashler, 1992; Karpicke and Roediger, 2007a; Karpicke and Roediger, 2008). In a recent study, Middleton, Schwartz, Rawson, and Garvey (2015) adapted the standard retrieval practice paradigm to a naming treatment context to assess the potential clinical relevance of retrieval practice for improving word retrieval in people with aphasia with naming impairment, and to compare the benefits of retrieval practice naming treatment to errorless learning. Prior to the experiment, a large picture corpus of common everyday objects (e.g., waffle, robot, pumpkin) with high name agreement was administered to each PWA to select items for training that the PWA experienced difficulty naming. These items were assigned in matched fashion into three conditions: (1) cued retrieval practice, in which the picture and initial sound/letter of the name was presented and the PWA attempted to name the picture; (2) noncued retrieval practice, in which the picture alone was presented for a naming attempt; (3) errorless learning, in which at picture onset the full name was seen/heard and repeated by the participant. During the experiment, pictured objects and their names were first presented for initial study, as in the retrieval practice paradigm. After ten minutes, each item was administered for one trial of cued retrieval practice, noncued retrieval practice, or errorless learning, and all trials ended in feedback. Results across the group showed that during training, the rate of successful word production was superior in the errorless learning condition compared to the other conditions (see Figure 30.1). On a next-day test of naming, both retrieval practice conditions outperformed the errorless learning condition, and on a one-week test of naming, the cued retrieval practice condition outperformed the errorless learning condition. Such retrieval practice effects in naming were attributable to changes in lexical access because of characteristics of the stimuli and the participants. 
Specifically, the stimuli involved common, everyday objects with high name agreement, which assured the learning effects reflected changes in access to existing linguistic representations rather than acquisition of new lexical knowledge. Furthermore, background testing pointed to lexical access deficits as a major
Disorders of Lexical Access And Production 625 contributor to the naming impairment in the sample of PWA studied. Overall, the study provided original evidence that the act of retrieving a word from long-term memory strengthens its later access more potently than hearing/seeing and repeating the word. Another well-established trait of retrieval practice is that effortful but successful retrievals confer more persistent benefits than easy and successful retrievals. A well- studied means to manipulate retrieval effort involves altering the lag (i.e., number of interleaved unrelated trials) between repeated training trials for an item. The relationship between lag and retrieval effort and concomitant benefits to learning has been demonstrated in a number of studies (Karpicke and Bauernschmidt, 2011; Karpicke and Roediger, 2007b; Pashler, Zarow, and Triplett, 2003; Pyc and Rawson, 2009). For example, Pyc and Rawson (2009) found long (lag-29) versus short (lag-4) lag was associated with greater effort during retrieval practice as revealed by greater time to successfully retrieve target information as well as better performance on final outcome measures (see also Karpicke and Bauernschmidt, 2011; Karpicke and Roediger, 2007b). However, increasing lag at training can also increase retrieval failure at training. Successful retrieval practice confers more learning than unsuccessful retrieval practice (Rowland, 2014), so increases in lag beyond some point may be associated with diminishing returns. Effortful retrieval effects have also been documented in lexical access in aphasia (Middleton, Schwartz, Rawson, Traut, and Verkuilen, 2016). In Middleton et al., items were administered for multiple training trials of noncued retrieval practice or errorless learning with varied lag between repeated trials per item. The retrieval practice
[Figure 30.1 here: response accuracy (y-axis from 0.50 to 1.00) during training, at the next-day test, and at the one-week test, shown separately for the errorless learning, cued retrieval practice, and noncued retrieval practice conditions, with significance markers (***, **, *, n.s.) indicated above the comparisons.]
Figure 30.1 Response accuracy during training, at a next-day test of naming, and a one-week test of naming as a function of two types of retrieval practice naming treatment and errorless learning naming treatment. Note: ***, **, * correspond to