Linguistic Intuitions. Evidence and Method 9780198840558



English Pages [315] Year 2020


Table of contents:
Frontmatter
Contents
Acknowledgments
List of abbreviations
The contributors
1 Introduction (Brøcker et al.)
PART I. ACCOUNTS OF LINGUISTIC INTUITIONS
2 Linguistic intuitions (Gross)
2.1 Introduction
2.2 Devitt on the "Voice of Competence" and his modest alternative
2.3 Clarifying the options and locating the current proposal
2.4 Candidate monitoring mechanisms
2.5 Error signals and linguistic intuitions
2.6 Error signals as the Voice of Competence
2.7 Other linguistic intuitions, other sources
2.8 Why not intuitions elsewhere
Acknowledgments
3 A defense of the Voice of Competence (Rey)
3.1 A "Voice of Competence"
3.1.1 Devitt's skepticism about (non-)standard models
3.1.2 I-languages vs. E-languages
3.1.3 Devitt's alternative proposal
3.2 Grammar, parsing, and perception
3.2.1 Linguistic perception, phonology, and parsing
3.2.2 Non-conceptual content: NCSDs
3.2.3 Having vs. representing linguistic properties
3.2.4 How NCSDs would help
3.3 The evidence
3.3.1 Involuntary phonology
3.3.2 "Meaningless" syntax
3.3.3 Syntax trumping "the message": Garden paths, structural priming, and "slips of the ear"
3.4 Conclusion
Note
4 Linguistic intuitions again (Devitt)
Part I: VoC and ME
4.1 Introduction
4.2 Voice of Competence
4.3 Modest explanation
4.4 The rejection of VoC
Part II: Gross, Rey, and the defense of VoC
4.5 Background
4.6 Intuitive linguistic usage versus intuitive metalinguistic judgment (= linguistic intuition)
4.7 Rey's bait and switch
4.8 Rey's arguments for VoC
4.9 Gross' argument for VoC
4.10 Conclusion
5 Do generative linguists believe in a Voice of Competence? (Brøcker)
5.1 What is the received view?
5.2 Dissecting VoC
5.2.1 Competence or experience with reflecting on sentences?
5.2.2 Acceptability or grammaticality?
5.2.3 Role of mental grammar: Supplying data or content?
5.2.4 Fallibility/direct access?
5.2.5 Mentalist view of grammar?
5.2.6 Implemented structure rules?
5.3 The study
5.3.1 Materials and participants
5.3.2 Results
5.4 The received view assessed
5.5 Appendix
Acknowledgments
6 Semantic and syntactic intuitions (Collins)
6.1 Introduction
6.2 Semantics and syntax: Linguistic phenomena
6.3 The distinction between syntactic and semantic intuitions
6.4 Semantic intuitions
6.5 Why the attitude view is wrong
6.6 The right view of semantic intuitions
6.7 Syntactic intuitions
6.8 Grammaticality without interpretation
6.9 Interpretations without grammaticality
6.10 Conclusion
Acknowledgments
7 Intuitions about meaning, experience, and reliability (Drożdżowicz)
7.1 Introduction
7.2 Intuitive judgments about meaning and their use
7.3 Intuitions and the phenomenology of language understanding: The experience-based strategy
7.4 Intuitive judgments and the monitoring of speech comprehension: The reliabilist strategy
7.5 Comparing the two strategies
7.5.1 Descriptive adequacy
7.5.2 Addressing recent criticisms
7.5.3 Epistemological assumptions
7.6 Objections and replies
7.7 Concluding remarks
Acknowledgments
8 How we can make good use of linguistic intuitions, even if they are not good evidence (Santana)
8.1 The evidential role of linguistic intuitions
8.2 Fruitfulness
8.3 Apt etiology and the ontology of language: The social account
8.4 The etiological defense: Mentalistic
8.5 Reliability
8.6 The non-evidential role of linguistic intuitions
8.7 Conclusion
PART II. EXPERIMENTS IN SYNTAX
9 The relevance of introspective data (Newmeyer)
9.1 Introduction
9.2 A critique of "Transitivity, clause structure, and argument structure: Evidence from conversation"
9.3 The grammatical complexity of everyday speech
9.4 Some general issues regarding conversational corpora
9.4.1 Some positive features of conversational corpora
9.4.2 Some negative features of conversational corpora
9.5 Concluding remarks
Acknowledgments
10 Can we build a grammar on the basis of judgments? (Featherston)
10.1 Introduction
10.2 The quality of armchair judgments
10.2.1 Armchair judgments are less sensitive
10.2.2 Armchair judgments are noisy
10.3 Reassessing SA12 and SSA13
10.3.1 What counts as success?
10.3.2 Do binary oppositions make a grammar?
10.3.3 Do armchair judgments make too many distinctions?
10.4 Towards a better grammar
10.4.1 Crime and punishment
10.4.2 Gathering quantified judgments
10.5 Summing up
10.6 Appendix
11 Acceptability ratings cannot be taken at face value (Schütze)
11.1 Introduction
11.1.1 Motivation
11.1.2 The approach
11.1.3 Roadmap
11.2 Background
11.2.1 What we did
11.2.2 What we found and what we can and cannot conclude
11.2.3 Why we should not be surprised
11.3 Experimental methods
11.3.1 Properties common to all three experiments
11.3.2 Properties unique to individual experiments
11.4 Results
11.4.1 General observations
11.4.2 Case studies
11.5 Conclusions and future directions
Acknowledgments
12 A user's view of the validity of acceptability judgments as evidence for syntactic theories (Sprouse)
12.1 Introduction
12.2 A theory of acceptability judgments
12.2.1 What is the goal of syntactic theory?
12.2.2 What is the cognitive source of acceptability judgments
12.2.3 What is the logic that is used to convert acceptability judgments into evidence
12.2.4 What are the criteria for evaluating the success (or failure) of acceptability judgments?
12.3 The empirical properties of acceptability judgments
12.3.1 The reliability of acceptability judgments
12.3.2 Theoretical bias in acceptability judgments
12.3.3 The sensitivity of acceptability judgments
12.4 The choice to continue to use acceptability judgments
12.4.1 The scientific question
12.4.2 The practical question
12.5 Conclusion
13 Linguistic intuitions and the puzzle of gradience (Häussler & Juzek)
13.1 Introduction
13.1.1 The process of community agreement and acceptability judgment tasks
13.1.2 Acceptability and grammaticality
13.2 Linguistic intuitions
13.2.1 Acceptability intuitions
13.2.2 Source intuitions
13.2.3 Grammatical reasoning
13.2.4 A proposal concerning diacritics
13.3 Our experiments
13.3.1 Experiment 1: A different quality of gradience
13.3.2 Preliminary discussion
13.3.3 Experiment 2: Scale effects
13.4 The puzzle of gradience
13.4.1 Extra-grammatical factors
13.5 Concluding remarks
Acknowledgments
14 Experiments in syntax and philosophy (Schindler & Brøcker)
14.1 Introduction
14.2 Experimental syntax
14.3 Is XSyn methodologically superior?
14.3.1 Better reliability of the data gathered through XSyn?
14.3.2 Better validity through XSyn?
14.3.3 Richer data through XSyn?
14.3.4 Is XSyn more scientific?
14.4 Experimental philosophy: Common motivations
14.5 Lessons for XPhi from XSyn
14.5.1 Better reliability of data gathered by XPhi?
14.5.2 Better validity of XPhi?
14.5.3 Richer data through XPhi?
14.5.4 Is XPhi more scientific?
14.6 Conclusion
Acknowledgments
References
Index

Linguistic Intuitions: Evidence and Method

Edited by
SAMUEL SCHINDLER, ANNA DROŻDŻOWICZ, AND KAREN BRØCKER

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries © editorial matter and organization Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker 2020 © the chapters their several authors 2020 The moral rights of the authors have been asserted First Edition published in 2020 Impression: 1 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this work in any other form and you must impose this same condition on any acquirer Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016, United States of America British Library Cataloguing in Publication Data Data available Library of Congress Control Number: 2020932181 ISBN 978–0–19–884055–8 Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Contents

Acknowledgments
List of abbreviations
The contributors

1. Introduction
Karen Brøcker, Anna Drożdżowicz, and Samuel Schindler

PART I: ACCOUNTS OF LINGUISTIC INTUITIONS
2. Linguistic intuitions: Error signals and the Voice of Competence
Steven Gross
3. A defense of the Voice of Competence
Georges Rey
4. Linguistic intuitions again: A response to Gross and Rey
Michael Devitt
5. Do generative linguists believe in a Voice of Competence?
Karen Brøcker
6. Semantic and syntactic intuitions: Two sides of the same coin
John Collins
7. Intuitions about meaning, experience, and reliability
Anna Drożdżowicz
8. How we can make good use of linguistic intuitions, even if they are not good evidence
Carlos Santana

PART II: EXPERIMENTS IN SYNTAX
9. The relevance of introspective data
Frederick J. Newmeyer
10. Can we build a grammar on the basis of judgments?
Sam Featherston
11. Acceptability ratings cannot be taken at face value
Carson T. Schütze
12. A user's view of the validity of acceptability judgments as evidence for syntactic theories
Jon Sprouse
13. Linguistic intuitions and the puzzle of gradience
Jana Häussler and Tom S. Juzek
14. Experiments in syntax and philosophy: The method of choice?
Samuel Schindler and Karen Brøcker

References
Index

Acknowledgments

This volume was made possible by a generous Sapere Aude grant (4180-00071) that Samuel Schindler received from the Independent Research Fund Denmark for the project “Intuitions in Science and Philosophy,” which ran from 2016 to 2019 at Aarhus University. The project investigated intuitive judgments in thought experiments in physics and acceptability judgments in linguistics and drew lessons for debates about the use of intuitive judgments in philosophy. The workshop “Linguistic Intuitions, Evidence, and Expertise,” which was held at Aarhus University in October 2017 as part of the project, initiated the idea for the present volume. The workshop brought together linguists and philosophers interested in the methodological foundations of linguistics to discuss whether intuitions can legitimately be used as evidence in linguistics. We are grateful to the participants for fruitful discussions and inspiration that has led to this book.

We would like to thank all contributors to our volume for their hard work and dedication. We are also very grateful to our expert reviewers, who provided anonymous comments for the individual chapters, as well as to the anonymous reviewers for Oxford University Press, who provided feedback on the full manuscript. Our editors at the press, Julia Steer and Vicki Sunter, have been of great help in the process. We also thank the copy-editor for the press, Manuela Tecusan, Christoffer Lundbak Olesen for his help with compiling the book’s bibliography, and Lori Nash for doing the index.

Anna Drożdżowicz also acknowledges support in late 2018 and 2019 from her postdoctoral project “Language, Meaning and Conscious Experience” (grant: 275251); this support was received from the Mobility Grant Fellowship Programme (FRICON) and funded by The Research Council of Norway and The Marie Skłodowska-Curie Actions.

List of abbreviations

ACD        antecedent-contained deletion
AMT        Amazon Mechanical Turk
c-command  constituent command
CP         complementizer phrase; central processor
DP         determiner phrase
ECM        exceptional case-marking
EEG        electroencephalography
EPP        extended projection principle
ERP        event-related potential
FC         forced choice
LS         Likert scale
LTL        learned theory of language
ME         modest explanation; magnitude estimation
NCSD       non-conceptual structural description
NP         noun phrase
PP         prepositional phrase
SD         structural description
SLE        standard linguistic entity
TP         tense phrase
UG         universal grammar
VoC        Voice of Competence
VP         verb phrase

The contributors

Karen Brøcker holds a PhD in science studies and an MA in linguistics from Aarhus University. Her research focuses on theoretical linguistics and philosophy of linguistics, in particular the theoretical assumptions underlying the use of linguistic intuitions as evidence for theories of grammar. Her PhD was part of the project “Intuitions in Science and Philosophy” at the Centre for Science Studies, Aarhus University.

John Collins is Professor of Philosophy at the University of East Anglia. He has written widely on the philosophy of language and mind, with especial reference to generative linguistics and its impact on analytical philosophy, and the concept of truth and the nature of propositions. He is the editor (with Eugen Fischer) of Experimental Philosophy, Rationalism, and Naturalism (2015) and (with Tamara Dobler) of The Philosophy of Charles Travis (2018), and the author of many articles and the monographs Chomsky: A Guide for the Perplexed (2008), The Unity of Linguistic Meaning (2011), and Linguistic Pragmatism and Weather Reporting (2020).

Michael Devitt is Distinguished Professor of Philosophy at the Graduate Center of the City University of New York. His main research interests are in the philosophy of language, philosophy of linguistics, philosophy of mind, realism, and methodological issues prompted by naturalism. He is the author of the following books: Designation (1981); Realism and Truth (1984/1991/1997); Language and Reality (with Kim Sterelny, 1987/1999); Coming to Our Senses (1996); Ignorance of Language (2006); Putting Metaphysics First (2010); Biological Essentialism (forthcoming). He has co-edited (with Richard Hanley) The Blackwell Guide to the Philosophy of Language (2006).

Anna Drożdżowicz is postdoctoral researcher at the University of Oslo. She received her PhD from the University of Oslo in 2015. From 2016 to 2018 she was a postdoctoral researcher in the project “Intuitions in Science and Philosophy” at the Centre for Science Studies, Aarhus University. She works primarily in the philosophy of mind and language, but has also published papers in philosophical methodology, psycholinguistics, and the philosophy of psychiatry.

Sam Featherston studied German and social anthropology at the University of London before teaching English as a second language, German, and French in London schools. From 1991 to 1994 he was a Voluntary Service Overseas development volunteer in Tanzania, where he married his wife Véronique. Both carried out in-service training for English teachers in secondary schools. On his return he completed an MA in applied linguistics at Essex University, studying under Andrew Radford and Harald Clahsen, who then supervised his PhD thesis Empty Categories in Language Processing. He came to Tübingen in 1999, joined the English department in 2013, and obtained his professorship in 2014. His major research interests are grammar architectures, empirical approaches to syntax, and the boundaries of processing and grammar. He is currently working on the relationships between clause types and binding across clause boundaries.

Steven Gross is Professor of Philosophy at Johns Hopkins University, with secondary appointments in Cognitive Science and in Psychological and Brain Sciences. Gross has published on a variety of topics in philosophy of language, philosophy of mind, and the foundations of the mind–brain sciences. His most recent publications have focused on perceptual consciousness and on cognitive penetration. Current projects include “anti-Bayesian” updating in vision and whether linguistic meaning is perceived or computed post-perceptually.

Jana Häussler works in the Fakultät für Linguistik und Literaturwissenschaft, Bielefeld University. She obtained her PhD from the University of Konstanz in 2009 with a dissertation on agreement attraction errors in sentence comprehension. Her research interests include the empirical foundation of syntactic theory, morphosyntax, the syntax–semantics interface, and sentence processing. Her current research centers on errors and deviations: how they affect comprehension, when listeners notice a deviation and, when they consider it an error, what the grammatical status of errors and deviations is.

Tom S. Juzek is postdoctoral researcher at Saarland University. Tom takes an interest in the nature of morpho-syntactic knowledge, how it interacts with other linguistic layers such as semantics, and how this can be modeled. In his PhD thesis, with Mary Dalrymple (Oxford) and Greg Kochanski (Google), Tom was looking into various methodological questions surrounding acceptability judgment tasks. He currently works on a project with Elke Teich on how diachronic changes in information density surface at various linguistic levels.

Frederick J. Newmeyer was Professor of Linguistics at the University of Washington, where he taught from his 1969 PhD until his retirement in June 2006. He is now Professor Emeritus at Washington and Adjunct Professor at the University of British Columbia and at Simon Fraser University. Newmeyer is the author or editor of twelve books, including Linguistic Theory in America, Language Form and Language Function, and Possible and Probable Languages. In 2002 Newmeyer was president of the Linguistic Society of America.

Georges Rey (PhD Harvard, 1978) is Professor of Philosophy at the University of Maryland. Rey has previously held positions at SUNY Purchase and the University of Colorado, and visiting positions at the MIT, at Stanford University, and at the Centre de Recherche en Épistémologie Appliquée (CREA) and the École Normale Supérieure (ENS) in Paris. With Nicholas Allott, Carsten Hansen, and Terje Lohndal, he runs a regular summer institute on language and mind in Oslo. He is the author of Contemporary Philosophy of Mind (1997), as well as of innumerable articles in this area (available at sites.google.com/site/georgesrey) and is presently completing Representation of Language: Philosophical Issues in a Chomskyan Linguistics (forthcoming).

Carlos Santana is Assistant Professor of Philosophy at the University of Utah and works primarily in the philosophy of linguistics, as well as in the philosophy of the environmental sciences.

Samuel Schindler is Associate Professor of Philosophy of Science at the Centre for Science Studies at Aarhus University in Denmark. He holds a BSc in Cognitive Science and received his MA and PhD in Philosophy from the University of Leeds (UK). His research focuses on methodological and epistemological issues in the history and philosophy of science. His publications include Theoretical Virtues in Science: Uncovering Reality through Theory (CUP, 2018). He has won two multi-year research grants from two different national funding bodies, one of which was for the project “Intuitions in Science and Philosophy”.

Carson T. Schütze is Professor of Linguistics at the University of California, Los Angeles. He has done research in syntax, morphology, first language acquisition, and language processing, as well as in linguistic methodology, on which he has (co-)authored ten encyclopedia and handbook chapters and review articles, in addition to his monograph The Empirical Base of Linguistics (1996, reprinted 2016). He has published on case, agreement, expletives, and auxiliaries, often with a focus on Germanic, in journals such as Linguistic Inquiry, Syntax, Language, Lingua, and Language Acquisition. He has also co-edited three large volumes on psycholinguistics.

Jon Sprouse is Professor of Linguistics at the University of Connecticut. His research focuses on experimental syntax (the use of formal experimental measures to explore questions in theoretical syntax) and particularly on acceptability judgments. He is co-editor of Experimental Syntax and Island Effects (with Norbert Hornstein, 2013) and editor of the forthcoming Oxford Handbook of Experimental Syntax.

1 Introduction

Karen Brøcker, Anna Drożdżowicz, and Samuel Schindler

In recent years there has been an increased interest in the evidential status and use of linguistic intuitions in both linguistics and philosophy. This volume offers the most recent cutting-edge contributions from linguists and philosophers who work on this topic. In this introductory chapter we present the two main questions that have been at the core of these debates and that will be systematically covered in this volume; then we provide a synopsis of the forthcoming chapters. Modern linguists, particularly in the Chomskyan tradition, regularly use native speakers’ intuitive judgments about sentences as evidence for or against their linguistic theories. These judgments are typically about morphosyntactic, semantic, pragmatic, or phonetic aspects of sentences. In particular, judgments of morphosyntactic well-formedness are often called acceptability judgments in the literature. It is common practice for linguists to informally use their own or their colleagues’ intuitions as evidence for theories about grammar. This practice of relying on linguistic intuitions as evidence—and, in particular, the practice of relying on linguists’ intuitions as evidence—raises two questions. (1) What is the justification of using linguistic intuitions as evidence? We can call this the justification question. (2) Are formal methods of gathering intuitions epistemically and methodologically superior to informal ones? We can call this the methodology question. The present volume brings together philosophers and linguists from these two strands of the debate in order to shed light on the two questions. The more specific questions discussed in this volume are: What is the etiology of linguistic intuitions? In other words what are linguistic intuitions caused by, speakers’ linguistic competence or speakers’ linguistic experience? How big is the risk of bias and distortion when linguists use their own intuitions as evidence? 
Can the evidential value of linguistic intuitions be improved by systematically studying the intuitions of non-linguists? Or are there good reasons for preferring the judgments of expert linguists? Is the gradience of acceptability judgments indicative of gradient grammar, or rather of performance factors? Do theoretical reflections improve or worsen the quality of one’s intuitions? In what follows, we give a brief introduction to the justification question and the methodology question. We also summarize the individual chapters that make up the two parts of this volume. Let us start with the justification question.

Karen Brøcker, Anna Drożdżowicz, and Samuel Schindler, Introduction. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Karen Brøcker, Anna Drożdżowicz, and Samuel Schindler. DOI: 10.1093/oso/9780198840558.003.0001


For linguistic intuitions to be usable as evidence in the construction of theories of grammar, it must be the case that such intuitions are actually informative about the grammar of the speaker’s language. But why should we think that to be the case? On the face of it, using intuitions as evidence seems highly unscientific. Physicists, for example, don’t use their intuitions when they figure out the laws underlying the behavior of a physical system. On the contrary, they often go against them. Likewise, doctors don’t just rely on their intuitions when trying to determine the disease that matches a patient’s symptoms, but instead use medical tests. And engineers don’t build a bridge on the basis of their intuitions that it might hold up. We certainly would not be inclined to trust the intuitions of lay subjects when doing these things. Why should we then think that linguistics is somehow privileged when it comes to the use of intuitions as evidence? A widely discussed account of why linguistic intuitions can provide evidence comes from the Chomskyan tradition, according to which linguistics is a branch of cognitive psychology (Smith and Allott 2016, ch. 3). On this view, the objects of linguistic study are the aspects of the mind or brain that are responsible for our language abilities. Linguistic intuitions can provide evidence about the computational operations of these mind–brain mechanisms because—and to the extent that—their etiology involves those mechanisms. Importantly, these mechanisms typically do not exhaust the etiology of linguistic intuitions. Other mental systems are also involved in their production (Maynes and Gross 2013). A classic example is that of intuitions that sentences with center embeddings are unacceptable. Such intuitions are widely considered to reflect memory constraints on parsing. Recently the justification question has been brought to the fore by the philosopher Michael Devitt in several publications (see e.g. 
Devitt 2006b, 2006c, 2010a, 2010b, 2013b). He attributes to generative linguists a view according to which linguistic intuitions are what he calls the Voice of Competence (VoC): the informational content of the intuition is supplied by the speaker’s linguistic competence. According to Devitt, this view is epistemologically highly immodest in the cognitive mechanisms it postulates. Devitt himself argues for a view on which linguistic intuitions are every-day judgments about sentences: judgments with no special etiology, made according to the speaker’s (folk) theoretical concept or theories about “grammaticality.” Critics have argued that this view accounts for the wrong kind of linguistic intuitions, as it entails that experts with more relevant experience and better concepts than the ordinary speaker will, overall, make better intuitive judgments than lay subjects. There is evidence that this is not the case (Gross and Culbertson 2009). The attribution of the VoC view to generative linguists has also been heavily criticized, and several philosophers of linguistics have suggested alternative, competence-based views. On these views, competence plays a central role by supplying some special input for linguistic intuitions without directly supplying the informational content of intuitions. On Textor’s (2009) account, the
competence provides linguistic “seemings,” meaning that sentences simply “present themselves” as well formed or not to the speaker. On Rey’s (2014b; forthcoming-b; this collection) account, the speaker’s linguistic competence provides structural descriptions of sentences that then form the basis for intuitive judgments. On the accounts by Maynes and Gross (2013), Gross (this volume), and Drożdżowicz (2018), the speaker’s linguistic competence provides error signals or some other output from a monitoring mechanism and these then become, one way or another, the basis for linguistic intuitions. In reply, Devitt has criticized several of these accounts, either for lacking detail about how the special input from competence is transformed into the informational content of linguistic intuitions or for essentially reducing to his own position, according to which linguistic intuitions are “central processor” judgments (Devitt 2013b). These accounts of how linguistic intuitions are generated also offer different answers to the justification question. On the account that Devitt argues for, we are justified in using linguistic intuitions as evidence to the degree that the person making them has a good amount of experience with his or her language and a good (folk) linguistic theory to apply when making his or her judgment. On both VoC and Devitt’s critics’ views, we are justified in using linguistic intuitions as evidence because of the causal role that the speaker’s linguistic competence plays in the etiology of linguistic intuitions. The justification question may be approached in other ways than by appealing to the etiology of linguistic intuitions, however. Alternative options include calibrating linguistic intuitions with other sources of data (known or suspected to be justified themselves) and appealing to the fruitfulness of relying on linguistic intuitions as evidence. A critical discussion of these options can be found in Santana’s chapter (this volume). 
In this volume, the following chapters address the justification question. Steven Gross starts by summarizing the VoC view as characterized by Devitt as well as Devitt’s own preferred account of the etiology of linguistic intuitions. He then argues that these two accounts do not exhaust the possibilities for the etiology of linguistic intuitions. He presents an alternative view, according to which linguistic intuitions are based on error signals that are produced by monitoring mechanisms that are constrained by the speaker’s mental grammar. It is due to this feature, according to Gross, that linguistic intuitions provide relatively reliable evidence about speakers’ languages. In support of this view, Gross reviews the literature on error signals and monitoring mechanisms. He notes that error signals could plausibly explain some features often associated with linguistic intuitions such as their negative valence, their motivational force, and their gradedness. This account is meant to work most straightforwardly for judgments of unacceptability, but Gross examines some ways in which it might be extended to judgments of acceptability and other types of linguistic intuitions as well. He argues that, if correct, this account might support a view on which linguistic intuitions are the VoC, depending on what exactly the content of error signals turns out to be.


Another defense of a version of the VoC view can be found in the chapter by Georges Rey. On his view, linguistic intuitions provide special evidence of the speaker’s linguistic competence because they have a special etiology. Rey’s account of the etiology of linguistic intuitions is closely tied to linguistic parsing. A linguistic intuition is based on a structural description of the sentence that is provided by the speaker’s parser, which, in turn, is constrained by the speaker’s mental grammar. Linguistic intuitions can provide reliable evidence about the speaker’s language, because the speaker’s linguistic competence is involved in their production. On Rey’s version of VoC, the subject gives a report about how the sentence sounded to him or her on the basis of a structural description provided by the parser. The structural descriptions are non-conceptual, which explains why ordinary speakers have no conscious awareness of interpreting structural descriptions. Furthermore, Rey presents some empirical evidence that he takes to count in favor of his account that linguistic intuitions are based on structural descriptions. Michael Devitt’s chapter is a reply to the previous two, by Gross and Rey. Devitt first summarizes the VoC view as he has characterized it and his own preferred account, the modest explanation. On VoC, speakers’ linguistic competence supplies the informational content of linguistic intuitions. On the modest explanation, linguistic intuitions are, like any other intuitive judgments a person may make, immediate and unreflective reactions formed in a central processing system. Devitt then responds to Rey’s defense of a version of VoC (chapter 3). According to Devitt, Rey’s evidence shows that parsing involves structural descriptions, not that linguistic intuitions also involve them. Devitt also questions whether structural descriptions could provide the informational content of linguistic intuitions. 
He considers the possibility that the version of the VoC view that Rey argues for does not require that structural descriptions provide the informational content of linguistic intuitions, in which case, he argues, Rey’s view would not be a version of VoC. Devitt also responds to the account presented by Gross (chapter 2), in particular to the claim that it could provide a defense of VoC. He questions the idea that the parser will output a state with content that explicitly evaluates the string in question, which is what is needed if the competence is to provide the informational content of intuitions. According to Devitt, neither Rey nor Gross present enough evidence to support VoC. He claims that both accounts rely on novel assumptions about the mind that the modest explanation does not require and that, for reasons of simplicity, we should prefer the latter. The previous chapters are all part of the debate over the VoC view and alternative ways in which to justify the evidential use of linguistic intuitions. The chapter by Brøcker focuses on another question that has been central to the debate surrounding the VoC view: whether or not it is the received view among generative linguists. As mentioned, Devitt characterizes VoC as the view that



linguistic intuitions are reliable evidence because the informational content of judgments is supplied by the speaker’s linguistic competence. He attributes this view to generative linguists; but whether generative linguists in fact subscribe to it or not has been much debated. Brøcker presents data from a questionnaire study that suggest that this is not the case. According to her findings, generative linguists do subscribe to a competence-based view, but one that does not entail that the informational content of linguistic intuitions is provided by the speakers’ competence. With this question answered, the debate on the justification for the evidential use of linguistic intuitions can focus on the normative issue of what view we ought to adopt. In his chapter, John Collins develops a conception of linguistic intuitions in order to support their evidential role in linguistics. Collins’ account covers both syntactic intuitions, that is, intuitions that reveal the conditions for a sentence to have an interpretation, and semantic intuitions, that is, intuitions that reveal the constraints on what can be said with a sentence that has such a fixed interpretation. He proposes that syntactic and semantic intuitions should be treated as two aspects of the same phenomenon, in other words as intuitions about what can be said with a sentence. According to Collins, language users do not have direct access to linguistic facts, be they syntactic or semantic. Rather, he argues, it is the theorist’s task to figure out what such intuitions reveal about semantics or syntax (or both). In support of this account, Collins presents several observations about how syntactic and semantic intuitions are typically interpreted by theorists. The proposed view is also meant to accommodate and explain some of the puzzling cases of linguistic intuitions, where one appears to have interpretation without grammaticality and grammaticality without interpretation. 
Collins argues that, so described, both cases are illusory and proposes that in such situations grammaticality is actually aligned with interpretability.

The scope of the volume goes beyond issues concerning syntactic intuitions. The chapter by Anna Drożdżowicz focuses on linguistic intuitions about meaning. Speakers’ intuitive judgments about meaning are commonly taken to provide important data for many debates in philosophy of language and pragmatics. Drożdżowicz discusses two strategies that aim to explain and justify the evidential role of intuitive judgments of this sort. The first strategy is inspired by what is called the perceptual view on intuitions, which emphasizes the experience-like nature of intuitions. The second strategy is reliabilist in that it derives the evidential utility of intuitions about meaning from the reliability of the psychological mechanisms that underlie their production. Drożdżowicz argues that we have strong reasons to favor the reliabilist view. In support of her claim, she presents evidence suggesting that the reliabilist strategy fares better than the experience-based one on three parameters: it can better capture the practice of appealing to judgments about meaning; it can respond to recent criticisms from experimental philosophy concerning the diversity of such judgments; and it requires fewer epistemological commitments.


An entirely different strategy in the debate concerning the evidential status of linguistic intuitions is developed by Carlos Santana. In his chapter, Santana discusses critically three approaches to justifying the evidential use of intuitions in linguistics: the first one claims that linguistic intuitions lead to fruitful scientific discourse, the second one that they have a close causal relationship with language, and the third one that they are reliable. After examining the shortcomings of each of these approaches, he argues that linguistic intuitions do not actually play any evidential role in linguistic theories. Rather, Santana argues, they do frequently play a non-evidential role by delimiting what belongs to the shared background and which questions are currently debated in linguistics. On this account, when linguists use intuitions, they appeal to shared assumptions or established theories. As Santana argues, this conception allows for a role of intuitions in linguistics, but encourages caution in cases of complex or unusual sentences that may result in judgments that might not be part of the consensus among linguists. Unlike the justification question, which has been debated predominantly among philosophers, the methodology question has been thoroughly discussed among generative linguists themselves. This interesting difference may be due to several reasons. One important reason may have to do with the fact that the methodology question is of more immediate practical interest to linguists, whereas the justification question may seem just too far removed from their day-to-day work. Be that as it may, linguists such as Carson Schütze and Wayne Cowart have made significant contributions to debates concerning the methodology question, which started in the 1990s. 
Following Chomsky’s groundbreaking publication of Syntactic Structures in 1957, syntactic intuitions were standardly collected informally, from just one or a few native speakers, in some cases from the linguist and their colleagues themselves. Schütze (1996) provides an early critical and comprehensive discussion of the use of linguistic intuitions as evidence for grammatical theories. Cowart (1997), also critical of informal methods, makes several suggestions as to how linguists could gather syntactic intuitions more systematically and within properly controlled experimental designs. The move away from the informal method of data collection within syntax is often referred to as experimental syntax. The experimentalists argue for a methodological reform. They believe that linguistic intuitions should be collected in carefully designed studies, preferably using large numbers of lay subjects, large samples of test items, an appropriate study design, and statistical tests. Some proponents of experimental syntax have also argued that grammars ought to accommodate the widely accepted gradience of acceptability judgments and that the traditional, strict dichotomy between grammatical and ungrammatical is mistaken (see e.g. Featherston 2007). In response to this methodological challenge posed by experimentalists, some linguists have recently argued that linguistic intuitions produced in the armchair do in fact live up to reasonable methodological standards. Some of the most



prominent works in this area are the studies conducted by Jon Sprouse and his colleagues (e.g. Sprouse and Almeida 2012a; Sprouse et al. 2013), which show that the results of formal and informal studies overwhelmingly coincide with each other, undermining at least some of the worries concerning the traditional method that were raised by the experimentalists.

Here is now an outline of the chapters that address the methodology question in this volume. Frederick Newmeyer focuses on a topic that relates to both the justification and the methodology question. His goal is to investigate good methodological practices in linguistics; and he argues that corpus data can be used to validate the evidential use of linguistic intuitions. Proponents of conversational corpus data sometimes argue that heavy reliance on intuition data in generative linguistics has led to wrong grammatical generalizations (Thompson and Hopper 2001). These claims are usually backed by failed attempts to replicate, with conversational corpus data, specific results that are based on intuition data. Newmeyer tests some of these claims and finds that, if one uses a sufficiently large corpus, the investigated results based on intuition data are replicated with conversational corpus data. He concludes that grammars built on linguistic intuitions do not differ markedly from the ones built on conversational corpus data and that, as evidence for grammars, linguistic intuitions are no less relevant than interactional data. Newmeyer also outlines what he takes to be general benefits and drawbacks of using corpus data.

In his chapter, Sam Featherston criticizes some syntacticians for an alleged “dataphobia” and argues for the power of experiments to generate data of better quality.
In particular, Featherston takes issue with the idea that the traditional, informal way of generating data in linguistics, which he calls “armchair linguistics,” has been vindicated by two recent studies—namely the ones already mentioned, by Sprouse and Almeida (2012a) and by Sprouse, Schütze, and Almeida (2013). These studies conclude that the judgments found respectively in one of the leading linguistics journals and in linguistic textbooks do not substantially diverge from the judgments made by the lay subjects. Featherston argues that armchair judgments are in general less sensitive and more noisy than data gathered experimentally from a large number of subjects. Featherston criticizes the two studies for using a relative scale rather than the standard categorical scale of grammaticality. According to him, the use of categorical scales considerably increases error rates. Moreover, Featherston argues that the studies by Sprouse and colleagues are not suitable for building grammars, since they compare only pairs rather than multiple items. Finally, Featherston emphasizes that these studies only give an indication of the false positive rate of acceptability judgments made by professional linguists, in other words they can only show that linguists make grammatical “distinctions” not warranted by the data. However, for Featherston, the more important issue is that linguists do not draw enough grammatical distinctions.


Experimental testing of empirical syntax claims with non-linguists is the topic of Carson Schütze’s chapter. Schütze reports three experiments where the acceptability judgments given by naïve subjects were probed and argues that previously reported high convergence rates with expert or linguists’ judgments (e.g. Sprouse et al. 2013) may be less informative than it has been assumed. According to Schütze, the current methodology of computer-based acceptability experiments allows naïve subjects to give ratings that do not truly reflect their acceptability judgments. Schütze supports this claim with results from the second part of his study, where experimenters conducted follow-up interviews in which they asked participants about the ratings they gave to particular items; the aim was to determine what interpretation or parse they had assigned, whether they had missed critical words, and so on. Schütze concludes that if the experimental results are to be informative for linguistic questions, the reasons behind subjects’ responses have to be better understood. The chapter presents an interesting challenge to the current experimental approaches to syntax. Schütze suggests that progress can be made in this domain by improving current experimental designs and by systematically applying the method of structured follow-up interviews. He appeals here to an interesting but possibly controversial idea—namely that language users can have some kind of conscious access to why they are making the judgments they do. This idea can be assessed by readers themselves by consulting this collection’s chapters on the justification question. A defense of the current methodological practice of appealing to acceptability judgments in linguistics comes from another key figure in recent methodological debates. 
In his contribution to this volume, Jon Sprouse discusses the theoretical underpinnings of acceptability judgments, the empirical properties of acceptability judgments, and whether the theory and the empirical data warrant the continued use of acceptability judgments in linguistics. Sprouse’s answer to this question is a qualified “yes”: pending any future empirical evidence to the contrary, acceptability judgments are at least as good evidence as other data types in language research. More specifically, Sprouse argues that, on the empirical front, acceptability judgments are reliable across tasks and participants, sufficiently sensitive, and relatively free from theoretical bias. On the theoretical front, Sprouse argues that syntacticians have (i) a plausible theory of the source of acceptability judgments (perception, not introspection), (ii) an “experimental logic” for (informally) generating reliable acceptability judgments, and (iii) a set of evaluation criteria that are similar to the evaluation criteria used for other data types. Nevertheless, Sprouse cautions that there is no “scientific” reason to prefer acceptability judgments over other types of data. According to him, the continued use of acceptability judgments rather than other types of data is, therefore, mostly a pragmatic choice. One of the key interests in the methodological debate has been the question whether gradience in acceptability judgments implies gradience in grammar.



The chapter by Jana Häussler and Tom Juzek investigates expert and lay acceptability judgments, as well as their gradience. They report the results of experiments in which they presented their lay subjects with sentences they extracted from papers published in the established journal Linguistic Inquiry. They found that a substantial number of sentences that were deemed ungrammatical by linguists in their LI publications were rated acceptable (to some extent or another) by the lay subjects. Häussler and Juzek argue that these sentences cannot be accounted for by known “grammatical illusions.” They also rule out that their results are artifacts caused by the experimental design they used; and they don’t think that the gradience in acceptability was determined by performance factors. They conclude that the assumption usually made, that grammar is categorical, should probably be given up. The use of intuitions as evidence is not unique to linguistics. The chapter by Samuel Schindler and Karen Brøcker compares the debates concerning the methods for collecting linguistic intuitions to similar debates that are going on in philosophy about the collection of philosophical intuitions. In both fields, it has been argued that the traditional, informal method of collecting intuitions is unscientific and yields results that lack reliability, validity, and sensitivity. In fact some philosophers have appealed to experimental syntax in order to motivate the use of experimental methods in philosophy (Machery and Stich 2012). Schindler and Brøcker critically assess claims from the experimental syntax debate that experimental methods are superior to the traditional armchair method on all these counts. They find that, while experimental methods work well for avoiding theoretical bias for example, using the traditional method has its benefits as well, for instance in reducing the risk of confounding performance factors. 
On the basis of these qualifications, Schindler and Brøcker conclude that experimental syntax cannot unconditionally serve as a model for how to collect intuitions in philosophy. Schindler and Brøcker’s chapter should also be relevant to readers interested in a critical evaluation of experimental syntax. It can profitably be read together with the chapters by Featherston and in particular by Häussler and Juzek. In sum, the chapters of this volume shed new light on whether and how linguistic intuitions can be used in theorizing about language. Hence it is hoped that they will help advance recent debates on the nature and methodological roles of linguistic intuitions.

PART I

ACCOUNTS OF LINGUISTIC INTUITIONS

2 Linguistic intuitions: Error signals and the Voice of Competence
Steven Gross

2.1 Introduction

A substantial portion of the evidential base of linguistics consists in linguistic intuitions—speakers’ typically non-reflective judgments concerning features of linguistic and language-like stimuli. These judgments may be elicited, for example, in answer to such questions or requests as:

Is the following sentence natural and immediately comprehensible, in no way bizarre or outlandish (cf. the gloss on “acceptability” in Chomsky 1965)?
    She likes chocolate anymore.

Just going by how it sounds, /ptlosh/ is not a possible word in English, but /losh/ is. Please rate the following candidates on a scale from 1 (definitely not possible) to 5 (definitely possible): /fant/, /zgant/, . . .

Do the bolded terms in this sentence co-refer (Gordon and Hendrick 1997)?
    John’s roommates met him at the restaurant.

Which phrase are you most likely to use with a friend when leaving (Labov 1996)?
    (a) goodbye (b) bye (c) bye-bye (d) see you (e) so long

Because of their evidential centrality, linguistic intuitions have been the focus of much methodological reflection. There are well-known worries concerning both how they are collected and how they are used: for example, linguists often use themselves as subjects, risking confirmation bias; they may gather too few intuitions to enable statistical analysis; and they may rely on intuitions too much, failing to seek converging (or disconfirming) evidence of other sorts. There are also well-known replies to these worries. For example, intuitions are now often gathered from a statistically well-powered number of naïve subjects in a controlled

Steven Gross, Linguistic intuitions: Error signals and the Voice of Competence. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Steven Gross. DOI: 10.1093/oso/9780198840558.003.0002


setting; the comparisons that such work has enabled with linguists’ own intuitions have tended to validate the latter; and there is an ever-growing exploration of other sources of evidence. Much more can be said on these matters. (For reviews with further discussion and references, see e.g. Schütze 1996, 2011; Sprouse and Schütze forthcoming.) I mention these familiar debates in order to set them aside and to distinguish them from this chapter’s main question. All parties agree that linguistic intuitions can be and often are a good source of evidence. Why are they? What about their etiology enables them to be a good source of evidence? This chapter suggests that error signals generated by monitoring mechanisms play a role. It will not establish that this is so, but aims instead to render it a plausible, empirically motivated hypothesis and to consider some of its philosophical consequences. There exists a sizable body of psycholinguistic research on language-related monitoring. But its potential relevance to the etiology, and thus to the evidential status, of linguistic intuitions has not been much explored.¹ It is not intended that the proposal should extend to all linguistic intuitions. Methodological discussions of linguistic intuitions often focus on acceptability judgments as evidence in syntax, but judgments concerning other features—in the examples above, pronounceability, co-reference, and likelihood of use, but not only these—play a significant evidentiary role as well. It is far from obvious that the same account can be given for each. Thus, after exploring the possible role of error signals in generating some judgments of unacceptability, I suggest that linguistic intuitions in fact do not form a natural kind with a shared etiology, discussing in particular the role of utterance comprehension. It is also no part of my proposal that, in those cases where error signals do play some role, there are no other significant causal factors or sources of warrant. 
¹ Important exceptions include Sprouse (2018) and especially Matthews (n.d.).

The etiology of linguistic intuitions is of interest for several reasons, beyond the intrinsic interest of better understanding any instance of the mind–brain’s goings-on. For one, progress in this specific case contributes to our understanding of intuitive judgment more generally, a topic of significance for both psychologists and philosophers (DePaul and Ramsey 1998). For another, there is the aforementioned question of the evidential status of linguistic intuitions. While their ranking as good evidence may not require a deeper knowledge of their etiology (Culbertson and Gross 2009), such knowledge can certainly clarify and further secure it. Finally, a better understanding of linguistic intuitions’ etiology enables us to answer more fully a challenge raised by Michael Devitt (2006b, 2006c) to mentalist conceptions of linguistics—conceptions according to which linguistics is a branch of psychology investigating mental mechanisms and processes implicated specifically in language acquisition and linguistic behavior. Indeed, it is this challenge—and its bearing on broader questions in the philosophy of linguistics—that motivates and frames the present study. Accordingly, I begin by providing some background on Devitt’s views and the discussion they have elicited.

2.2 Devitt on the “Voice of Competence” and his modest alternative

Why can linguistic intuitions serve as evidence in linguistics? Devitt (2006b, 2006c) contrasts two answers.² According to the “Voice of Competence” view that he rejects, linguistic intuitions are the product of a modularized language faculty that alone delivers the relevant information, or content, to mechanisms responsible for judgment. Judgments with such an etiology, on this view, can provide fruitful evidence for linguistic theorizing, because they directly reflect constraints built into mechanisms specifically implicated in language acquisition and linguistic behavior, and thus give speakers privileged access to linguistic facts. This is the view that Devitt ascribes to proponents of a mentalist conception of linguistics.³

According to Devitt’s own “modest view”, while linguistic competence may give access to the phenomena that linguistic intuitions are about, it does not supply the content of these intuitions. Rather intuitions are arrived at via ordinary empirical investigation, by using the mechanisms responsible for judgment more generally (“central systems”). Linguistic intuitions, thus produced, can provide evidence for linguistic theorizing because experienced language users, having been immersed in language, make fairly reliable judgments about many linguistic matters, just as those immersed in their local flora may be fairly reliable about aspects of it. Devitt calls his view “modest” because it need not advert to any mental states or processes beyond those to which any account of judgment is committed. Importantly, according to Devitt, linguistic intuitions, because they are empirical judgments, are theory-laden, as all such judgments are.
² Devitt has developed and defended his views in a large number of subsequent papers, which can be found on his webpage. See also his reply in this volume.

³ Context makes clear that Devitt is here using the word “information” not in the information-theoretic sense, but to indicate representational content. Henceforth I use “content,” in order to avoid confusion. Devitt uses “modular” in the Fodorean sense (Fodor 1983). Mentalism about linguistics does not require accepting all aspects of Fodorean modularity (see e.g. Collins 2004 for differences between Fodor and Chomsky on linguistic competence and modularity); and there is now a variety of conceptions of modularity on the market (e.g. Carruthers 2006). I will attempt to bracket these matters.

Devitt argues that his view provides a better answer to the question “Why are linguistic intuitions a good source of evidence?” Among his main arguments is this: not only do we lack a positive account of how a module embodying grammatical constraints might generate intuitions suited to play the evidential role the mentalist requires, but it is hard to see how such an account might go; we lack so much as “the beginnings of a positive answer” (Devitt 2006b: 118). (Indeed, linguists themselves sometimes lament our relative ignorance of aspects of the etiology of linguistic intuitions; see Schütze 1996 and Goldrick 2011.) It is this challenge that the present chapter aims to address. (I return below to some other considerations that Devitt raises; still others are addressed in Maynes and Gross 2013.) It might illuminate why Devitt thinks that there is a problem in the first place if one notes that he raises this challenge specifically for conceptions of linguistic modules according to which grammatical constraints are embodied in computational operations rather than explicitly represented. If grammatical constraints were explicitly represented, then—Devitt suggests—linguistic intuitions might be derived within the language module in a quasi-deductive fashion. (Devitt assumes that the relevant intuitions are judgments of grammaticality. But in current practice judgments of grammaticality are, typically, not sources of evidence but rather reflective judgments made by theorists to explain judgments of acceptability and other sources of evidence; see e.g. Myers 2009; we return to this shortly.) Devitt’s challenge is raised in reply to those who reject the explicit representation of grammatical constraints—arguably the vast majority of researchers in the field. It asks how else such intuitions could arise in a way that affords the speaker privileged access to the linguistic facts; and it suggests that there may not be any other way. (Devitt rejects the Voice of Competence view also for conceptions on which grammatical constraints are explicitly represented—albeit he does so on other grounds.)

But more is at stake than just the source and epistemic status of linguistic intuitions.
Devitt’s argument for his modest view is part of a larger argument against mentalist conceptions of linguistics. Recall that, according to such conceptions, linguistics is a branch of psychology that investigates mechanisms specifically implicated in language acquisition and linguistic behavior. According to Devitt, linguistics is not, or ought not to be, so conceived. Rather its object is, or should be, linguistic reality: the facts about language, or about specific languages—which exist, independently of any specific speaker, as conventions among populations (as opposed to as Platonic abstracta, à la Katz 1981). Devitt thus endorses an E-language rather than an I-language conception of what linguistics is, or ought to be, about (Chomsky 1986b). His view of linguistic intuitions fits into his larger argument as follows: if the Voice of Competence view best explained why linguistic intuitions can be evidence, that would supply a consideration in favor of the mentalist conception. But, argues Devitt, it does not best explain it; hence it does not supply such a consideration. Answering Devitt’s challenge thus speaks to this element of his abduction in favor of his antimentalist conception of linguistics.

 

2.3 Clarifying the options and locating the current proposal

In fact matters are more complicated than deciding between the Voice of Competence view and Devitt’s modest alternative. These two views do not exhaust the possibilities, and indeed, in previous work I have argued against both options. Briefly reviewing those arguments will help clarify the claims of the current chapter.

Against Devitt’s view, Culbertson and Gross (2009) argue that one doesn’t find in linguistic intuitions the divergence this view predicts. Devitt maintains that linguistic intuitions are theory-laden and so can diverge across speakers with different relevant background beliefs, including different commitments concerning linguistic theories. Indeed, Devitt, far from worrying about confirmation bias, argues that linguists should prefer their own linguistic intuitions to those of native speakers, who are naïve about linguistics; for the better (more reliable) linguistic intuitions will be those of speakers with better theories. But we found a high degree of consistency among subjects with very different degrees of expertise in linguistics—subjects ranging from total non-experts to practicing syntacticians. This suggests that linguistic intuitions—at least of the sort we elicited—may be fairly stable across changes in relevant background beliefs and experience, and thus are not theory-laden in a way or to a degree that matters to linguistic inquiry. They may rather reflect their pre-judgmental etiology to a particularly robust degree.⁴

On the other hand, Maynes and Gross (2013) argue, inter alia, against the Voice of Competence view—or at least they reject the idea that mentalists should see themselves as committed to it. Recall that Devitt builds into his characterization of the view the idea that the language faculty itself outputs the content of the intuition (henceforth the “content requirement”). But there is nothing about mentalism that requires this.
Consider the judgment that some string is unacceptable. Mentalists need not commit themselves to the view that the language faculty itself outputs a state with the content That string is unacceptable. It can suffice that the parser fails to assign a structural description to the string and that the absence of a parse can in turn play a causal role in the process that leads the speaker to judge that the string is unacceptable. The inclusion of the content requirement stems from Devitt’s emphasis on speakers’ privileged access to linguistic facts. For, if the language module supplies the content of linguistic intuitions, that might explain the source of this privilege.

⁴ Devitt (2010a) replies and Gross and Culbertson (2011) respond. “Reliable” is used here not in the psychologist’s sense of being consistently produced in similar circumstances, but in the philosopher’s sense of tending to be accurate (as with a reliable thermometer); this is what psychologists would call validity. Note that, although relative expertise in linguistics did not matter in our experiment, one group—those with no formal exposure to the mind–brain sciences at all—was an outlier. Culbertson and Gross (2009) hypothesize a deficiency in task knowledge.


Recall, however, that judgments of grammaticality (as opposed, for example, to judgments of acceptability) are not, or no longer, typical of the metalinguistic judgments linguists rely on as evidence. Mentalists, in relying on the kinds of linguistic intuitions they in practice do, thus need not assume that speakers have privileged access to whether strings are grammatical. (Perhaps speakers have defeasible privileged access regarding acceptability.) Mentalists need only maintain that linguists’ theorizing involves an abduction from linguistic intuitions— and from any other available considerations—to claims about a language faculty. Thus they might, for example, elicit acceptability judgments under a variety of conditions and with a variety of stimuli, intending to control for alternative explanations. This does not require that speakers have privileged access to the ground or causal source of their judgments—in particular, privileged access to why they judge a sentence (un)acceptable. Indeed, sentences can be unacceptable for any number of reasons. To take a classic example, multiply center-embedded sentences may be judged unacceptable owing to memory limitations instead of a grammatical violation. If linguistic intuitions are not theory-laden in the way Devitt expects, and if mentalists may reject the content requirement, then the positions Devitt discusses are not exhaustive. Thus, with Devitt’s alternative rejected in Culbertson and Gross (2009), Maynes and Gross (2013) defend a mentalist conception of linguistic intuitions sans the content requirement. 
This conception rejects as well the idea that a mentalist conception of the evidential status of linguistic intuitions requires that speakers possess privileged knowledge regarding grammaticality, while allowing that the special role grammaticality constraints can play in the generation of linguistic intuitions may enable those intuitions to serve as evidence for those constraints, in a manner relatively unaffected by changes in relevant belief and expertise. Against this background I can clarify the aims of the present chapter. The suggestion bruited above, that a failure to parse can cause a judgment of unacceptability, is a rather bare etiological claim, even if “sufficient unto the day” in the context of Maynes and Gross’ (2013) response to Devitt. In what follows I buttress this reply by developing further suggestions concerning the etiology of linguistic intuitions. In particular, I suggest that error signals generated by monitoring mechanisms may play a role in some cases. I also suggest, more briefly, that in other cases the intuition’s etiology may amount to little more than the etiology of comprehension itself. Interestingly, these further suggestions provide some grounds for entertaining a stronger thesis than the one I previously defended. For, although mentalism per se need not build in the content requirement, the error signal story, as we shall see, may allow the content requirement to be satisfied, at least by some intuitions—and similarly, in some cases, for the comprehension account. The Voice of Competence view (or something like it, as we shall see) may thus be true after all, at least in those cases! But it’s important to note that this is indeed a further claim: one can parry Devitt’s etiological challenge without endorsing satisfaction of the content requirement. Devitt might reply that it is essential to the Voice of Competence view, as he conceives it, that speakers have privileged access to whether strings are grammatical. If so, satisfaction of the content requirement doesn’t suffice for the Voice of Competence view, even if it provides for a view that is otherwise like it. Likewise, it is possible that Devitt sees his etiological challenge as presupposing a mentalist endorsement of the speaker’s privileged access concerning grammaticality. If so, my reply to the etiological challenge sans this presupposition is a reply to a variant of Devitt’s challenge, one suggested by it and worth addressing. But, in considering just what content error signals may have, I will also mention the even more speculative possibility that some error signals have content more specifically about grammaticality. The error signal story may thus even offer resources to someone attracted to a Voice of Competence view with some such privileged access component built in. (Some parallel indications are marked but not developed for comprehension cases.) Again, this would be a further claim, one that would go beyond maintaining that error signals have an etiological role, and also beyond adding that they enable satisfaction of the content requirement.⁵

2.4 Candidate monitoring mechanisms

There’s no consensus regarding the mechanisms involved in monitoring language use (for a brief survey, see Nozari and Novick 2017, to which this section’s summary is indebted). But that some such mechanisms are involved in the prevention, detection, and correction of linguistic errors is a widespread view; and the correct details do not matter for my main point. Nonetheless, it’s worth indicating some of the more specific extant ideas, both for the sake of concreteness and to underscore that my suggestion isn’t ad hoc, but rather adverts to ongoing, independently motivated theorizing. (In that sense, my suggestion is thus also modest.) In addition, though the mere existence of error signals generated by monitoring mechanisms might suffice for my reply to Devitt’s challenge, the details do matter for some more specific questions flagged in what follows.

⁵ Henceforth I use the Voice of Competence label for the view that linguistic intuitions are the product of a modularized language faculty that alone delivers the relevant content to mechanisms responsible for judgment; and I take up the question of privileged access as a possible further requirement rather than building it in. It’s of course less important which positions are allowed the label, so long as the positions themselves are clear. Note that Rey’s (this volume; forthcoming-b) mentalist defense of the Voice of Competence view involves dropping the content requirement. He thus rejects Devitt’s characterization of the view, whereas Maynes and Gross (2013) deploy Devitt’s characterization (after all, it’s his term) in rejecting Devitt’s ascription of the view to the mentalist.


Initial ideas in this area were developed in theorizing about monitoring for speech production errors, and so we start with some examples of these. Production monitoring might seem not directly relevant to my topic, since linguistic intuitions are elicited in response to presented stimuli. But monitoring mechanisms have been posited in comprehension as well, and, according to some, monitoring mechanisms in production and comprehension are intimately related (Pickering and Garrod 2013). Maynes and Gross (2013) cite Levelt’s (1983, 1993) perceptual loop theory, according to which we monitor our production via comprehension. The basic idea is simply that we listen to what we ourselves say. Evidence for this includes the observation that blocking auditory feedback with ambient white noise negatively affects our ability to catch production errors (Oomen et al. 2005). But it seems that this is not the only, and perhaps not even the central, mechanism: among aphasic patients there is a double dissociation between comprehension and self-speech error detection (reviewed in Nozari et al. 2011). An alternative view hypothesizes that an efference copy of the motor command is sent to a forward model that generates, for checking, an expectation concerning future states of motor or sensory systems (or both) (Tian and Poeppel 2010). This temporally more plausible approach applies to linguistic production a widely held view of motor control more generally (Wolpert 1997; Shadmehr et al. 2010). It is less clear, however, how classic versions of such views work or are motivated for higher level, more abstract linguistic features such as syntactic structure, since such features are “upstream” from motor commands. Conflict models (Nozari et al. 2011) do not require efference copies. On such views, what is monitored is the comparative activation level of candidate linguistic representations (lexical items, phonemes, etc.) 
with a conflict signal produced when the difference in activation is insufficient for there to be a clear winner. In production, this might occur, for example, when /b/ and /c/ both get activated, to a sufficiently close degree, when you are trying to say “cat.” In comprehension, an ambiguous signal, for example, might likewise lead to competing candidate representations with insufficiently differentiated activation levels. Finally, there are many models of comprehension that incorporate prediction (Kutas et al. 2011). For example, an incremental parser, governed by various grammatical constraints, may generate expectations regarding syntactic features of lexical items to come. A mechanism might then monitor discrepancies between what is expected and the incoming signal. Forward models are examples of monitoring mechanisms that incorporate prediction on the production side. (See Pickering and Garrod 2013 for an attempt to integrate prediction in production and comprehension.) Research in this area is active and ongoing. But we need not place bets. We can consider the consequences, should some such model pan out.
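
As a purely illustrative aside, the two kinds of monitoring just described can be given a toy rendering in Python. The sketch is not drawn from Nozari et al. (2011), Kutas et al. (2011), or any other model cited here. The function names, the activation values, and the `margin` threshold are all hypothetical, and the use of surprisal to measure a defeated expectation is an assumption made only for concreteness.

```python
import math


def conflict_signal(activations, margin=0.1):
    """Toy conflict monitor (hypothetical, not a published model).

    `activations` maps candidate representations to activation levels.
    A conflict is signaled when the top two candidates are closer than
    `margin`, so that there is no clear winner.
    """
    if len(activations) < 2:
        return False  # a lone candidate cannot be in competition
    top, runner_up = sorted(activations.values(), reverse=True)[:2]
    return (top - runner_up) < margin


def surprisal_bits(expectations, observed):
    """Toy prediction monitor (hypothetical, not a published model).

    `expectations` maps possible upcoming items to predicted
    probabilities. The return value is the surprisal, in bits, of the
    item actually observed; an item that was not expected at all gets
    infinite surprisal.
    """
    p = expectations.get(observed, 0.0)
    if p <= 0.0:
        return math.inf
    return -math.log2(p)
```

On this sketch, `conflict_signal({"c": 0.52, "b": 0.48})` signals a conflict, mirroring the case in which /b/ and /c/ are activated to a sufficiently close degree, whereas a clear winner such as `{"c": 0.90, "b": 0.20}` does not. The prediction monitor yields a graded value rather than a yes-or-no verdict, which is the only feature of it that matters for present purposes.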

 


2.5 Error signals and linguistic intuitions

Whatever the model, suppose that problems generate an error signal. Consider, for instance, a failure to parse owing to ungrammaticality. Perhaps, on a conflict model, no structural description is activated at a level sufficiently greater than the rest to “win” the competition among candidates; or, with a predictive parser, perhaps the signal’s completion fails to meet the expectations generated by previous material—perhaps even after attempted reanalysis. If monitoring mechanisms in such situations generate an error signal, then we are not limited to suggesting, as Maynes and Gross (2013) did, that a failure to parse yields an absence of output. The parser may issue an error signal.⁶ The error signal, if it can in turn play a role in generating a judgment of unacceptability, enables me to elaborate my reply to Devitt’s etiological challenge. One can bolster the suggestion by noting various features such signals can have that mesh with features commonly associated with judgments of unacceptability. First, there is negative valence. It is an error signal after all. Likewise, subjects may express their negative judgment of a string by saying that it sounds bad: compare Pietroski’s (2008) use of “yucky” (or “icky,” another technical term I recently encountered in a linguist’s talk). Second, error signals can have motivational force—corrective in production, corrective or aversive when interpreting others. Maynes and Gross (2013) note a possible connection to social cognition and in-group–out-group identification. Linguistic intuitions are typically divorced from actual use; they concern rather what one could or would say or understand; but they still may be associated with an off-line or dispositional motivational force. Third, error signals might suggest the violation of a norm—indeed, perhaps they are a source of linguistic prescriptivism. Linguistic intuitions likewise may be associated with a sense of wrongness.
Fourth, error signals may be graded—that is, they may come in various strengths. Defeated probabilistic predictions, for example, can be associated with varying degrees of surprisal. Similarly, the gradedness of linguistic intuitions has long been noted, whether in linguists’ use of varying marks (*, **, ?, *?) to record their own judgments or in the graded results of formally collected judgments ranked on a scale. Finally, each of these features may have an associated phenomenology: a felt sense of badness, motivation, and norm violation of some particular strength. Often they may not, or not to a noticeable degree; otherwise the signals generated by the prevalent disfluency in ordinary speech might get in the way of conversational flow. But perhaps they may if attention is appropriately deployed, as may well happen when one is a subject performing an acceptability judgment task. Conscious or not, the signal may play a causal role in the generation of a linguistic intuition and may be relevant to the epistemic status of the judgment. But noting the possibility of error signals rising to consciousness lends some measure of introspective plausibility to the proposal. Moreover, states capable of consciousness arguably can play a particular sort of epistemic role that states in principle unconscious cannot: they can serve as justifications for the epistemic agent (as opposed to serving, at best, as warrants unavailable to the agent)—in the case at issue, justifications for the person’s forming metalinguistic judgments.⁷

⁶ Indeed, a possibility—and not the only one (here is a place where the details do matter)—is that the absence of a parse, given the activation of the parser by the string, itself plays this very role. The monitoring system may be so constructed as to construe this absence (given the cueing up of the parser by a language-like stimulus) as a signaling of error. This would not allow, however, for gradedness. Alternatively, the absence may partially constitute the error signal, or the error signal may be a completely distinct state.

2.6 Error signals as the Voice of Competence?

The parallels between error signals and judgments of unacceptability do not establish that error signals play a role in the generation of linguistic intuitions, but they help render it a plausible hypothesis, worth exploring. Even harder to establish is the more speculative claim that such signals might satisfy the content requirement. Nevertheless, the possibility is worth serious consideration, and one can provide some motivation. The error signal is a signal. It functions to deliver information to monitoring systems concerning what is occurring in language-related mechanisms, so as to initiate repair, reinterpretation, or some other corrective measure or appropriate response (perhaps including asking for clarification: see further, section 2.8). Moreover, the states it functions to deliver information about are representational states, concerning the structure of the presented string, for example. Given the signal’s functional role, with these sorts of causes and these sorts of downstream effects, it is a natural thought that the signal itself might have representational content. Whether that thought is correct or not depends on the criteria for a mental state’s bearing content, a highly contentious matter. But a leading view is that assigning a mental state content on the basis of what causes it and how it is consumed is warranted when doing so yields explanatory illumination (e.g. Shea 2007, 2012). Arguably, this condition is met by error signals: our understanding of what a monitoring system does involves our seeing the system as capable of being informed of a problem, and it is the error signal that does the informing. Note, moreover, that it seems in principle possible for a monitoring system to itself be in error (see n. 10), so that a signal can misrepresent the state of the system being monitored. The possibility of misrepresentation is often considered a necessary condition of intentionality (Dretske 1986).

⁷ The nature and types of justification and their relation to conscious access is too large and controversial a topic to develop here. For some discussion and pointers, see Pappas (2017).

 


Just what would the content of an error signal be? There are various natural candidates. But all seem to enable the satisfaction of at least some version of the content requirement. Consider again a failed parse and a subsequent judgment of unacceptability. The candidate gloss that most obviously would make the case is This string is unacceptable—or perhaps such close relatives as This string has unsurmountable problems or Something is wrong with this string (or variants that refer to the utterance). If the error signal’s content is thus glossed, then clearly, at least for such intuitions, one could defend the Voice of Competence view, even with its content requirement. In these cases, speakers’ judgments can reflect constraints built into a modularized language faculty that outputs the content contained in the judgment. This is the clearest and best case, so far as satisfying the content requirement goes. But a consideration of alternatives only lends further support. Our opening gloss makes reference to the string. But perhaps the state is more purely interoceptive, indicating only how things are with the subject. If so, a better gloss might be I have—or this mechanism has—encountered unsurmountable problems. It is not obvious that the error signal itself makes no reference to the string as the locus of the problem; but, even if this is so, the content requirement may be satisfied. For it remains the case that linguistic processes yield a representation of the stimulus (minimally, a representation as of these phonemes and lexemes, in this order). Plausibly this representation, together with the error signal, can supply the content of the unacceptability judgment. But perhaps the error signal indicates, not that the problem has become unsurmountable, but only that problems persist: There is (still) a problem. 
The further information that there are no options left, or that further efforts are not worth the cost, might require further states—for example, perhaps a monitoring state that indexes effort expended. Yet it would remain the case that error signals not only play an etiological role in the generation of the unacceptability judgment, but also supply content to monitoring states that, perhaps in concert with the representation of the string itself, supply the content of the judgment. Thus the content requirement would still be satisfied. Similar remarks apply to the suggestion that error signals have imperatival content (see Klein’s imperatival view of the content of pain states in Klein 2007)—perhaps glossed as Appropriately respond to this problem! (Suppose they only have such content. If this is just additional content, no issue is raised.) The persistence of such states can play a role in generating states that indicate that there is nothing more that is worth doing. An intriguing—still more speculative—possibility is that the signal carries more specific information about the source or nature of the problem. Recall that we motivated the assignment of content to error signals by reflection on their functional role, which is to inform a monitoring system of a processing problem so as to cause an appropriate response. But, just as a fire department needs information concerning where an alarm is coming from if it is to do anything about it, so may the monitoring system require more specific information about the signal’s source, or about the kind of problem. Suppose the signal itself carries that information. In our example, given the source of the problem (a failure to parse owing to embodied grammatical constraints), one might argue that the signal has a content that reflects this specificity: not just that the string is unacceptable, but that it is ungrammatical (as opposed to unpronounceable, pragmatically unacceptable, etc.). Indeed, perhaps an error signal could in principle indicate the nature of the violation even more specifically—for example as subjacency violation. Some such suggestion might be particularly tempting if the monitoring mechanisms that consume such signals are domain-general, since a domain-specific monitoring system might not need to sort among signals’ varying causes. Whether the monitoring systems implicated in language use are domain-general or domain-specific (or some of each) is an unresolved empirical question, and one where the details matter (Nozari and Novick 2017; but see Dillon et al. 2012 for some event-related potential (ERP) evidence that error signals can encode their cause).⁸ Suppose error signals do have more specific content. It would not follow that judgments caused by or based on them also have this more specific content. It is one thing for a state to have some content and another for the subject to conceptualize it as such, or even to be able to. Compare the representation of color features. The visual system might represent some object as having certain color features; but the color categories available to conceptual systems may be much coarser than those available to vision. (This could be so even if color perception is categorical, though see Witzel forthcoming for arguments that it isn’t.)
Similarly, even if error signals represent more specifically the source or kind of error, it is a further question whether the judgments subjects make on their basis do so as well. Thus, in principle, an error signal with the content That string is ungrammatical could cause and be the basis for a judgment with the more generic content That string is unacceptable. If so, should we say that the content requirement is satisfied? The signal and the judgment do not have the same content. But the signal’s content warrants the judgment’s in a way that seems analogous to how perceptual contents more generally can warrant perceptually based judgments with different but closely related contents (see Peacocke 2004 on canonical conceptualizations). I suggest that this should suffice for a content requirement worth preserving. The possibilities entertained so far are all consistent with denying speakers privileged access to grammaticality facts. Even if the signal’s content is supplied by

⁸ In support of more specific content, one might also advert to subjects’ often being able to provide some indication of where the problem lies and to suggest fixes. It is unclear, however, whether this on its own favors more specific error signal content over a Devitt-like specification in judgment, in light of a less specific error signal.

 


linguistic competence (whatever level of specificity that content may have), it doesn’t follow that the speaker knows that it is. We generally have unreliable introspective access to the causal source of our judgments, and content not conceptualized as such is not available to judgment. Thus, the “voice” of competence may be “heard”—in the sense of its causing and supplying a basis for a judgment—without the subject’s knowing that it is competence that is “speaking.” Theorists would still need to engage in the hard work of inference to the best explanation—as, for example, is the case in trying to sort out whether the judgment data concerning binding phenomena reflect syntactic or pragmatic constraints (e.g. Chomsky 1981 and Reinhart 1983). But suppose we contemplate the possibility that error signals about ungrammaticality cause and support linguistic intuitions with corresponding content: metalinguistic judgments with the content That is ungrammatical. This could help provide grounds—at least regarding some intuitions—for a Voice of Competence view that also requires privileged access to grammaticality facts. (Privileged access might require reliability as well. Perhaps cases such as sentences with multiple center embeddings could be deemed outliers.) Thus might one try to defend a limited application of the full view Devitt opposes. The application, however, would be quite limited indeed, if grammaticality judgments are in fact not often invoked as evidence, as opposed to in explanation. Still, perhaps subjects may in some cases feel and be able to express that a problem seems grammatical, or in some sense structural. Linguists, with their theoretical expertise and fuller array of concepts, may even achieve further specificity: perhaps a particular case may feel like a subjacency violation.⁹ That said, methodological caution suggests that we distinguish more solid evidence from what could be merely theoretical hunches.
But we need not dismiss the possibility of such linguistic intuitions out of hand, even if they are not given much weight or play in practice. (I mark below some other avenues of possible support.) The question of specificity bears on a challenge Devitt raises to the Voice of Competence view. He asks: If our competence “speaks” to us, why does it not say more? Why do linguists not find themselves with a broader array of speakers’ intuitions to draw on—specified in the linguist’s language of c-command, heads, and so on? Given my remarks above, we can divide the question into two. Why doesn’t the language module deliver more specific content (if it doesn’t)? And, supposing it does, why is it that speakers typically cannot non-reflectively conceptualize that content in judgment? It is not clear that the mentalist’s inability to answer at present would be particularly problematic. But the questions are interesting nonetheless. One way to take them is as design questions, so that we might ask what purpose would be served by having things other than they are, and whether things overall would be in some sense better if they were that way. I posed the question of the domain generality versus domain specificity of monitoring mechanisms in this functional way. It is important to bear in mind, however, first, that speakers may not achieve optimality regarding the generation of linguistic intuitions (there is certainly no reason to expect our capacities to be optimal for enabling successful linguistics!); and, second, that our capacity for yielding intuitive metalinguistic judgments may be a by-product—for example, of monitoring systems that may operate in large part unconsciously. (Questions of function briefly recur in section 2.8.)

⁹ It is not enough that one possess the relevant concepts, though that is necessary and is a further—and later—achievement than acquiring language itself (Hakes 1980). If the concepts are to contribute content to intuitive judgments, one must also be able to deploy them unreflectively in conceptualizing one’s experience in response to presented strings. Acquiring syntactic concepts in a linguistics course may not suffice for this.

2.7 Other linguistic intuitions, other sources

I have focused on intuitive judgments of unacceptability, remarking on the possibility of intuitive judgments of ungrammaticality as well. But what of other linguistic intuitions? The various judgments that have been called “linguistic intuitions”—for example, those with which this chapter starts—may not form a natural kind. Different kinds of linguistic intuition may have different etiologies and may require different accounts of why they constitute evidence. In principle, Devitt’s view could be right about some, wrong about others; and where it’s wrong, a Voice of Competence view incorporating the content requirement may likewise be right for some, but not for others. Let us consider some cases. What of judgments that a string is acceptable? There is an obvious asymmetry here. In such cases, there is presumably no error signal to play an etiological role, so the content of the intuition would not seem to be the content of some output of the language faculty. A possibility is that, while error signals lead to judgments of unacceptability, their absence leads to judgments of acceptability. As for satisfaction of the content requirement, well, the suggestion was restricted to some linguistic intuitions; perhaps judgments of acceptability are not among them. But an alternative to invoking merely the absence of error signals would instead extend the account to encompass positive, “non-error” signals. Perhaps monitoring mechanisms, when functioning properly, should be construed as always vigilant and thus always in effect receiving a signal. The absence of an error signal would then be itself a signal of proper functioning—perhaps, further, a state with content to the effect that This string is acceptable.
If so, the content requirement could be satisfied after all, and the Voice of Competence view extended to judgments of acceptability.¹⁰

There is, however, another possibility to consider. Perhaps what is causally and epistemically most significant for judgments of acceptability is neither the absence of error signals nor the presence of a positive “no error” signal, but rather the speaker’s having comprehended what was said. As we saw, Chomsky incorporates comprehensibility into his gloss on acceptability. Indeed, having noted this possibility, one might challenge the need for error signals in accounting for unacceptability judgments as well: perhaps it suffices that one not comprehend the string. But it would not follow that error signals do not play a role. Even if one factor would suffice, both might be present. (The same source could cause both the signal and (ultimately) the failure to comprehend.) In any event, adverting to comprehension cannot serve as a complete account. For there are strings that are readily comprehended, but also readily judged unacceptable, such as “She seems sleeping.” Chomsky’s gloss does not require only comprehensibility.¹¹ That said, we can certainly assign comprehension (or the lack thereof) a significant role in the generation of and grounds for (un)acceptability judgments, without challenging the error signal suggestion. The error signal account need not exhaust the etiology of linguistic intuitions generally or of any specific kind of linguistic intuition. And allowing a role for (in)comprehension is not in tension with my aim of elaborating on Maynes and Gross’ (2013) reply to Devitt’s etiological challenge. Indeed, it meshes with it, since the fact of (in)comprehension also yields data for the linguist’s inference to the best explanation, data that can be relatively robust to variation in background theory. Note that, pace Devitt’s complaint, in this case we do have some idea of how one gets from embodied grammatical constraints to the linguistic intuition. Of course, major gaps exist in our knowledge of utterance understanding (this is what much of linguistics is about), but there is no further special gap introduced by linguistic intuitions of this sort.

¹⁰ One might worry about cases where an unacceptable sentence is judged acceptable (at least at first, or unreflectively)—as happens with plural attraction (“The key to the cabinets are on the table”) and, more persistently, with comparative illusions (“More people have been to France than I have”). Why doesn’t the error signal yield an unacceptability judgment here? If it doesn’t, this need not be an objection: error signals might not always succeed in generating appropriate linguistic intuitions; and our judgments of (un)acceptability need not be infallible. But, in any event, the question’s presupposition that there is an error signal in such cases may be mistaken. Perhaps the question should be: how do such strings get past the parser without generating an error signal? Just as visual illusions provide insight into the fine structure of visual processing, such cases can illuminate the quirks of linguistic processing (Wagers et al. 2009; Wellwood et al. 2018).

¹¹ Another, more contentious reply to the suggested alignment of comprehension and acceptability would invoke alleged cases of uncomprehended strings judged acceptable, such as lines from Jabberwocky or particularly inscrutable bits of philosophy (Rey this volume; forthcoming-b). To my knowledge, judgments concerning such cases have not been investigated in a controlled experimental setting (though see Pallier et al. 2011 for neurolinguistic investigation). But, in addition, a potential counter-reply is that they are sufficiently understood, insofar as subjects construct a metalinguistic or deferential concept to handle the problematic open-class terms (borogoves, whatever they are): cf. Higginbotham (1989). Incidentally, another possible use of such cases (if they are granted) might be to bolster the possibility of useable intuitive judgments of grammaticality (see Rey this volume; forthcoming-b).


Adverting to comprehension in such cases, however, responds to the etiological challenge without satisfying the Voice of Competence’s content requirement. Devitt (2006b: 118), in raising the challenge, maintains that what is delivered to central systems is the “message”—that is, bracketing delicate semantic–pragmatic issues, something like the content of what is said—not information that leads to arriving at it (recall that Devitt has in mind information about grammaticality). I would suggest that what is delivered is rather something to the effect that the speaker said that P in uttering S (mutatis mutandis for other speech acts).¹² Either way, the content of the intuition itself is indeed not delivered. But we can see how the comprehension story could allow a simple transition from the fact of (in)comprehension to a judgment of (un)acceptability—modulo the contribution of an error signal. Comprehension plays a significant role in other linguistic intuitions as well—for example in judgments of co-reference and in truth-value judgments. But here it’s not just a matter of whether one comprehends, but also of what one takes the content to be. For example, to answer whether the bolded terms in “John’s roommates met him at the restaurant” co-refer, one in effect reports whether one understood it to be John whom the roommates met at the restaurant. Of course, answering the question so formulated requires some metalinguistic awareness (and, more specifically in this case, possession of the concept of co-reference), as does any metalinguistic judgment. To that extent, the judgment goes beyond mere comprehension of what is said, though not much.¹³ (Parallel remarks apply to other judgments concerning what “readings” a subject gets.) Similarly, for truth-value judgments it matters, not just that one comprehends the sentence, but also what content one assigns it.
In this case, the judgment goes beyond one’s comprehending the sentence in a further way: one must assess that content against some scenario. Having the capacity to do so in a certain range of cases, however, is arguably in part constitutive, or highly diagnostic, of one’s capacity for comprehending the sentence. (Of course, to form the appropriate metalinguistic judgment, one must also exploit one’s grasp of the concept of truth.)

¹² That the string and not just its content is delivered can be the case even if the string is then more easily forgotten (Sachs 1967). Note that including the string in what is delivered as output to central systems raises the question of how the string is represented at this stage. To the degree that structural information is preserved, this again might provide resources for those who would defend the possibility of privileged access to more specific linguistic features. Whether—and, if so, to what extent and in what ways—syntactic features are perceptually experienced (as phonemes, morphemes, order, etc. are) is a related question—and a delicate one. ¹³ One could eliminate the metalinguistic element in this case by telling the subject that John’s roommates met him at the restaurant (using these words) and by asking whether it was John whom they met. This task would not involve linguistic intuitions if one defines them to be metalinguistic judgments. The difference does not seem to amount to much in this case. Note, further, that, if comprehension involves delivering to central systems a content that relates what is said to the string uttered (as suggested above), then, depending on the details of what is represented, this metalinguistic element may be included in comprehension, even if not in the expressed judgment.

 

Moreover, in these cases where what matters is what one takes the content to be, not only do we have some understanding of the intuition’s etiology (to the extent that we understand the etiology of comprehension), but also, insofar as the intuition amounts to little beyond comprehension, the content requirement is fulfilled and arguably the Voice of Competence is vindicated.¹⁴ Note in particular that Devitt’s characterization of the role of linguistic competence in generating grammaticality judgments (as supplying material for a central system response) seems inapt in those cases where the intuition is just, or goes just a bit beyond, comprehending what was said.

Do error signals play a role in these linguistic intuitions? In positive cases, where the terms do co-refer (where one gets the reading) or where what is said is judged to be true, what one understands the sentence to say seems much more significant, even if the absence of an error signal (or the presence of positive nonerror signals) is concomitant. But the matter is less clear at least in some negative cases. Suppose the terms do not co-refer, or one judges the sentence to be false. Does the subject’s understanding of what was said supply the resources necessary to arrive at these judgments without recourse to error signals? One way to suggest a role for error signals would implicate them in general processes of belief formation and maintenance. Perhaps such processes involve a monitoring system for one’s epistemic states, where clashes between an entertained thought and other beliefs can yield an error signal (perhaps sometimes a conscious, pre-judgmental feeling of wrongness) that leads to the thought’s rejection; and perhaps such processes are implicated in our subject’s negative verdicts. The suggestion would require independent motivation (why think that error signals are required, as opposed to operations that directly adjust beliefs in response to reasons?).¹⁵ In any event, these would not be error signals generated by specifically linguistic monitoring mechanisms.

But it might also be suggested that error signals generated from monitoring mechanisms specifically for language could play a role, at least in some cases. When one considers whether the bolded terms in “Sally gave her the book” co-refer, one might not only consider how one naturally understands the sentence, but also attempt to find a reading where they do co-refer, all the more so if one is asked whether they can co-refer. In such a case, one’s failed attempts might involve error signals (be they from a violated grammatical constraint, from a mismatch between the parse and the semantic supposition, or from a pragmatic violation) that ground one’s judgment. The possibility of a language-specific error signal is less obvious for truth-value judgments. But candidates could be truth-value judgments concerning the special case of strings that are “analytically” false in the sense that they can be known to be false just through the exercise of one’s linguistic competence (Pietroski 2002).

I have been suggesting that comprehension may play a much more significant role for some linguistic intuitions than do error signals. But there are also cases where it is clear that comprehension (in the sense of understanding) plays no role whatsoever. Consider phonological intuitions such as judgments as to whether a stimulus is a pronounceable or a possible word in one’s language. Comprehension plays no role in one’s assessment of /fant/ and /zgant/. The error signal account, on the other hand, neatly extends to such cases (though, since now the source of the signal is different, perhaps so is the signal’s content, if the content is specific). Again, an account that focuses on comprehension is incomplete.

Finally, consider sociolinguistic intuitions such as judgments of leave-taking (the last of my opening examples). The processes by which these are formed may be altogether different. One might try arguing that there is a role for error signals or comprehension. Perhaps one attempts to simulate leave-taking in imagination and registers a feeling of wrongness with some candidates (though the source of this feeling may not be specifically linguistic); perhaps one “understands” some of the candidates as differing in their linguistically encoded level of formality or politeness. But another possibility is that one forms an empirical judgment on the basis of one’s memory of past usage. Such cases might fit Devitt’s model well.

Linguistic intuitions thus may differ in their etiologies, and a single intuition may have several language-specific bases. I have suggested that error signals play an etiological role, but not that an account that adverts only to them is complete. Comprehension plays a significant role for some linguistic intuitions, and perhaps other bases matter as well, for example indices and feelings of effort and fluency (which were briefly mentioned in 2.6; and compare Luka 2005).

¹⁴ More fully defending this would require unbracketing the delicate semantics–pragmatics issues. But see Sperber and Wilson (2002).

¹⁵ David Pereplyotchik (personal communication) rightly points out that an analogous charge could be raised against positing error signals in language monitoring. The question is simply which models are empirically justified.

2.8 Why not intuitions elsewhere? In considering the etiology of other linguistic intuitions, I mentioned the possibility of monitoring mechanisms for other aspects of cognition and behavior; and earlier I noted that forward models are common in accounts of motor control. The possibility that monitoring systems are ubiquitous (more specifically, monitoring that involves error signals) connects to another challenge that Devitt raises for the Voice of Competence view. In some sense, he says, we have “embodied rules” for swimming, typing, and other skills. If “embodied rules” of language yield intuitions that provide fruitful evidence for linguistic theorizing, why don’t “embodied rules” for swimming and the like do the same? Adapting this question to the present discussion, we can ask: If such monitoring is ubiquitous, why don’t error signals concerning swimming and the like provide the basis for fruitful theorizing in those domains? I conclude by offering a reply.

First, judgments analogous to linguistic intuitions do play a central evidentiary role in various domains, most obviously in perception science. So, to that extent, linguistics is not a special case. Indeed, even acceptability judgments more specifically (or something very much like them) are exploited in some domains, for instance in music cognition (Patel et al. 2008; Featherstone et al. 2013) and in moral psychology (Cushman 2015).

But what about the sort of motor skills Devitt emphasizes? There is no a priori reason why intuitions could not supply fruitful evidence in these domains, and some instances can be found. For example, Ward and Williams (2003) had players of different skill levels view video clips of soccer action sequences and then answer questions about the likely direction of the next dribble or pass and about the appropriate positioning of teammates for receiving a pass.

We may grant, however, that judgments about the exercise of motor skills or their products are less commonly exploited as evidence about those skills. Why might this be so? There is a variety of possible (non-exclusive) explanations. It could simply be that other methods have proven sufficiently fruitful to have occupied these fields so far. But there may also be differences that would explain why we should not expect intuitions ever to play a central role. Most relevantly to our topic, the usefulness of intuitions in linguistics, as opposed to swimming, may in part reflect the crucial role of a dedicated, more or less modular mechanism with its proprietary quirks, recursivity, and interface constraints. That is, intuitions may provide fruitful evidence for theorizing about (aspects of) language, but not about swimming, precisely because something like the Voice of Competence view is correct in this domain.

Other differences may also be relevant. For example, many skills arguably lack an analogue of comprehension, in two senses. First, for most skills, to perceive exteroceptively an act of exercising a skill is not itself to exercise that skill: watching someone swim is not itself swimming, perhaps against the view of some mirror neuron theorists.¹⁶ Second, linguistic comprehension involves understanding what someone said. But, for many motor skills, their exercise does not involve understanding the actions of an agent, in particular understanding the agent’s (communicative) intentions. This is relevant to intuitions in that, as I have noted, for some linguistic intuitions, comprehending what was said just is, or is almost, the forming of the intuition itself. Finally, comprehension in this second sense is connected to the particular demands for coordination found in communication, which is among the primary functions that language subserves. To be sure, there is synchronized swimming. But, arguably, the specific forms that coordination takes in language use (asking for clarification of an ambiguous or just hard-to-hear utterance, engaging in lexical negotiation) require, or at least promote, a capacity for metalinguistic awareness that, perhaps as a by-product, renders language users especially able to provide useful intuitions in this domain. Language-related education (most obviously, years of concentrated training in reading and writing) may similarly play a role (see Schütze 1996). This is not to deny that the exercise of other skills can involve coordination with others and the comprehension of intentions. But, when they do, as with soccer, we have seen that judgment data may be useful after all. More generally, exercising different kinds of motor skill may make varying demands on cognitively driven conscious control; Devitt’s examples may simply fall at the less demanding end (cf. Montero 2016; Gregory et al. 2016).

Here and above, this exercise in empirically motivated speculation deploys a fair number of “maybes” and “perhapses.” But, whether the suggestions pan out or not, they at least provide “the beginnings of a positive answer” to the questions I have posed, however transformed those questions may be from Devitt’s.

¹⁶ Interoception in production may be in some sense an aspect of the exercise, and, again, in principle it could serve as a source of evidence. But, as noted above, judgments formed on the basis of interoception in production are not analogous to typical linguistic intuitions, which are responses to externally given language-like stimuli.

Acknowledgments This chapter stems from a brief aside in Maynes and Gross (2013)—the remains of a paragraph excised for reasons of space: my thanks to Jeff Maynes. Material related to this paper was presented at the Johns Hopkins Cognitive Science Brown Bag; the Norwegian Summer Institute on Language and Mind; the Southern Society for Philosophy and Psychology; the Society for Philosophy and Psychology; and the Aarhus Workshop on Linguistic Intuitions, Evidence, and Expertise. For help of various sorts, I thank Sam Schindler, Megan Hyska, Fritz Newmeyer, Sam Featherston, Colin Phillips, Jon Sprouse, Carlos Santana, Lila Gleitman, Mike McCloskey, David Pereplyotchik, Tal Linzen, Nick Allott, and OUP’s anonymous referee (my apologies for omissions). Special thanks to Georges Rey and Bonnie Nozari. Extra special thanks to Michael Devitt for his support and encouragement. The first person to whom I floated some of these ideas, back in 2012, was Akira Omaki, whose enthusiasm and kindness are much missed.

3 A defense of the Voice of Competence Georges Rey

3.1 A “Voice of Competence” Considerable skepticism has mounted in recent philosophy around the reliability of the intuitions about meaning on which many philosophers seem often to rely. Here I want to address what may seem a quite similar issue regarding the role of the intuitive verdicts of speakers on which Chomskyan linguists standardly rely in testing their theories of syntax and phonology. Skepticism surrounds them as well, but I will argue in this chapter that it is less warranted than in the case of semantics. As Quine (1976 [1954]) pointed out, it is difficult to see how to distinguish intuitions about semantic or purely “conceptual” issues from merely deeply held beliefs about a domain: is “Cats are animals” analytic, as Katz (1974) insisted it was, or is it merely a biological hypothesis, disconfirmed if they turn out to be robots, as Putnam (1979) claimed they could do?¹ Syntax and phonology have the advantage of not interacting so closely with a speaker’s worldly beliefs, and so the intuitions about, say, whether a given string is well formed or is pronounced in a certain way are not likely to be confounded with these beliefs. Indeed, syntax seems to me to provide a parade case of the successful appeal to the kinds of special intuitions on which philosophers might aspire to rely for claims about meaning, even if no one is likely soon to be in a similar position to seriously do so. 
What I will defend here is at least a version of what Michael Devitt (2006a, 2006b, 2006c, 2008a, 2014a, 2014b) has called “the Voice of Competence” (VoC) view, according to which the spontaneous intuitions of native speakers of a language provide special evidence of their linguistic competence, in particular of the phonology and syntax of their I-language (3.1).² I will sketch what I submit will be a fairly straightforward perceptual model of parsing, and so of a VoC, and a number of distinctions that its defense requires (3.2). What is crucial to intuitions serving as special evidence is their having a special etiology. Intuitions of linguists who investigate languages that are not native to them likely have a different etiology, and so might not provide any such special evidence. I make no claim that the model is true: I am not a psycholinguist remotely in a position to establish any such truths, which in any case would require much more evidence and clarity about the structure of the mind than I take to be available anywhere. I am only concerned to reply to objections that Devitt (2006b: 118; 2014: 288) raised to a VoC understood as a seriously “respectable model.” However, I will also consider evidence that seems to me to make it an empirically plausible one, more plausible than the alternative one that he proposes (3.3).

¹ For recent discussions relevant to the issue, see the chapters by John Collins, Anna Drożdżowicz, and Carlos Santana in this volume and Rey (forthcoming-b, ch. 10).

² Since one cannot take on all disputes at once, I will assume here that Chomsky is right in thinking that at least his theory is about an underlying linguistic competence, which, he thinks, is explained by a computational system he calls an “I-language,” in opposition to “E-languages” such as English or Mandarin, which are the usual concern of ordinary thought (note that, so understood, an “I-language” is not even the sort of thing one might speak). This is not to belittle the study of these latter E-languages by socio- and historical linguists, as Chomsky might be thought to be doing. More on this in 3.1.2.

Georges Rey, A defense of the Voice of Competence. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Georges Rey. DOI: 10.1093/oso/9780198840558.003.0003

3.1.1 Devitt’s skepticism about (non-)standard models I shall not be defending what Devitt calls the “standard Cartesianism” regarding the representation of rules that serves as his main target.³ Rather I shall limit myself to what he calls a “non-standard Cartesian explanation,” committed merely to the weaker view that some intuitions are the relatively direct causal result of the representation of linguistic entities and properties that are the result of processes of the I-language that are merely “embodied” in the brain one way or the other (see Devitt 2006b: 117). The principles (and/or parameters) might, of course, turn out also to be represented; but this will not be a concern here. Devitt rejects not only a standard Cartesian view, but any non-standard one as well: Any non-standard Cartesian explanation . . . must give the embodied but unrepresented rules a role in linguistic intuitions other than simply producing [presumably external speech] data for central-processor reflection. And it must do so in way that explains the Cartesian view that speakers have privileged access to linguistic facts. It is hard to see what shape such an explanation would take. (Devitt 2006b: 118)

Indeed, he goes so far as to say: “We do not have the beginnings of a positive answer to these questions and it seems unlikely that the future will bring answers” (Devitt 2006b: 118). It is particularly this last claim that seems to me unwarranted, and I will describe in this chapter at least the beginnings of a perfectly naturalistic VoC, as well as provide some serious prima facie evidence for it.

³ “Cartesianism” comes from René Descartes, who claimed we had special knowledge about our own minds and about a priori domains such as mathematics. Devitt (2011b) rejects this view, defending a “Quinean” methodology, which treats a person’s beliefs as being confirmed by experience only as a whole, a methodology I criticize in Rey (forthcoming-a).

3.1.2 I-languages vs. E-languages Devitt’s opposition to a VoC is part and parcel of his non-Chomskyan conception of linguistics as concerned with a largely non-psychological “linguistic reality,” which he defends at length in Devitt 2006b and 2008a. And perhaps if the intuitions on which Chomskyans rely were only about an E-language, then certainly some wariness would be in order: After all, why should speakers have any privileged access to a social conventional system largely external to them? But even if linguistics were concerned with an external “linguistic reality” of the kind we encounter in sociolinguistics and in historical linguistics,⁴ still the question would arise how anyone comes to understand this reality—particularly an infant! Whence come even the linguistic categories? At any rate, I will assume that the issue about intuitions is about the underlying, largely innate system that makes even the perception and production of an external language possible. There is one point on which there is a crucial difference between Devitt’s and a Chomskyan conception of “intuitions”: on Devitt’s view, intuitive verdicts about a string are understood as straightforward claims about the strings, rather in the way in which intuitive verdicts about arithmetic might be, to be assessed merely in terms of their truth. For Chomskyans, the verdicts might indeed be true—a speaker might correctly think that a certain string is ungrammatical in her internal grammar—but what is crucial is not their truth, but their role as evidence of the structure of the I-language. “Unacceptability reactions” need not be in the least self-conscious or metalinguistic in the way they typically are for linguists and their reflective students. It would be enough that speakers simply produce some idiosyncratic reactions to various strings: hesitation, perplexity, or just pupillary dilation would suffice as well. 
Linguistic intuitions are taken seriously only if they are specific sorts of causal manifestations of the I-language, and not, as they could be for Devitt, simply intelligent surmises. For the purposes of this chapter I will largely ignore this difference between truth and mere evidence, simply addressing the claims, challenged by Devitt, that the verdicts could ever be the result of a special route to knowledge, a VoC. That is, I shall be addressing merely the claim: if, as Chomskyans think, intuitive verdicts of native speakers were true and manifestations of an I-language, then that would arguably render them epistemically privileged. Such a claim would be philosophically interesting apart from the issues of a merely Chomskyan linguistics.

⁴ But what about the apparent appeals to E-languages on which Chomskyans appear often to be relying in their talk about what one can and cannot say “in English”? These appeals can be regarded as simply short for “the I-language more or less shared by those who are called ‘English’ speakers.” Nothing depends on their being so identified.

3.1.3 Devitt’s alternative proposal Instead of a VoC, Devitt proposes that speakers’ intuitions are empirical central-processor responses to linguistic phenomena. They have no special authority: although the speaker’s competence gives her ready access to data it does not give her Cartesian access to truths about the data. (Devitt 2006b: 109)

Devitt has this to say about understanding speech: The language module[’s] . . . task of comprehension is . . . to deliver information to the central processor of what is said, information that is the . . . basis for judging what is said, for judging “the message,” . . . not intuitions about the syntactic and semantic properties of expressions. (2006b: 112, emphasis mine)

We will return to this claim shortly. For now, we should observe that there are two competing hypotheses:

(VoC): Spontaneous linguistic intuitions at least sometimes have a special status due to their being caused in a relatively direct way by properties of the grammar (via the parser)

and Devitt’s:⁵

(Dev) (the “modest explanation”): Intuitive judgments about language, like intuitive judgments in general, are empirical theory-laden central processor responses to phenomena, differing from many other such responses only in being fairly immediate and unreflective, based on little if any conscious reasoning. Although a speaker’s competence in a language obviously gives her ready access to the data of that language, the data the intuitions are about, it does not give her ready access to the truth about the data; the competence does not provide the informational content of the intuition. (Devitt 2014: 269, consolidating earlier material he quotes from Devitt 2006b: 103)

⁵ This is Devitt’s (2014) label for this view, to which he refers as ME, which would be misleading here for me to reject.

      

Some qualifications on how to understand these claims are in place:

(i) Devitt does not provide any clear indication of precisely what he means by “the data of language” to which speakers are exposed. Judging from the passage quoted earlier (Devitt 2006b: 112), this would seem to be “the message.” But, of course, he is perfectly well aware of the complexities of distinguishing “semantic” from “pragmatic” elements in any such content, as (for starters) in the case of indexicals and lexical polysemy and ambiguity, some of which he discusses (see Devitt 2006b: §11.8). I shall assume for the sake of argument that what Devitt intends here is some sort of truth-valuable message.

(ii) Devitt’s view of what exactly delivers the message is also unclear. Given his dismissal of the suggestion that the hearer has any special access to metalinguistic descriptions, one wonders whether he thinks the hearer hears only noise: “What a speaker computes are functions turning sounds into messages in comprehension, and messages into sounds in production” (Devitt 2006b: 67). “Sounds” are usually understood at least as noises; but, as Devitt (2014b: 287) well knows, native speakers don’t hear speech in their language as mere noise (indeed, it’s virtually impossible for them to do so). What exactly they hear speech as is a difficult question, but one would have thought that they hear it at least in terms of the phonological categories of language. Devitt prefers not to address this issue, setting aside the question “whether language use involves representation of the phonetic and phonological properties of the sounds” (2006b: 222). He simply assumes that his “reasons for doubting that syntactic and semantic properties are represented would carry over to phonological properties” (2006b: 222–3).
In reply to Devitt, I want to argue that there is a perfectly reasonable, “naturalized” model of a VoC for both phonology and syntax that seems at least implicit in several proposals that have been made and that at least meets Devitt’s skeptical challenge of there being no explanation of a (VoC), nor any forthcoming.

3.2 Grammar, parsing, and perception What I and others have defended elsewhere and I will defend in more detail here is the claim that linguistic intuitions have the same status as standard reports of perceptual experience in vision experiments.⁶ In the case of language, what are produced by perceptual processes are structural descriptions (SDs) of at least some phonological and syntactic properties of various linguistic objects (words and phrases such as NPs, VPs, PPs, CPs, which, for brevity, I call “standard linguistic entities,” or SLEs), and the intuitions are reliable evidence insofar as those descriptions do in fact play a distinctive role in their perception and production.⁷ In discussing linguistic perception per se, a number of issues need to be sorted out.

⁶ See Fodor (1975, 1983) and N. Smith (2004 [1999]: 141–2) for what I take to be earlier statements of the view.

3.2.1 Linguistic perception, phonology, and parsing An issue that Devitt addresses only in passing is the relation between a grammar and a parser. How exactly to draw the distinction is controversial, but I shall adopt the weak view that they are, at least conceptually, different entities.⁸ Thus, according to one recent textbook, The grammar constrains the parser’s structural analyses. However, the grammar does not have preferences about structural ambiguities, nor does it contain information about the resources necessary to process particular sentences. The grammar is part of the hearer’s linguistic competence, while the parser is a component of linguistic performance. (Fernández and Cairns 2011: 214)

Chomsky and Halle (1968: 24–5) proposed that “speakers ‘hear’ the phonetic shape determined by the postulated syntactic structure and the internalized rules”; and Jerry Fodor (1983: 93) claimed that linguistic perception involved an “informationally encapsulated module” that “deliver[s] representations which specify, for example, morphemic constituency, syntactic structure and logical form.”⁹ Here we may assume, at least as a tentative default position, that phonological and syntactic parsing are involved in linguistic perception and that such perception is at least weakly modular: the processes seem fast, spontaneous, mandatory and at least highly resistant to revision by central information, as for example in the case of most visual illusions and at least many of the *-ed strings Chomskyans claim are ungrammatical.¹⁰ No matter how much contextual information is ⁷ There is no need, of course, to claim that all phonological and syntactic properties are represented in perception. There may well be representations of features internal to syntactic and phonological computation that don’t surface in the output of the respective computational system. ⁸ See Devitt (2006b: 32n25, 63, 80, 200n8, 236–40). See also Chomsky (1991: 41) and especially Momma and Phillips (2018), who argue that they are in fact one system, simply accessed in different ways under different time constraints. ⁹ I leave aside the role of input from signing and from the visual reading of orthographic representations. ¹⁰ By “attention” I mean here “top-down” deliberate attention, not automatic bottom-up attention that likely occurs in infants too. I shall also assume what Devitt (2014b: 284) regards as an “online” version of all of this processing, although just which features are salient online may be a matter of the

      

39

supplied, it is virtually impossible for normal English speakers not to balk at the “movement” of Who in (1) and (2) (note that who is allowed in the mere “echo” question so long as it remains in situ): (1) *Who did Mary go with Bill and __ to the movies last week? (cf. Mary went with Bill and who to the movies last week?) (2) *Who did Susan ask why Sam was waiting for __? (cf. Susan asked why Sam was waiting for who?) —or to hear the last himself as referring to the contextually stressed self-centered John in example (3): (3) *John1 was always concerned with himself1. He always talked about himself1, would constantly praise himself1 in public, and earnestly hoped the boss liked himself1. And, again, it is impossible to hear one’s native language, normally pronounced, as mere noise. Thus, just as visual SDs are the output of a visual one, linguistic SDs provide the input to (let us suppose) a central processing system, which then processes them in combination with other representations, for example of the experimental context of an utterance, in order to produce more or less spontaneous verdicts on what has been said (or seen, in the case of vision). Attention can then be drawn to different aspects of the SD, say, by their being “highlighted”—that is, computationally enhanced or sent to special addresses for further processing. And the naïve subject simply responds, in both, with a report caused by these highlighted SDs—for example how the stimulus sounds, what the words were, what phrases were used, what “co-refers” with what, what “the message” expressed was—just as in a vision experiment she might report on either how a stimulus looks or what worldly things she took herself to see. These “intuitive” reports are then evidence for the rules obeyed by the respective linguistic or visual faculty insofar as those rules afford the best explanation of, inter alia, the respective SDs. Consider an utterance of (4) John hopes Bill will help himself. 
—which the auditory parser could represent by an SD that would include, as a specification of its syntax, something like the following:

focus of attention. It will obviate these and other complications that Devitt raises if one simply assumes that “the outputs of a module” are all also inputs to the central processor.


[IP [NP John1] [I’ [I -s] [VP [V hope] [CP [C that] [IP [N Bill2] [I’ [I will] [VP [V help] [NP himself2]]]]]]]]

And there might be similar specifications of its phonology; for example, an utterance of (4) might be represented phonologically as: (4a) /dʒɒn h’ops bɪl wəl help hɪmsεlf/, and there might also be semantic and pragmatic ones. As I noted earlier, the indices on John, Bill, and himself capture the fact that a speaker would spontaneously have the “intuition” that “himself” co-refers with “Bill,” not “John.” The hearer then responds with whatever overt vocabulary is available (e.g. “No, himself can’t be John, but has to be Bill”). Insofar as these responses are to be taken as evidence of linguistic and perceptual processing, both linguistic intuitions and perceptual reports are presumed to be fairly directly caused by representations that are the output of a language faculty. The analogy with vision is close: just as the visual faculty produces structural descriptions (“SDs”) of visual properties (e.g. shape and color, part–whole relations), in the case of language, the faculty produces SDs of phonological, syntactic and proto-semantic properties of various linguistic objects (words, phrases, sentences), and these SDs are in turn the basis for subjects’ intuitive reactions and verdicts.¹¹

¹¹ Devitt (2006b: §7.6) thinks that the analogy with vision is implausible, given that linguistic processing is modular. But, of course, the fact that the constituent structure may be the output of the language module does not in the least entail that the central processor has access to the processes or information inside the module that gave rise to these representations. But it does need to look at the results of the processes, and Fodor reasonably argued that those results include representations of, say, phonetic and syntactic structure. Note, by the way, that, despite Fodor’s passing remark that Devitt notes, a common suggestion about the output of vision is that it does consist of Marr’s 2–1/2D sketch; see Jackendoff (1987) and Pylyshyn (2006: 136).

      


3.2.2 Non-conceptual content: NCSDs What I suspect is at the heart of Devitt’s rejection of (VoC) is what he regards as four further problems that he raises (see Devitt 2014: 285) for a perceptual version of it (renumbered):

A. Ordinary hearers understanding (4) have no conscious awareness of its SD or of any inference from the SD to a translation of (4).
B. Given that it takes a few classes in syntax even to understand an SD, it is hard to see how ordinary hearers could use it as a premise even if they had access to it.
C. Given, as Fodor (1983: 63) also noted, “the relative slowness of paradigmatic central processes,” it is unlikely that such a significant part of understanding as moving from an SD to translation is a central process.
D. How come [a typical speaker] does not have the intuition that, say, in (4), “John” c-commands “himself”? If her competence speaks to her in this way, “how come it says so little?” (see also Devitt 2006b: 101).

Devitt (2006b: 210–20) compares theories of language with theories of skills such as catching a ball or playing a piano, and concludes that the literature on these latter theories “should make us doubt that language use involves representing syntactic and semantic properties; it makes such a view of language use seem too intellectualistic” (2006b: 221). The first two of these objections are easily met by appealing to the “non-conceptual content” whereby a person or animal might well represent, say, the property [square], but without having a representation that expresses the corresponding concept. Christopher Peacocke (1992) discusses a nice example from Ernst Mach (1914) in which an experience of a square can involve precisely the same objective stimuli as an experience of a diamond:

but the experiences will still be different, depending upon how the subject represents it. As Peacocke (1992: 75–7) stresses, a child (or perhaps an animal) could distinguish these two experiences without having the concepts [square] or [diamond]: they may not be able to reason about squares being equal-sided and equal-angled, and so might be regarded as lacking those concepts. Indeed, as Peacocke goes on to note: Intuitively, the difference between perceiving something as a square and perceiving it as a (regular) diamond is in part a difference in the way the symmetries are perceived. When something is perceived as a diamond, the perceived symmetry is about the bisection of its angles; when . . . as a square . . . about the bisection of its sides. (1992: 76)


But, as Peacocke again emphasizes, someone could have these different experiences without having the concept [symmetrical]. It seems pretty clear that Chomskyans are attributing non-conceptual content to the states of the I-language and associated systems such as parsing. There is no reason for anyone to suppose that many metalinguistic concepts, least of all the kind of conscious, technical ones used by linguists, are available for reasoning among speakers generally. Indeed, the relevant SDs issuing from the language faculty to a central processor are likely the results of a modularized processing involving non-conceptual ones (let’s call them “NCSDs”). The technical ones are, of course, what one learns to deploy either in “grammar school” or by taking linguistic classes. All that may be available to the naive hearer is awareness of some or other distinctions that the linguistic NCSDs are marking, just as all that is available to a non-geometer looking at the Mach figures is distinctions marked by non-conceptual visual ones. The NCSDs structure our perception with relations like c-command and inaudible elements such as copies or PRO, without necessarily supplying any sensory material on which one can readily fasten attention. There is, as I like to put it, more to conscious “phenomenology” than mere “phenomenal features.” As for (C), the issue of speed: as I have indicated, much has yet to be understood about the (non-conceptual) character of the SDs of the (proto)-semantic content of I-language expressions; but it is hard to see why there should be in principle any difficulty presented by the speed of the mapping from those SDs to concepts. And, with regard to (D), it is a matter of subtle empirical fact just what information NCSDs supply, but it would not be surprising that they might supply NCSDs of syntactic categories occurring in a tree, but not of all relations in the tree between categories. 
On the other hand, the fact that speakers seem to automatically respect the reference of “himself” in (3) and (4) above, or observe the rules on negative polarity items in rejecting *I will ever see him again does suggest that relations such as c-command might well be non-conceptually available in their auditory-parsing experience. (Devitt really shouldn’t assume that c-command is quite as unavailable, non-conceptually, to ordinary speakers as he supposes.)

3.2.3 Having vs. representing linguistic properties Devitt (2006b) initially seems to regard it as “uncontroversial” that “parsing assigns a syntactic analysis to the sentence” (235; see also 32n25), and allows that both the language and visual processors operate on SDs, but that these SDs are not what is sent to the central processor that is responsible for linguistic intuitions (Devitt 2014: 281–2). He claims that what is delivered to the central processor is something that has linguistic properties but does not specify or represent them. Referring to an utterance of our sentence

      


(4) John hopes Bill will help himself. and its SDs that we discussed in 3.2.1, he claims: As a result of what the system . . . delivers to the central processor, the mental representation involved in the hearer’s final understanding of the utterance will have something like those syntactic and semantic properties, and be a rough translation of (4). But this is not to say that it has those properties because the system delivers an SD that describes those properties. (Devitt 2014: 284, example renumbered; see also 287)

That is, for Devitt, it seems the natural language sentences themselves are actually entokened at the interface between the language module and the central processor. There are quite a number of problems with this proposal. For starters, from the fact that some item has a certain property, it of course doesn’t follow that anything treats it as having it. In general, neural states have multitudes of physio-chemical properties—not to mention a potential infinitude of relational ones (e.g. being n miles from Paris)—but these properties are not thereby incorporated into a mind and the states are not treated as having those properties by any system in the mind, central or otherwise. The states are certainly not perceived as having those properties. For that to happen, one would think the properties had better be represented, that is, made available to presumably computational processes of, say, early vision, recognition, comparison, and memory. Indeed—and this is a point that I think cannot be overemphasized—there is a general explanatory problem in psychology of accounting for how any creature in a universe of “local” physical interactions can in general be sensitive to non-local, relational, or non-physical (what I call “abstruse”) properties. A computational–representational theory provides a strategy. It is not logically necessary; other strategies can sometimes work. There could be “surrogate” local properties in the way a person may be identified by a fingerprint, or the Big Bang by local white noise. But there is a burden on the defender of such a claim to at least suggest where such a surrogate is plausibly to be found, and how exactly it would work—and not be representational. A quite natural computational strategy suggested by the work of many writers both in visual and in linguistic perception is some sort of probabilistic (e.g.
Bayesian) one, where something’s being detected as a certain item is a function of the prior probabilities and likelihoods of its being so (cf. Chomsky and Halle 1968: 24; Lidz and Gagliardi 2015). All that is important for purposes here is to note that, if this is the strategy, then it perforce deals in representations, namely the representations of the categories to which probabilities are attached.¹² ¹² In Rey (2003 and forthcoming-b), I argue that this is the essential problem with the proposal, in Chomsky (2000) and others, of identifying SLEs with their neural representations. Note that the point doesn’t depend upon settling the further issue of whether the representations are “classical” data


Perhaps what Devitt has in mind at this point is another of his (quite) tentative hypotheses, that there is a “language of thought” (“Mentalese”), and that there is at least “a great deal of similarity between the syntactic rules of Mentalese and the speaker’s public language” (2006b: 149). However, even if we do think in our natural language, we’re owed some story about how the central processor could recognize, say, a word, a noun, verb, NP, VP, IP, so as to summon up the “same sentence” in Mentalese. Devitt (2006b: 225) acknowledges that, in view of the relational natures of linguistic properties, “recognizing a word as having [a syntactic] property may not always be an easy matter,” curiously adding, however, that “it mostly is,” referring us to his treatment of the issue in a previous section (§10.6). But there he merely observes that it’s quite easy to tell many English adverbs by the fact that they end in “-ly” (2006b: 185), adding: It can also be easy to tell that an object has a certain relational property if learning to identify the object involves learning to identify it as having that property . . . identification comes with word recognition. One way or another, it is quite easy to tell the explicit structural properties of utterances, although sometimes hard to tell the implicit ones. Devitt (2006b: 185–6, emphasis mine)

“Hard to tell” is a staggering understatement.¹³ Most linguistic properties cannot be locally identified, but are enormously complex relational ones involving abstract issues about their role in the grammar as a whole. Whether certain material is the main verb or a nested one, or an NPI (negative polarity item) in the correct (e.g. c-command) relation to a licensor, or just whether “well” is an adverb, adjective, interjection, noun or verb—these are not facts that are easily read off from anything but fairly complex computations on—what? It is hard to imagine any plausible candidate other than representations of the syntactic properties of an utterance to which probabilities might be assigned by probabilistic computations.¹⁴

structures or “distributed representations” of neural nets. The issue is representation, not the character of the computations upon them. ¹³ Even his example of adverbs doesn’t work: in English the -ly suffix is a fairly good indicator of adverbs of manner (“how”), but not for instance of adverbs of time (“when” –yesterday) or location (“where” –upstairs). Moreover, plenty of words ending in -ly are not adverbs, e.g. fly, belly, sly, ply. ¹⁴ Curiously, Devitt (2006b) cites a number of authors who he seems to think are explaining how linguistic properties could play a role in the mind without being represented. But the very passages he quotes from assume for instance “that the speaker wants to express the event ‘The ferocious dog bites the boy,’ the stored meaning-based representations for ‘ferocious,’ ‘dog,’ ‘to bite,’ and ‘boy’ are retrieved” (Vigliocco and Vinson 2003: 183, emphasis mine).

      


Lastly, at this point we can no longer set aside Devitt’s (2006b: 222–3) odd disregard for phonological and phonetic properties. Surely these—[+sonorant], [+nasal]—are not entokened anywhere in the brain, a module, the central processor, or otherwise! But then how are they perceived or comprehended by the central processor without being represented? And if they need to be represented, why shouldn’t other linguistic properties be as well, their representations attached to the phonological ones? Indeed, how could the whole speech episode be integrated into a single perceptual object if they were not? Speaking at least phenomenologically, one seems to hear a single utterance as having at least some phonological, morphological, syntactic, and (proto-)semantic properties.

3.2.4 How NCSDs would help In his final “serious” objection to (VoC), Devitt (2014b) argues that, even if (NC)SDs were available to the central processor, they still would not sustain a VoC. He considers the two possibilities of dealing with an ungrammatical string: (i) that the system provides to the central processor an SD of such a string, or (ii) that it doesn’t—it gags or “crashes.” He continues: If (i), then that SD would not directly cause intuitions of ungrammaticality. For, that SD does not come with a sign saying “ungrammatical.” To judge that the SD is of an ungrammatical string, the subject would have to apply her theoretical knowledge of the language to the SD. That’s [(Dev)], not VoC. If (ii), then information provided by SDs would have nothing to do with a subject’s grammaticality intuitions. Rather, the presence or absence of the SD would be the data for the central processor’s response. So, not VoC again. (Devitt 2014b: 287, substituting “(Dev)” for “(ME)” in the original)

But all this misconstrues the VoC proposal. Of course, it is no more likely that the parser adjoins “ungrammatical” to a NCSD of an ungrammatical string than that the visual system adjoins “impossible” to NCSDs of impossible visual figures (such as the Penrose triangle). It is enough that the perceiver in one way or another detects a difficulty in dealing with the material, which, of course, it can perceive phonologically and in some phrasal parts, even if not as a whole sentence. In any case, if the NCSDs were delivered to the central system by the parser, then there would be a simple answer to his question about how they would “fairly directly cause” VoC intuitions: they would do so by serving as the representations on which intuitive judgments are causally, computationally, and evidentially based, much as NCSDs of, say, occlusion relations and axes of symmetry provide the basis for reports of how things look; or, to take an example of a different sort, as efferent copies of motor commands provide special intuitive knowledge of one’s


voluntary movements. Our various perceptual systems are barraging us with detailed, often non-conceptual information that either directly causes or constitutes a premise in inferences about how things look, feel, smell, and sound, often in idiosyncratic non-conceptual terms. Why shouldn’t they provide us with information about linguistic properties in similar ways? In any case, if linguistic perception involves parsing and parsing is heavily constrained by the I-grammar, then, pace Devitt (2006b: 118; 2014: 288), we seem to have a perfectly respectable model of a VoC. But, of course, merely that an explanation is intelligible is perhaps of no great moment if there is no evidence that would motivate taking it seriously; and it is to providing that evidence and evaluating that abduction that I now turn.

3.3 The evidence Incorporating some of the distinctions we have discussed, the two rival explanations might be summarized as follows (where “=>” can be read as “eventuates in”):

(VoC): audition => represented parsing of input => NCSDs, at least some of which => Central Processing => Intuitive Reports
(Dev): audition => instantiated parsing of input => “the message” => Central Processing => Intuitive Reports

Note, again, that, on both views, central processing may involve almost any information available to the speaker from whatever source. (VoC) differs from (Dev) only in allowing that some of that information consists in NCSDs. I want to stress from the outset that I do not take any of the following evidence as apodeictic. Evidence for perceptual processing in general is difficult to obtain (e.g. controlling for all manner of central guessing), and psycholinguistics shares in these difficulties. But I take the following evidence to provide at least a strong prima facie case for (VoC). Perhaps no single bit of it is conclusive, but it is hard to resist its cumulative effect, which Devitt would have to explain away in order to sustain his (Dev). By way of understanding what I take to be the significance of these experiments, I shall be assuming that, if some conscious (i.e. introspectible) perceptual task is sensitive to certain phenomena, that is a prima facie reason to suppose that those phenomena are in some way (e.g. at least non-conceptually) available for intuitive verdicts. Perhaps this is a mistaken assumption. But if it is, it seems to me that the burden is on someone who thinks otherwise to provide a model of the experimental results.

      


3.3.1 Involuntary phonology There’s an important point that I passed over earlier, when I allowed Devitt latitude in discussing noises or phonology that cannot be stressed enough. The point is that we cannot help but hear speech in our native tongues as language; indeed, it is virtually impossible to hear it as mere noise. Devitt (2014b) acknowledges this fact but, again, thinks that his distinction between having and representing linguistic properties will suffice. In understanding [a sentence], we hear it as having those linguistic features and not others in that, as a result of all the processing in the language system, we come up with a representation that has those features, and not others (2014: 287). But, to repeat the point at 3.2.3, to be part of our experience, it is not enough for the SD-output to have linguistic properties; there has to be some sort of further incorporation of it into our mental life, and, given the abstrusity of the properties, it is hard to think what would work besides representation. In any case, surely nothing in the brain has phonetic or phonological properties!

3.3.2 “Meaningless” syntax One of the most famous examples Chomsky (1957) produced for both the interest and the relative autonomy of syntax is his (5) Colorless green ideas sleep furiously, which any native speaker can phonologically and syntactically parse, even though it has no readily intelligible, literal “message.” But one doesn’t have to make up such examples. There is plenty of technical language in any specialized area that speakers can parse and “hear as” English, without any clear idea of what messages are being conveyed (and, as a philosopher like Devitt knows only too well, philosophy has provided more than its share of such prose; take for instance Kant, Hegel, Derrida, Barthes). Or consider “nonsense” verse such as Lewis Carroll’s “Jabberwocky,” which any English speaker—including lots of children—could readily “parse” without having a clue as to its “message.”

3.3.3 Syntax trumping “the message”: Garden paths, structural priming, and “slips of the ear” Further evidence for the perceptual reality of at least some syntactic categories independently of semantic ones is supplied by the “garden-path” phenomena


mentioned earlier. Thus, naïve subjects hearing or reading “The horse raced past the barn fell” are initially confused, presumably because the parser too quickly represents the main verb as being “raced” and has to recompute after it encounters “fell” at the end. This initial, unsuccessful parse determines how the hearer hears the words, and why she then has trouble understanding the message (see Fernández and Cairns 2011: 211ff. for discussion).¹⁵ Of course, garden-path examples could be explained equally as features of merely how the perceiver of a sentence (at least initially) thinks its message: she begins by thinking that (some) horse raced (somewhere), and has to rethink this message when she perceives “fell.” However, a number of experiments have shown that syntax can sometimes take precedence over the message. For example, Ferreira et al. (2001) found that, when the garden-path parsing conflicts with the plausibility of a sentence’s message, some hearers will still take the garden path. Thus, given (6) While Mary bathed the baby played in the crib, and asked, Did Mary bathe the baby?, some subjects will say that she did: apparently, they still treat the baby as the direct object of bathed, even after they have recognized the correct parse that excludes this reading. Similar results have been obtained with regard to co-reference (see Cowart and Cairns 1987; Fernández and Cairns 2011: 224; Garnsey et al. 1989) and “structural priming” (Bock 1986). In a lengthy review of these and related results, Pickering and Ferreira (2008: 431) conclude that “priming appears to cut across meaning distinctions,” indeed, that the results provide compelling evidence for the view of “autonomous syntax,” which regards syntactic knowledge as independent of other forms of knowledge, such as the specific features of meaning or the perceptual properties of utterances.
They go on to stress that these other features can also have priming effects: sometimes the semantics does, indeed, trump the syntax; but it’s enough for their and my “autonomy” claim that at least sometimes it does not. Similarly, in a review of “slips of the ear” (e.g. hearing “I seem to be thirsty” as “I sing through my green Thursday”), Bond (2008: 307) observes that “there are numerous mis-perceptions which involve radical changes in phonology and syntax, completely lacking in semantic appropriateness.” Sometimes people actually (mis-)hear words and syntax having no relation to obviously intended “messages”!

¹⁵ Regarding garden paths, Devitt himself has a curious response: “these phenomena are examples of language usage; they are not intuitions about the linguistic properties of the expressions that result from that usage” (2006b: 113). It is difficult to see why the Chomskyan is not perfectly entitled to infer from such data facts about the parsing and representation of syntactic properties.

      


3.4 Conclusion It certainly appears that the evidence seriously favors (VoC) over (Dev), at least for some standard phonological and syntactic properties: speakers can be intensely sensitive to phonological and syntactic properties independently of the “message” that might be conveyed by perceived speech—or might not be, in the case of technical prose and syntactically well-formed, semantic nonsense! It is hard to see how this apparently perceptual sensitivity could be explained other than by presuming that many standard phonological and syntactic properties are perceptually represented, at least in the form of NCSDs. (Dev), by contrast, is committed to the idea of speakers somehow generalizing to the English from noises to—perhaps—merely unrepresented phonology they have heard and produced, on the analogy of generalizations they have made about their own and others’ swimming, bicycling, and touch-typing. Again, I want to stress that there is no need to advance my specific version of (VoC) as one that has been conclusively confirmed. It is enough to show that, pace Devitt (2006b: 118, 2014: 288), it is scientifically reasonable, and indeed it could account for some of the known phenomena at least as well as, if not a lot better than, (Dev) and other accounts that try not to appeal to a privileged epistemology. Perhaps some of the phenomena by themselves might be susceptible to a non-(VoC) explanation, but it is hard not to be impressed by the trend of the whole of it.

Note This chapter was the basis of my talk at the Aarhus conference. It is a considerable expansion of Rey 2006 and 2014a, and a very much abbreviated version of chapter 7 in Rey forthcoming-b, which readers should consult for more details than it is possible to include here. I am indebted to the audiences at that conference, especially to Michael Devitt for many discussions of his views.

4 Linguistic intuitions again: A response to Gross and Rey
Michael Devitt

 : VC  ME 4.1 Introduction Linguistics takes the intuitions that people have about the syntactic and semantic properties of their language as good evidence for a theory of that language. Why are these intuitions good evidence? In my book Ignorance of Language (2006b: ch. 7; see also 2006c), I rejected the received Chomskian answer, which I somewhat playfully called “Voice of Competence” (VoC), and gave an answer of my own, what Mark Textor (2009) has aptly named “the modest explanation” (ME). This has generated a lively debate.¹ The papers defending VoC by Steven Gross and Georges Rey in the present volume are the latest contributions to this debate. I present my view of VoC and ME in Part I and respond to Gross and Rey in Part II of this chapter.

4.2 Voice of Competence What is VoC? Consider the intuitive judgments that (1) John seems to Bill to want to help himself

¹ Collins (2006), Matthews (2006), Miščević (2006), Rattan (2006), Rey (2006), and Smith (2006), all responded to in Devitt (2006a); Pietroski (2008), responded to in Devitt (2008b); Textor (2009), responded to in Devitt (2010b); Culbertson and Gross (2009), which led to the exchange, Devitt (2010a), Gross and Culbertson (2011); Fitzgerald (2010), responded to in Devitt (2010a); Ludlow (2011) and Rey (2014b), responded to in Devitt (2014b); an exchange, Miščević (2009, 2012), Jutronić (2012, 2014), Devitt (2014a), Miščević (2014a, 2014b), Devitt (2018), Jutronić (2018), Miščević (2018).

Michael Devitt, Linguistic intuitions again: A response to Gross and Rey. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Michael Devitt. DOI: 10.1093/oso/9780198840558.003.0004


is grammatical, and that in it “himself” co-refers with “John.” VoC is the view that intuitions like these are the product of a linguistic competence residing in a subcentral module of the mind. I describe VoC as the view that linguistic competence alone provides information about the linguistic facts . . . . So these judgments are not arrived at by the sort of empirical investigation that judgments about the world usually require. Rather, a speaker has a privileged access to facts about the language, facts captured by the intuitions, simply in virtue of being competent. (Devitt 2006b: 96)²

On this view, competence not only plays the dominant role in linguistic usage, it also provides metalinguistic intuitions. Those intuitions are indeed, “noise” aside, the Voice of Competence. That is why they are reliable.³ I argued that VoC was wrong (Devitt 2006b, 2006c, 2010a, 2014b). I shall summarize my objections in section 4.4 here. But the best reason for rejecting VoC is ME. First, two clarifications. (a) Intuitions are propositional attitudes or thoughts in the central processor (CP) with propositional contents expressible by sentences; thus the content of our first intuition about (1) can be expressed by “(1) is grammatical.” When I say that, according to VoC, linguistic competence provides “propositional knowledge” (2006b: 4), “information” (2006b: 96), and “informational content” (2010a: 834), I am referring to such propositional contents. VoC may allow some role to the CP in forming the intuition—perhaps some editing, correcting, or inferring—but it is essential to VoC, as defined, that any such CP processes should start from a metalinguistic propositional content provided by competence. It is important to keep this in mind in considering what Rey presents as a defense of VoC. (b) I have given the intuition that (1) is “grammatical” as an example. Linguists would now mostly prefer to talk of the intuition that (1) is “acceptable,” which is elicited by asking whether (1) is “ok,” “sounds good,” and the like. The significance of this preference is a tricky matter that I have addressed elsewhere (Devitt 2006b, 102; 2010a, 839–44). In brief, I have argued that the content of such a question is context-relative and that a speaker hearing the question from a linguist is likely to take it to be about what is grammatical rather than about what is polite, acceptable in church, and the like. So it is appropriate to treat the intuition as

² According to VoC, the “privileged access” is to linguistic facts but not, note, to the alleged fact that competence provides this linguistic information; see Gross (this volume, p. 25). ³ I cited what I took to be overwhelming evidence that VoC is the received Chomskian view (see Devitt 2006b: 96–7). So I was surprised that the attribution was rejected by some (see Collins 2008b: 17–19; Fitzgerald 2010; Ludlow 2011: 69–71). For responses, see Devitt (2010a: 845–7; 2014b: 274–8). Ludlow’s discussion is notable for its egregious misrepresentation of the evidence. I have also provided more evidence (Devitt 2014b: 273). I still think that the evidence for the attribution is overwhelming.

  


being about grammaticality. In any case, such an intuition is not provided by linguistic competence any more than the other metalinguistic intuitions.

4.3 Modest explanation If VoC is not the right theory of intuitions, what is? I argue that intuitive judgments about language, like intuitive judgments in general, “are empirical theory-laden central-processor responses to phenomena, differing from many other such responses only in being fairly immediate and unreflective, based on little if any conscious reasoning” (Devitt 2006b: 103). Although a speaker’s competence in a language obviously gives her CP ready access to the data of that language, the data that the intuitions are about,⁴ it does not give her CP ready access to the truth about the data; the competence does not provide the propositional content of the intuition. Textor (2009) rightly called this explanation “modest,” for it takes metalinguistic intuitions to be just like intuitions in general, particularly those about the outputs of other human competences; examples include chess, typing, and thinking (see Devitt 2006b: 106–8). So ME makes do with the sorts of cognitive states and processes, whatever they may be, that we have to posit anyway to explain intuitions in general. In light of this, I might have just left ME at that, but I felt the need to say more, at least to contrast ME with VoC. So I did say more about the etiology of the metalinguistic intuitions, speculating on the basis of what I thought we knew about intuitions in general (Devitt 2006b). I have revised and developed these speculations since Ignorance (see Devitt 2006a, 2010b, 2014a), in responses to Textor (2009) and Nenad Miščević (2006, 2009).⁵ And I continue revising and developing them here. This is a tricky empirical issue because, as Rey is fond of saying, “no one yet has an adequate theory of our knowledge of much of anything” (Rey 1998: 29). I emphasize that a favorable comparison of ME to VoC should not depend on my speculations being dead right. 
First, consider “theory-laden.” Intuitions are theory-laden in the way in which perceptual judgments are commonly thought to be; indeed, some of them are perceptual judgments (Devitt 2006b: 103).⁶ The anti-positivist revolution in the philosophy of science drew our attention to the way in which even the most straightforward judgments arising from observation depend on a background.

⁴ Rey responds: “Devitt does not provide any clear indication of precisely what he means by ‘the data of language’ to which speakers are exposed” (this volume, p. 37). This is odd: see Devitt (2006b: 98–9, 106–9); also “ ‘data’ is to be understood on the model of ‘primary linguistic data’: the data are linguistic expressions (and the experiences of using them)” (Devitt 2010b: 254; also 2010a: 835 n. 4).
⁵ I am indebted to Dunja Jutronić for a series of comments and questions that had a major role in prompting this development.
⁶ And I argue that perceptual judgments should be thought of as intuitions (Devitt 2015: 39–42).


We would not make those judgments if we did not hold certain beliefs or theories, some involving the concepts deployed in the judgments. We would not make the judgments if we did not have certain predispositions, some innate but many acquired in training, to respond selectively to experiences.⁷ Next, consider a native speaker of a language asked to make a syntactic judgment about a string of words in that language. According to ME, she might respond as follows. She starts by trying to understand the string: she deliberately goes through a process of understanding of the sort that she goes through “automatically” when presented with a string in normal conversation. This test is a straightforward exercise of her linguistic competence (along with some “pragmatic” competencies). She will then go in for some quick CP reflection upon this experience, deploying her concept of grammaticality or whatever from folk linguistics, to form a judgment. The judgment itself is propositional, of course, but the datum for the judgment is not. The datum is the experience of trying to understand the string, which is no more propositional than is an experience of actually producing or understanding a string in normal language use (Devitt 2006b: 109–11). So competence supplies the datum for the intuition, the CP provides the intuition (Devitt 2006a: 594). I say she “might” respond in this way because, although this understanding test is a likely response to a difficult case, it is not to an easy case. Consider the strings “responded the quickly speaker” and “the speaker responded quickly,” for example. The speaker is likely to recognize immediately, without reflection on a deliberate attempt at understanding, that the former word salad is ungrammatical and the latter simple sentence is grammatical. 
If so, her intuition is, in this respect, analogous to some other ones I have mentioned in the past: a paleontologist responding to a bit of white stone sticking through grey rock with “a pig’s jawbone”; art experts correctly judging an allegedly sixth-century Greek marble statue to be a fake; the tennis coach Vic Braden correctly judging a serve to be a fault before the ball hits the ground (Devitt 2006b: 104). Just as the paleontologist, the art expert, and Braden immediately recognize the relevant property in these cases, so too does the speaker in easy linguistic cases. There is no need for her to perform the understanding test (Devitt 2010b: 254–5). What I am emphasizing here is that a person’s linguistic intuitions may be immediate perceptual judgments, just like these others. Years of experience and education have made the paleontologist, the art expert, and Braden quick to recognize when to deploy their concepts of pig’s jawbone, fake, and fault, respectively; these perceptions are theory-laden but immediate.

⁷ So “theory” in “theory-laden” has to be construed very broadly to cover not just theories proper but also these dispositions. For further discussion, see Devitt (2011c: 19).

Similarly, a speaker’s years of experience and education, though less substantial, are likely to make her quick to recognize when to deploy her syntactic concepts in easy cases; those metalinguistic perceptions are also theory-laden but immediate. Doubtless the speaker’s past experiences included understanding tests in which, as noted, her linguistic competence plays a role. (In principle, a field linguist could come to have good intuitions about a language in which she is not competent; but perhaps this is not likely in practice.) So that competence helped to bring about her current capacity to immediately recognize the grammatical features of simple cases. The following example, popular in discussions of linguistic intuitions, both exemplifies such immediate perceptual judgments and shows that they can be wrong:

(2) Many more people have been to France than I have.

When a competent speaker is presented with this she is likely to judge immediately that it is grammatical. Yet it is not, as will become apparent to her as soon as she runs an understanding test: this string of words simply makes no sense. Perceptions of strings, with or without an understanding test, can yield theory-laden intuitions about the properties of the strings in just the same way in which perceptions of a white stone or a marble statue can yield theory-laden intuitions about the properties of those entities. And, the greater the expertise, the more theory-laden the intuitions. Finally, I emphasize two crucial respects in which ME differs from VoC. First, unlike VoC, ME is not committed to novel sorts of cognitive states and processes but simply to those we must be committed to anyway if we want to explain intuitions in general. Second, according to VoC, linguistic competence provides the metalinguistic propositional content of intuitions (perhaps after some editing, correction, or inference). ME denies this. Relatedly, “the grammatical . . . notions that feature in these judgments are not supplied by the competence but by the central processor as a result of thought about language” (Devitt 2006b: 110–11). 
But, note, ME does not deny that competence plays a role in the etiology of an intuition; see its role in an understanding test.

4.4 The rejection of VoC

So why should we prefer ME to VoC? What’s wrong with VoC? I have recently summed up my former criticisms of VoC (Devitt 2006b, 2006c, 2010a, 2014b) as follows:

The main problems with it are, first, that, to my knowledge, it has never been stated in the sort of detail that could make it a real theory of the source of intuitions. Just how do the allegedly embodied principles yield the intuitions? We need more than a hand wave in answer. Second, again to my knowledge, no argument has ever been given for VoC until Georges Rey’s [2014b] recent attempt which, I argue [Devitt 2014b], fails. Third, given what else we know about the mind, it is unlikely that VoC could be developed into a theory that we would have good reason to believe. (Devitt 2015: 37)

In brief, VoC needs details and evidence before we should take it seriously. I have also drawn attention to some other implausibilities of VoC (Devitt 2006b, 2006c, 2010a, 2014b)—in brief: (i) If competence really spoke to us, why would it not use the language of the embodied theory and why would it say so little? (ii) There would be a disanalogy between the intuitions provided by the language faculty and by perceptual modules. (iii) Developmental evidence suggests that the ability to speak a language and the ability to have intuitions about the language are quite distinct, the former being acquired in early childhood, the latter, in middle childhood as part of a general cognitive development. (Devitt 2015: 37)

An argument for VoC should confront these implausibilities. An Occamist consideration counts heavily against VoC. ME seems like a good explanation and has not been shown to be otherwise. So there is no explanatory need to posit the special states and processes required by the immodest VoC. If ME is right and VoC wrong, then there are serious methodological consequences. Thomas Wasow and Jennifer Arnold, in a rather damning criticism of the methodology of generative grammarians, rightly claim that “intuitions have been tacitly granted a privileged position in generative grammar” (Wasow and Arnold 2005: 1482). Furthermore, they claim that “usage data gets almost no attention from generativists” (1486).⁸ They note also: “For reasons that have never been made explicit, many generative grammarians appear to regard primary intuitions as more direct evidence of linguistic competence than other types of data. But there is no basis for this belief” (1484). As I have indicated here (§4.2), generativists have this view of intuitions because they embrace VoC. And, although sometimes they make this explicit (I cite evidence of this in Devitt 2014b: 272), it is true that mostly they do not. Rather, they “seem to just presuppose VoC without even stating it explicitly” (273). So the first methodological consequence of the truth of ME and falsity of VoC is that the evidential focus in linguistics should move away from the indirect evidence provided by intuitions to the more direct evidence provided by usage, by the processes of linguistic production and understanding. The second consequence is that, insofar as the evidence of intuitions is sought, there will seldom be good reason for preferring those of folk over those of experts about language. And the third is

that we should use intuitions only where we have some ground to believe that they are reliable.⁹ In light of this, it seems to me, ME is very plausible, whether or not my speculations on its details are right. Against the background of this plausibility I respond to Gross’ “Linguistic intuitions: Error signals and the Voice of Competence” (chapter 2 in this volume) and to Rey’s “A defense of the Voice of Competence” (chapter 3 in this volume). I must perforce be briefer in my response than I would like and than their papers deserve.

⁸ I think this exaggerates a bit (Devitt 2006b: 98–100).

 : G, R,     VC 4.5 Background I have just pointed out that VoC, although apparently the received view among linguists and philosophers of linguistics, was left largely unsupported until recently. Then Rey (2014b), to his credit, argued for it. He is doing so again here. Gross also argues for it here, to his credit. But I don’t think that their arguments succeed. Rey’s defense of VoC stemmed from the widespread view that the non-central language system for language processing generates “structural descriptions” (SDs)—that is, metalinguistic representations of the syntactic and semantic properties of the expressions being processed. Rey’s VoC proposes that the CP has access to these SDs. On the basis of the information they provide, the CP forms the speaker’s intuitive judgments. In response to this proposal, I made “two serious objections”: (I) Why should we suppose that the language system, in processing (1) [“John seems to Bill to want to help himself”], makes [an] SD of (1) available to the central processor? (II) Even if it did, how would the SD’s information “fairly directly cause” the intuitions that are the concern of VoC? (Devitt 2014b: 283)

Concerning objection (I), I claimed that Rey cites “no evidence that the non-central language system provides SDs to the CP” (287). But suppose that we did get some evidence. The proposal would still face objection (II). I argued that “those SDs would mostly not provide the informational content of speakers’ intuitions. So Rey has not provided a respectable model of VoC” (288).

⁹ The methodological situation is worse in the philosophy of language. Thus, in the theory of reference, philosophers seem to rely on nothing but intuitions. According to Michael McKinsey, most philosophers of language think that such intuitions are a priori (McKinsey 1987: 1). However, some philosophers may implicitly embrace the more respectable VoC. On this, see Stich (1996) and Devitt (2012). There has recently been a move toward testing theories of reference against usage (see Domaneschi et al. 2017 and Devitt and Porot 2018).


An analogy with vision was central to Rey’s argument for VoC. He takes the non-central vision system to deliver descriptions to the CP that are analogous to SDs and that provide the content of visual intuitions. I criticized a version of this analogy in Ignorance (Devitt 2006b: 112–14). This led to an exchange (Rey 2006: 563–7; Devitt 2006a: 596 n. 25; Rey 2014b: 253–4; Devitt 2014b: 281–2; see also Devitt 2010a: 850–2, 854). I am not satisfied with my response. I still think that comparison with the vision system—and, I might have added, with the audition system—does not support the idea that the language system provides SDs to the CP; indeed, the comparison supports the idea that it does not. But I should have emphasized that syntactic intuitions are examples of, rather than being analogous to, perceptual intuitions; see section 4.3 here.¹⁰ They are theory-laden perceptual judgments, reflecting past CP conceptualizing of experiences, just as are the intuitions of the paleontologist, the art expert, and Braden. The immediate causally relevant background for these linguistic perceptual judgments is thought about the language, not competence in the language (although, of course, the competence will have provided data for those thoughts). Rey makes many mentions of the vision analogy in his present discussion. I don’t find anything there to change my negative view of the analogy and will say no more about it. I will attend instead to what Rey sees as advancing his case for VoC. Gross’ case for VoC focuses on the role of “error signals.” My 2014 response to Rey (Devitt 2014b) yields a response to some of Gross’ argument.

4.6 Intuitive linguistic usage versus intuitive metalinguistic judgment (= linguistic intuition)

A certain distinction is crucial to identifying clearly the disagreement over VoC. The distinction is between linguistic intuitions and the processes of language use. The intuitions are judgments about linguistic expressions. These are quite different from the behaviors of producing and understanding those expressions. Those speedy, largely automatic behaviors¹¹ might well be regarded as intuitive, but they are not judgments and hence not intuitions. The distinction is crucial for two reasons. First, because it is the very nature of linguistic competence to play a direct role in causing those behaviors, whereas competence’s causal role with intuitions is precisely what is at issue in the challenge to VoC.

¹⁰ And I should have made much less of the analogy between visual intuitions about what is seen and linguistic intuitions about what is said; but that is another story.
¹¹ It is a mark of a skill that it is “automatic” in this way: it can be performed while attention is elsewhere (Anderson 1980: 230–5; Reisberg 1999: 460). So I see the “automaticity” of linguistic usage as part of the evidence that linguistic competence is simply a skill (Devitt 2006b: 210).

For that issue is whether competence provides the propositional content of the intuitions as well as competence’s obvious role in language use. Second, whereas the evidential status of the intuitions is controversial, the behaviors indubitably provide evidence about the language.¹² These behaviors are the “evidence from usage” that I emphasize in presenting ME (Devitt 2014b: 271–2). In light of this, consider Rey’s claim that

there is a crucial difference between Devitt’s and a Chomskyan conception of “intuitions”: on Devitt’s view, intuitive verdicts about a string are understood as straightforward claims about the strings . . . For Chomskyans . . . “[u]nacceptability reactions” need not be in the least self-conscious or metalinguistic in the way they typically are for linguists and their reflective students. It would be enough that speakers simply produce some idiosyncratic reactions to various strings: hesitation, perplexity, or just pupillary dilation would suffice as well. (Rey, this volume, p. 35)

Now I’m not sure what Rey sees as the “crucial difference,” but let me emphasize what it is not. I very much agree with Rey that both sorts of “verdicts” or “reactions” are, and should be, used as evidence. But hesitation and the like provide evidence from usage, not from intuitive judgments, and hence are not relevant to the disagreement over VoC. Next, consider the following from Gross, repeating a line in Maynes and Gross (2013): Mentalists need not commit themselves to the view that the language faculty itself outputs a state with the content That string is unacceptable. It can suffice that the parser fails to assign a structural description to the string and that the absence of a parse can in turn play a causal role in the process that leads the speaker to judge that the string is unacceptable. (Gross, this volume, p. 17)

(i) Right: mentalists need not commit to VoC. But, as a matter of fact, they seem to (see §4.2). And I suggest that they are encouraged to do so by their mentalism, by what I call their “psychological conception” of grammars (Devitt 2014b: 273–4). (ii) The absence of a parse may indeed play a causal role in the process that leads to the intuition, but that is accommodated by ME; see the role of an “understanding test” in section 4.3 here. Hence no disagreement.¹³

¹² Textor (2009) hankers after non-judgmental “linguistic seemings” that fall somewhere “in between” processing behavior and metalinguistic judgments and yet have epistemic authority; see also Fitzgerald (2010: 138); Smith (2014). I argue that there are no such seemings (Devitt 2010b). ¹³ Gross (this volume, p. 17) thinks that there is a disagreement because in his view Culbertson and Gross (2009) have shown that linguistic intuitions are not theory-laden in the way ME requires. I don’t think that they have shown this (see Devitt 2010a); Gross and Culbertson (2011) is a response.


Gross (this volume, p. 19) cites evidence that there is a monitoring mechanism that generates an “error signal” when presented with an ungrammatical string. He suggests that this “may play a causal role in the generation of a linguistic intuition” (this volume, p. 22). Furthermore, the signal “may have associated phenomenology: a felt sense of badness, motivation, and norm violation” (this volume, p. 21). This also fits nicely with ME: see the talk of “experience” in section 4.3. Again, no disagreement.

4.7 Rey’s bait and switch

Ignorance discussed both sides of the distinction in section 4.6 here. My rejection of VoC and embrace of ME is, of course, a thesis about the metalinguistic intuitions. That thesis is the “third major conclusion” of the book (Devitt 2006b: 120). Many chapters later, I entertain a thesis about the very different matter of the processes of language production and understanding:

the speedy automatic language processes arising wholly or, at least, partly from linguistic competence are fairly brute-causal associationist processes that do not operate on metalinguistic representations of the syntactic and semantic properties of linguistic expressions. (Devitt 2006b: 229)

This was not a “main conclusion” but rather a “tentative proposal.” It was tentative because, I argued, we do not have nearly enough evidence about the workings of the mind to adopt it or its rival, the widespread Chomskian view that SDs are involved in processing. I summed up my attitude to the brute-causal alternative: The point is not, of course, to offer the alternative as a complete explanation of language use. Like the received view, it is far far too lacking in details for that. The point is rather to suggest that the best explanation is more likely to comply with the brute-causal alternative than the received view. And the considerations favoring the alternative are, it goes without saying, far from decisive. The . . . proposal really is tentative. (Devitt 2006b: 221; see also 229)¹⁴

¹⁴ David Pereplyotchik (2017) offers a detailed defense of the widespread view and criticism of my tentative proposal.

My proposal on this fascinating empirical issue was made thirteen years ago. Perhaps if I examined the latest evidence I would change my view now (but I doubt it). The important point here, however, is that this processing issue is not what we are debating, and my tentative proposal about it plays no role at all in my argument against VoC. So, why do I raise the issue? Because, although Rey (this volume, p. 34) promises to provide “evidence” for VoC that seems to him “to make it . . . empirically plausible,” what he mostly provides is evidence for the widespread Chomskian view of processing and against my tentative proposal. His promise requires him to produce evidence that the SDs that he thinks play a role in processing are also accessible to the CP and used to form intuitions. What Rey delivers, time and again, is evidence that SDs do play a role in processing. Rey’s case for VoC is largely a bait and switch. This is very puzzling. (a) My argument against VoC in Ignorance obviously does not rest on my processing proposal; indeed, it precedes that proposal by more than a hundred pages. (b) I could not rest that argument on a “tentative proposal,” given that the rejection of VoC is (part of) a confidently presented “main conclusion.” At least, I could not if in my right mind. (c) I started my exchange with Rey (described in section 4.5 here) by declaring that “I will go along with the widespread view [of processing] for the sake of argument” (Devitt 2014b: 280). So my tentative proposal plays no role in my criticism of VoC in the exchange discussed here either. (d) Finally, I noted there that the evidence Rey cites in support of VoC “provides no evidence at all of what is needed: no evidence that the non-central language system provides SDs to the central processor” (Devitt 2014b: 287). Rather, the cited evidence for the aforementioned widespread view is that processing involves SDs (285–6). I was, in effect, accusing him of a bait and switch. Yet, despite all this, he is still baiting and switching.¹⁵ The bait and switch is very important to the dialectic. For, whereas Rey is on very weak ground in arguing for VoC, he is on quite strong ground (even though not strong enough!) 
in arguing for the widespread view of processing. The first clear sign of the switch comes after Rey summarizes four problems, (A) to (D), that I raise in my “serious objection (I)” to Rey’s VoC (see Devitt 2014b: 284–5). Rey (this volume, p. 41) responds by immediately drawing attention to my view that the literature on skills—such as catching a ball or playing a piano—“should make us doubt that language use involves representing syntactic and semantic properties; it makes such a view of language use seem too intellectualist” (Devitt 2006b: 221). I am there doubting what Rey of course does not doubt: that language processing involves SDs. But my doubt is irrelevant to the issue at hand. For my four problems are with Rey’s view that the language system provides SDs to the CP, with Rey’s view of how competence provides the content of intuitions. So that is the view that Rey needs to be arguing for.

¹⁵ Related to this, Rey’s “summary” of the ME view of intuition formation—his “(Dev)” (Rey, this volume, p. 46)—is actually, until its final mention of “intuitive reports,” an account of my proposal about language processing. It does not capture the ME.

My discussion of Rey’s VoC does not challenge the widespread view that SDs play a role in processing; indeed, I have explicitly gone along with the view that they do. After a discussion of “non-conceptual content”—to be addressed in section 4.8 here—the switch continues. My four problems are with Rey’s VoC. Yet he responds to them by criticizing, at length, my tentative proposal about processing (Rey, this volume, pp. 42–45). For example, he claims that I owe “some story about how the central processor could recognize, say, a word, a noun, verb, NP, VP, IP, so as to summon up the ‘same sentence’ in Mentalese” (this volume, p. 44). Indeed, I owe at least a favorable comparison of my proposal’s handling of this recognition problem with its handling by the received view of processing that Rey likes. And I provide that comparison (Devitt 2006b: 225). I am sure it is not to Rey’s liking, and he may be right: we have little basis for confidence on this processing issue. But what I am now emphasizing is that these processing issues are beside the point of VoC, which is what Rey presents himself as arguing for. When Rey (this volume, pp. 46–8) comes to provide “the evidence” for his VoC, the focus is again on arguing about what is not at issue: that SDs are involved in parsing. Finally, consider Rey’s “Conclusion.” It begins: “It certainly appears that the evidence seriously favors (VoC) over [(ME)], at least for some standard phonological and syntactic properties.” There’s the bait, promising evidence that SDs delivered by competence provide intuitions about those properties. The switch follows immediately:

speakers can be intensely sensitive to phonological and syntactic properties independently of the “message” that might be conveyed by perceived speech—or might not be in the case of technical prose and syntactic nonsense! 
It is hard to see how this apparently perceptual sensitivity could be explained other than by presuming that many standard phonological and syntactic properties are perceptually represented. (Rey, this volume, p. 49)

This might be evidence that SDs play a role in language processing. It is not evidence for VoC. In sum, a great deal of Rey’s chapter is relevant to theories of language processing but simply irrelevant to VoC.¹⁶ I turn now to the rest of the chapter. I seek arguments that are actually for VoC (but passing over the frequent mentions of the vision analogy, which I am not discussing further).

¹⁶ My tentative proposal would have been relevant to VoC, had I argued (as I did not) that, since the processing system does not use SDs, it could not provide them for the CP.

  


4.8 Rey’s arguments for VoC

At this point we need to take account of a new feature of Rey’s discussion of VoC: his emphasis on the distinction between “conceptual and non-conceptual content”:

It seems pretty clear that Chomskyans are attributing non-conceptual content to the states of the I-language and associated systems, such as parsing . . . (let’s call them “NCSDs”) . . . The NCSDs structure our perception with relations like c-command and inaudible elements such as copies or PRO. (Rey, this volume, p. 42)

In the abstract of his paper, Rey charges me with overlooking the distinction he is here emphasizing.¹⁷ First, he claims that two of the problems I raise “are easily met by invoking appeals to [NCSDs]” (this volume, p. 41). I wonder why he thinks so. These problems are part of my serious objection (I) to Rey’s view that the CP has access to SDs that the speaker uses to form linguistic intuitions. Problem (A) is that ordinary speakers are not consciously aware of these SDs. Problem (B) is that it is hard to see how ordinary speakers could use SDs, given that it takes a few classes in syntax to understand them (Devitt 2014b: 285). But taking SDs to be NCSDs strengthens the case that ordinary speakers are not aware of them and would not understand them if they were. The move to NCSDs seems to worsen the problem for VoC. Later on, Rey (this volume, p. 45) claims that my serious objection (II) “misconstrues the VoC proposal” in taking it to concern SDs with conceptual content. In objection (II), I give reasons for thinking that even if SDs were available to the CP they could not “fairly directly cause” the intuitions (Devitt 2014b: 287–8). Now it is true that I did not take Rey’s proposal to be about NCSDs (even though I agree with him that the plausible view of processing involves NCSDs). I did not do so because taking them that way makes objection (II) stronger: if SDs are not conceptual, how could they provide the content of intuitions that are conceptual? Does Rey perhaps misconstrue VoC? More on this in what follows. Rey is strikingly unimpressed with objection (II). He quotes my argument that the allegedly presented SDs could not directly cause an intuition about an ungrammatical string and has two responses. 
(a) “It is enough that the perceiver in one way or another detects a difficulty in dealing with the material” (this volume, p. 45). But this is not enough for Rey’s VoC. Of course, a detection of difficulty may play a causal role in an intuition (§§4.3 and 4.6). And SDs may well play a causal role in the parser’s difficulty, and hence in the intuition. But VoC requires much more: that the SDs provide the content of the intuition as a result of being presented to the CP. Indeed, if detection of difficulty were enough, VoC’s requirement would be explanatorily otiose and should be abandoned in favor of ME (as I point out: Devitt 2014b: 287). (b) “In any case . . . there is a simple answer” to how NCSDs “would ‘fairly direct cause’ VoC intuitions: they would do so by serving as the representations on which intuitive judgments are causally, computationally and evidentially based” (Rey, this volume, p. 45). But this is not an answer; it is just a pronouncement that VoC can answer. We need some idea of how this story is possible and some evidence that it is actual.

¹⁷ He also charges me, mysteriously, with overlooking the distinction between “a grammar and a parser.” It’s hard to think of anything more central to Ignorance than this and related distinctions, e.g. between the rules described by the grammar and the rules that govern parsing (Devitt 2006b: 24–5).

Finally, consider the following passage:

I shall be assuming that, if some conscious (i.e. introspectible) perceptual task is sensitive to certain phenomena, that is a prima facie reason to suppose that those phenomena are in some way (e.g. at least non-conceptually) available for intuitive verdicts. (Rey, this volume, p. 46)

There are surely countless conscious perceptual tasks that falsify this assumption; for example, an outfielder catching a fly ball is sensitive to the acceleration of the tangent of the angle of elevation of gaze: his behavior keeps the acceleration at 0 (McLeod and Dienes 1996: 531). Yet that phenomenon is surely not available to his intuitions! Indeed, I have claimed the opposite of Rey’s assumption: the operation of the rules that govern perceptual modules “may yield information that guides the module in arriving at its message to the central processor about what is perceived. Yet the central processor has direct access only to the message, not to any intermediate information involved in arriving at it” (Devitt 2006b: 118). When I last looked, this claim was supported by the psychology of skills (Devitt 2006b: 210–20). Rey’s assumption may be the crux of his case for VoC, yet it seems baseless. In criticizing my view, Rey emphasizes that “it is impossible to hear one’s native language, normally pronounced, as mere noise.” He continues: Thus, just as visual SDs are the output of a visual one, linguistic SDs provide the input to (let us suppose) a central processing system, which then processes them in combination with other representations, for example of the experimental context of an utterance, in order to produce more or less spontaneous verdicts on what has been said (or seen, in the case of vision). (Rey, this volume, p. 39)

The “thus” illustrates that Rey takes the indubitable “hearing-as” fact as somehow counting for VoC and against ME. But it does not. That fact is simply a sign of the
effectiveness of our largely automatic language processing system. As I have pointed out, we hear a sentence as having certain linguistic features and not others “in that, as a result of all the processing in the language system, we come up with a representation that has those features and not others” (Devitt 2014b: 287).¹⁸ Rey (this volume, p. 47) notes this response and refers to an earlier criticism. In that criticism, Rey insists that it is not enough for the representation that we come up with to merely have those properties. For something could have them without being “thereby incorporated into a mind” and without the states being treated as having those properties by any system in the mind, central or otherwise. The states are certainly not perceived as having those properties. For that to happen, one would think the properties had better be represented, that is, made available to presumably computational processes of, say, early vision, recognition, comparison, and memory. (Rey, this volume, p. 43)

True enough, but not an objection! For my proposal is that the properties are incorporated in the mind: they are incorporated in the subcentral processing system that delivers to the CP the representation that has the properties. And I’m going along with the view that these properties are indeed represented by SDs that are available to computational processes in that system. So Rey seems to have misunderstood the proposal. Furthermore, he has not addressed the critical point that follows it. We have no reason to believe that, in thus hearing the sentence, the CP thereby has access to the SDs, and hence to the informational basis for intuitive judgments about its syntax. “Hearing an utterance in a certain way is one thing, judging that it has certain properties, another” (Devitt 2014b: 287).¹⁹ Hearing-as provides no support for VoC. It’s tempting to say that Rey’s responses to the considerable difficulties facing VoC are reminiscent of the old comic book line: “With one bound, Jack was free.”

But perhaps I have misunderstood Rey.²⁰ He claims to “defend . . . a version of what Michael Devitt . . . has called the ‘Voice of Competence’ (VoC) view” (this volume, p. 33) and seems to be doing so, for example in describing metalinguistic intuitions as “manifestations of an I-language” (this volume, p. 35). But perhaps he is not really doing so. Gross (this volume, p. 19n5) thinks that Rey has dropped “the content requirement” of VoC, namely that competence provides the propositional contents of intuitions. If so, Rey has dropped something essential to the VoC that I have attributed to linguists and rejected (§4.3 here). I took Rey to hold that the CP “has access” to SDs (Devitt 2014b: 280); and he has not denied that he does. But perhaps, contrary to what I assumed, this “access” is not supposed to provide the intuition’s content but simply to identify a reliable source of evidence for the intuition. But then he is defending a version of ME, not VoC, and we have been at cross-purposes.

¹⁸ Rey (this volume, p. 43) wrongly takes this as requiring that “the natural language sentences themselves are actually entokened at the interface between the language module and the central processor.” It requires only that there be a representation of some sort with the appropriate features. (For my views about what sort that is, see Devitt 2006b: 142–62.)
¹⁹ On one occasion, in a strange slip of the mind, I have written as if there were a direct route from hearing-as to intuition (Devitt 2014a: 15).
²⁰ I thank an anonymous referee for raising this issue and Steven Gross and Georges Rey for comments.

4.9 Gross’ argument for VoC

As mentioned in section 6, Gross (this volume, pp. 21–2) takes competence to play a role in causing intuitions about an ungrammatical string on the grounds that it includes a monitoring system that generates “error signals,” perhaps associated with a phenomenology. I have no problem with this: it is an example of competence providing evidence from usage. But, as Gross well realizes, it does not get us to VoC. To get there, it is necessary that these error signals be representations that are presented to the CP and provide the contents of intuitions. Gross (this volume, p. 22) claims that “it is a natural thought” that the signals indeed meet this “content requirement.” I can be brief in response because my “serious objection (I)” to Rey’s (2014b) proposal carries straight over to this “natural thought.” As I have emphasized here, Rey provided evidence of the role of representations, SDs, in subcentral language processing. But, I argued (Devitt 2014b: 284–7), he did not provide evidence that the CP has access to these representations. Similarly, Gross offers evidence that language processing involves a monitoring system that provides error signals that may well be representations. But he offers no evidence that the CP has access to those representations. (Gross does not take account of my exchange with Rey.) My “serious objection (II)” to Rey’s proposal for an ungrammatical string was that, even if the language system did provide an SD to the CP, it would not thereby provide the informational content of a speaker’s intuition, the content that the string is ungrammatical (unacceptable). For the “SD does not come with a sign saying ‘ungrammatical’ ” (Devitt 2014b: 287). Rey agrees in the present paper (this volume, p. 45). Interestingly, Gross suggests that the language system delivers an error signal with a content that is indeed along the lines of This string is unacceptable. This could of course provide the intuition. But we have even less reason to believe that the language system outputs a representation with this content to the CP than to believe that it outputs an SD. So we should reject Gross’ VoC account of intuitions about ungrammatical strings. What does Gross say about our intuitions that a string is acceptable or, as I prefer to say, grammatical (see §4.2 here)?

  

There is an obvious asymmetry here. In such cases, there is presumably no error signal to play an etiological role, so the content of the intuition would not seem to be the content of some output of the language faculty. (Gross, this volume, p. 26)

Gross contemplates several possible answers: (i) the “absence [of error signals] leads to judgments of acceptability”; (ii) there is a “non-error” signal “with content to the effect that: This string is acceptable”; (iii) “the speaker’s having comprehended what was said” causes the intuition (this volume, p. 27). In response, I note that only (ii) would meet VoC’s content requirement. Indeed, the roles that (i) and (iii) give to competence in causing intuitions fit ME nicely (see §4.3 here). And the problem with (ii) is, of course, that we have no reason to believe that competence does provide such a representational content. Finally, Gross (this volume, p. 29) considers an intuition about co-reference. He claims that “the intuition amounts to little beyond comprehension—the content requirement is fulfilled and arguably the Voice of Competence is vindicated.” Now I allowed that such intuitions are the most plausible ones for VoC (Devitt 2014b: 287). Thus, suppose speakers are presented with our example (1) John seems to Bill to want to help himself and asked: “Who does John seem to Bill to want to help?” Almost all speakers will answer “John,” thus providing evidence that “himself” must co-refer with “John.” This evidence from usage is provided simply by the speakers’ linguistic competence. And, as Gross in effect points out (this volume, p. 28n13), it is only a short step from such an answer to the metalinguistic intuition expressed by answering “John” to the question “Which name does ‘himself’ co-refer with?”²¹ Still, it is a step: competence does not provide the concept of co-reference, and hence not the content of the intuition. So we should resist Gross’ claim.

4.10 Conclusion

Gross (this volume, p. 32) presents his VoC cautiously: his “speculation deploys a fair number of ‘maybes’ and ‘perhapses.’ ” Rey’s presentation is similarly cautious: his claim is not that “the model is true,” just that it is “scientifically reasonable” (this volume, pp. 34 and 49). I don’t think that either Gross or Rey has supplied the sort of empirically based details that make VoC worth pursuing. Aside from that, what is the theoretical motivation for their VoCs, each requiring the positing of novel and dubious cognitive states and processes? ME seems to do the explanatory work without this novelty. Neither Gross nor Rey has attempted to show that ME does not do this work. Occam favors ME.

²¹ The step would seem particularly short if “refer” were like “true” in having a purely “expressive” role. But I rather doubt that it is (Devitt and Porot 2018: 1564–5).

5 Do generative linguists believe in a Voice of Competence?

Karen Brøcker

5.1 What is the received view?

In linguistic fieldwork in general and in generative syntax in particular, the practice of using native speakers’ intuitive judgments about the well-formedness of sentences is ubiquitous.¹ But what is the justification for this practice? More specifically, why are linguists justified in using intuitive judgments as evidence for theories of grammar? Although philosophers have debated this question following discussions between Devitt and his critics (see e.g. Devitt 2006c, 2010a, 2014b; Collins 2008b; Textor 2009; Rey 2014b),² it is still unclear what generative linguists themselves think about the matter. It is the purpose of this chapter to find out. In fact much of the criticism of Devitt’s contribution has focused on his claims about what generative linguists actually think. Devitt has called the view that he takes to be the received view among generative linguists the Voice of Competence (VoC) view. On this view, as Devitt characterizes it, we are justified in relying on linguistic intuitions on the grounds that they are fairly directly caused by the speaker’s linguistic competence. More specifically, a core part of the account is the assumption that the rules of a speaker’s grammar are represented in that speaker’s mind or brain. Then, when the speaker is presented with a sentence and is asked to make an acceptability judgment about that sentence, the speaker unconsciously derives an answer from the rules that are represented in their mind. If the stimulus sentence is not permitted by the rules, the content of the acceptability judgment will be “not acceptable” and, if the sentence is permitted by the rules, the content will be “acceptable.”³ In this way the speaker’s linguistic competence provides the content of his or her intuitive linguistic judgments. On this view, there is still room for noise from performance factors such as lack of attention, limitations on working memory, and so on to influence intuitive judgments.

¹ In this chapter I am only concerned with intuitive judgments of morphosyntactic well-formedness, usually called acceptability judgments or grammaticality judgments. Intuitive judgments are used as evidence in other subfields of linguistics as well, however.
² In the present volume, this question is debated in chapters by Gross, Rey, Devitt, and Collins, in Drożdżowicz’ chapter for intuitive judgments about meaning, and in Santana’s chapter from a different perspective altogether.
³ Here I gloss over whether ordinary speakers can give acceptability judgments or instead give grammaticality judgments. See Culbertson and Gross (2009), Devitt (2010a), and Gross and Culbertson (2011) for a discussion.

Karen Brøcker, Do generative linguists believe in a Voice of Competence? In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Karen Brøcker. DOI: 10.1093/oso/9780198840558.003.0005

Devitt (2006c) notes that what he calls the standard version of VoC, in which rules are represented in speakers’ minds, might not be the received view. Instead, he suggests, generative linguists might believe in a non-standard version of VoC, where the speaker’s competence still supplies the informational content of that speaker’s intuitive judgments but where this is based on embodied rather than represented rules. He questions, however, what such a non-standard view could consist of. VoC, as Devitt characterizes it, seems to follow naturally from traditional Chomskyan mentalist views, in which talk of representations, rules, and knowledge features heavily. Devitt (2006c) lists a couple of quotations that support his reading, including the following:

it seems reasonably clear, both in principle and in many specific cases, how unconscious knowledge issues in conscious knowledge . . . it follows by computations similar to straight deduction. (Chomsky 1986b, quoted in Devitt 2006c: 483)

Quotations like this one, along with the general use of terms like “representation,” “rules,” and “knowledge,” do seem to invite the conclusion that VoC is (or was, at least at some point) the received view within generative linguistics. But, as mentioned, some of the criticism of Devitt’s work on linguistic intuitions has focused exactly on the claim that VoC is the received view within generative linguistics. Ludlow (2011: 70–1), for instance, criticizes Devitt’s interpretation of the quotation here and of a couple of other ones. He takes the Chomskyan passage to be about “unconscious knowledge” of linguistic facts, rather than of linguistic rules, issuing in conscious knowledge. He even mentions that earlier drafts of his own text contained quotations that suggested belief in VoC, but he attributes this to “careless exposition” and suggests that the same might be true of other writers. Against VoC, Pietroski (2008) and Collins (2008b) reject the idea that intuitive judgments are used as evidence in generative linguistics on the grounds that they are about grammaticality. Instead, they argue, generative linguists use intuitive judgments as evidence because they can be seen as evidence of grammaticality. In other words, it is not on account of the content that intuitive judgments have that they are used as evidence by generative linguists. Rather, they are used as evidence because they are (thought to be) products of the speaker’s mental grammar (along with other cognitive systems) and for that reason tell us something about what the mental grammar must be like. Textor (2009) and Rey (2014b) both characterize what they take to be the received view in generative linguistics, and their
        ?

71

characterizations differ from Devitt’s VoC on certain central aspects, although neither of them rejects the term “the Voice of Competence.”⁴ The question of whether VoC is in fact the received view in generative linguistics has received almost more attention in the debate, it seems, than the separate question of what the received view should be. But, even so, at this point in the debate it is still an open question whether VoC as Devitt characterizes it really is the received view among generative linguists or not. From an outsider’s perspective, the attribution of VoC to generative linguists does not seem entirely surprising. As mentioned, it accords well with talk of “knowledge,” “rules,” and “representations,” which, historically, has been part of what has set the mentalist, generative approach to linguistics apart from other approaches. This makes the strong opposition that this attribution has been met with within generative linguistics all the more interesting. If the question of what the received view is were to be settled, the focus of the debate could move on to the justification question itself. This calls for an investigation into what the received view on the justification question really is in generative linguistics; it also calls for an answer to the question of whether that received view is VoC or not. What tacit assumptions underlie the practice of using linguistic intuitive judgments as evidence? Do those assumptions in fact amount to VoC, either standard or non-standard? And, if not, what is the received view then? In the rest of this chapter I report the results of a study that investigates just this problem. The study consists of a questionnaire that was centered on seven central aspects of VoC. I describe those aspects in the next section, before presenting the study and its results in section 5.3. In section 5.4 I discuss what the results tell us about the received view on the justification question in generative linguistics and conclude that the evidence suggests that VoC is not the received view in this field, at least not currently.

5.2 Dissecting VoC

The main objective of my study was to test the hypothesis that VoC as characterized by Devitt is the received view in generative linguistics. To test it, VoC was divided up into seven sub-issues, which I present in more detail in this section. Each sub-issue was made the focus of one question in the questionnaire. The questionnaire was designed this way in order to heighten the validity of the study by attempting to make each part of the view clear to participants. The design also meant that, since participants considered each part of the view separately, if VoC turned out not to be the received view, we would know how the received view differs from VoC and where the two views agree. The sub-issues are the following:

• Are linguistic intuitive judgments useful as evidence owing to their connection to competence, or owing to their connection to the speaker’s experience of reflecting on his or her language?
• Should we use judgments of acceptability or grammaticality as evidence in linguistics?
• Does the speaker’s linguistic competence supply the data for intuitive judgments, or does it supply the full informational content?
• Do speakers have direct, Cartesian, infallible access to truths about their language through linguistic intuitions (noise from performance factors aside)?
• Should we have a mentalist or non-mentalist view of grammar?
• Should we think that the structure rules of a speaker’s language are somehow implemented in that speaker’s mind? Or is it enough that whatever is in the speaker’s mind respects those rules?
• And, if the structure rules are implemented, are they represented or merely embodied?

⁴ When I use the term “VoC” in this chapter, I generally use it to mean the view characterized by Devitt.

In the following sections I expand on each of these issues.

5.2.1 Competence or experience with reflecting on sentences?

On Devitt’s characterization of VoC, syntactic intuitions are good evidence because they are, partly, a product of the speaker’s linguistic competence. On his own view, syntactic intuitions are, by comparison, good evidence, at least to some extent, because they are made by a speaker who has experience with reflecting on his or her language (Devitt 2006c: 497). And although, on his view, competence does play some role in the etiology of intuitive judgments, competence is not the main reason why intuitive judgments are good evidence in linguistics. If VoC is the received view within generative linguistics, we should find a majority of participants indicating that intuitive judgments can be used as evidence because of their connection to competence.

5.2.2 Acceptability or grammaticality?

Another, much debated, disagreement between Devitt and his critics is over whether the syntactic intuitions used as evidence by linguists should be grammaticality or acceptability judgments.⁵

⁵ It is frequently noted that the term “grammaticality judgment” is widely used in the generative literature to refer to what should, by Chomsky’s (1965) definition, be called “acceptability judgments” (e.g. Schütze 1996; Den Dikken et al. 2007; Culbertson and Gross 2009).

On the traditional generative view, an acceptability judgment is a speaker’s report of how acceptable, natural, or good a sentence strikes that speaker as being. Whether a sentence is in fact permitted by the speaker’s grammar or not is a crucial part of the reason why a sentence will be experienced as (un)acceptable by a speaker, but other things might influence this experience as well, such as whether a sentence is highly taxing for the working memory to process. In the generative literature, you find two different uses of the term “grammaticality judgment.” In the first one, “grammaticality” actually means “generated by the speaker’s [mental] grammar” (Schütze 1996: 20). In those contexts, generative linguists will usually remark that speakers do not have direct intuitions about whether a sentence is permitted by their mental grammar or not without interference from factors like working memory, attention, and so on (see e.g. Schütze 1996: 26). In the other use, “grammaticality judgment” means something like a hypothesis about “what sentences are permitted, or generated, by a grammar” (Culbertson and Gross 2009: 722). On this interpretation, whether the sentence is deemed grammatical or not depends on the person’s theory of grammar; also, judgments of this kind are not traditionally considered evidence in the generative linguistic literature. On Devitt’s own view of linguistic intuitive judgments, on the other hand, these are the grammaticality judgments we should use as evidence in linguistics. Devitt (2006c: 488) notes that many generative linguists argue for the use of acceptability judgments rather than grammaticality judgments. This support for acceptability judgments is taken to be part of VoC.

5.2.3 Role of mental grammar: Supplying data or content?

There are two central aspects of VoC as Devitt characterizes it. First, on his account, the speaker’s linguistic competence plays a special role in the formation of an acceptability judgment by providing some form of special input to that judgment. Since the input comes directly from the speaker’s competence, it gives privileged access to truths about the speaker’s language. Devitt contrasts this characterization with a characterization of judgments about touch-typing, which are more likely based on the subject’s experience of the activity itself and not on any special input from a specialized cognitive module.⁶

Second, on this view, the speaker’s linguistic competence delivers the informational content of the judgment without the involvement of a centralized cognitive system that we may call the central processor. The central processor receives input from specialized cognitive modules and can form conscious beliefs and decisions on the basis of those inputs. According to VoC, linguistic intuitive judgments are not just central processor judgments based on some special input from the speaker’s competence. Instead, the central processor is not involved at all, and the competence directly provides the informational content of the judgment (say, the proposition “That sounded fine to me”). On this view, one representation (the rules in the mind) issues in another representation (the content of the judgment), with no input from the central processor. Devitt endorses this second requirement for VoC in his response to Maynes and Gross (2013), who argue that, rather than the speaker’s competence delivering the content of the judgment, the content of the judgment might be created by the speaker’s central processor reacting to the arrival of a parse or, alternatively, to a failure to parse the sentence. In their opinion, that would still allow for a fairly direct connection between the speaker’s competence and the content of his or her intuitive judgments, but the competence would not directly provide the content of the intuitive judgment. In this case, Devitt writes, “the presence or absence of [a successful parse] would be the data for the central processor’s response. So, not VoC again” (Devitt 2014b: 286). On VoC, “competence does provide [linguistic propositions that could be true or false]” (Devitt 2010b: 254). In contrast, Devitt (2014b) himself argues for a view on which competence provides just data that are then further processed by the central processor. On VoC, the central processor is not involved in coming up with the informational content of the intuitive judgment.⁷ Furthermore, on VoC it is assumed that the way the speaker’s competence provides the informational content of intuitive judgments is that the verdict of the judgment is deduced from rules that are somehow implemented in the speaker’s mind (more on this later in this section). As this is the only specific proposal that I know of regarding how the speaker’s competence could supply the informational content of intuitive judgments (rather than the data for such judgments), the question that raises this issue in the questionnaire is phrased in terms of intuitive judgments being deduced from mental rules.⁸

⁶ Of course, one might hold the alternative view that judgments about acceptability are in fact, very much like judgments about touch-typing, based on our experience with the activity. Devitt himself argues for such a view.

5.2.4 Fallibility/direct access?

On VoC, Devitt argues, the fact that competence supplies the informational content of intuitive judgments means that through these intuitions we have direct, Cartesian access to facts about language. On the VoC view, the propositions expressed as linguistic intuitive judgments will, noise aside, be true because they come directly from the speaker’s linguistic competence. Judgments might still be influenced by performance factors, however, just as our competence can be influenced by performance factors when we are producing or comprehending utterances (Devitt 2006c). For this reason, speakers in practice do not have infallible access to these facts; but, if we imagine that performance factors could be filtered out, then that access would hypothetically be infallible. Alternatively, if intuitive judgments were the result of central processor deliberation on data from the speaker’s competence, then intuitive judgments could be wrong about the speaker’s language even if performance factors were filtered out. This can be the case, for instance, if the speaker holds some mistaken theory or belief about grammar and applies it in the judgment. This would not be a result of performance factors per se, such as lack of attention, working memory limitations, and the like, but would still lead to an intuitive judgment that is incorrect about the sentence in question.

⁷ See Devitt’s chapter in this volume. This is, I believe, a central difference between Devitt’s strict VoC and Rey’s (and others’) less demanding competence-based views.
⁸ One could think up other potential (perhaps empirically implausible) proposals, of course. The content of intuitive judgments could, in principle, be related to output of the speaker’s linguistic competence by certain probabilities, for instance (thanks to Samuel Schindler for this suggestion). This would satisfy the requirements of the competence determining the content of intuitive judgments without the central processor being involved. This proposal is not based on any particular empirical evidence, however, and I have not seen anyone argue for it.

5.2.5 Mentalist view of grammar?

On a non-mentalist conception of grammar, the aim of grammatical research is to account for the external patterns we can observe in languages, whereas on a mentalist conception of grammar, the aim of grammatical research is to account for the internal mental mechanisms that give rise to the externally observable linguistic patterns. Fitzgerald (2010: 124) describes how, when psychologists collect intuitions, they usually do it because they are interested in what these intuitions “reveal about the psychological states of the people that have them,” while at least some philosophers might instead be interested in intuitions “because they are revelatory of a non-psychological domain of facts.” These different approaches correspond to a mentalist and a non-mentalist conception of grammar respectively. Although mentalism is not part of VoC itself, Devitt describes the received view in generative linguistics as being mentalist. A mentalist conception of grammar fits well with VoC, whereas a non-mentalist conception less obviously combines with VoC, and so this issue is included in the study.

5.2.6 Implemented structure rules?

Another issue related to VoC, although not directly a part of the view, is the question of how the mental linguistic system is organized and what intuitive judgments may tell us about that. Devitt (2006b) points out that the rules governing the structure of the output of a competence (the structure rules) need not necessarily be included among the rules governing the exercise of that competence (the processing rules). He argues that, in a case where the set of processing rules does not include the structure rules, the processing rules must at least respect the structure rules, that is, yield output that conforms to the structure rules, but there are no restrictions on how the processing rules must work to make it so. There is no detectable difference in the output, and therefore we cannot say, on the strength of merely identifying the structure rules that govern some output, whether these rules are included among the processing rules or are only respected by them. Separately from the question of whether structure rules are included among or just respected by processing rules, one can also ask how the processing rules are implemented. One possibility is that the processing rules are represented in the mind of the speaker; another is that they are no more than embodied in the mind of the speaker. Devitt compares the difference between these two possibilities to the difference between a general-purpose computer that has a calculator program installed and a specialized pocket calculator. In the first case, the general-purpose computer reads the represented rules in the calculator program when loading and executing the program. In the second case, the processing rules that govern the execution of calculations are not represented in the architecture of the calculator; they are merely embodied as part of the way the calculator is wired. On what Devitt calls the standard version of VoC, the structure rules are considered to be included among the processing rules, and they are hypothesized to be represented. One reason why he characterizes VoC in this way is that traditional generative literature speaks explicitly of mental representation of rules. Another reason for attributing this view to generative linguists might be that it fits well with another part of VoC, namely the idea that competence provides the informational content for intuitive judgments as discussed earlier in this section. If the structure rules that govern the output of our linguistic competence are represented in the mind, then the question of how competence could provide the content of intuitions becomes a question of how representations of one sort lead to representations of another sort (Devitt 2006c: 484), which seems straightforward. Devitt (2006c) finds this aspect of VoC cognitively very immodest. He supports instead the alternative hypothesis on which whatever is in fact implemented in the minds of speakers just respects the structure rules that govern the output of our linguistic competence. As mentioned, Devitt (2006b: 97; 2010a: 835) also considers a non-standard version of VoC, in which structure rules are embodied in the minds of speakers rather than being represented. He finds it unlikely, however, that embodied rules could give rise to the informational content of syntactic intuitions without some influence from the central processor.
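Devitt’s comparison between a general-purpose computer running a calculator program and a dedicated pocket calculator can be made concrete with a toy sketch. The Python below is only an illustration under stated assumptions: the names `RULE_TABLE`, `general_purpose_computer`, and `pocket_calculator_add` are hypothetical and come neither from Devitt nor from the study. In the first function the rule is a stored symbol that must be read before it is applied (a represented rule); in the second nothing is stored or read, and the rule is fixed by how the function is built (an embodied rule).

```python
import operator

# Hypothetical illustration; all names are invented for this sketch.

# Represented rule: the rule exists as a stored symbol that a
# general-purpose interpreter has to look up before acting on it.
RULE_TABLE = {"add": "+", "mul": "*"}
PRIMITIVES = {"+": operator.add, "*": operator.mul}


def general_purpose_computer(rule_table, rule_name, a, b):
    """Read the stored rule, then execute whatever it says."""
    symbol = rule_table[rule_name]    # step 1: read the representation
    return PRIMITIVES[symbol](a, b)   # step 2: act on what was read


# Embodied rule: no stored rule is consulted; addition is simply
# part of the way this device is "wired".
def pocket_calculator_add(a, b):
    return a + b


if __name__ == "__main__":
    print(general_purpose_computer(RULE_TABLE, "add", 2, 3))
    print(pocket_calculator_add(2, 3))
```

Both devices return the same result for the same inputs, which mirrors the point made above: the output alone does not reveal whether the rules behind it are represented or merely embodied.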

        ?

77

Let us sum up VoC on the basis of the discussion in this section. If VoC, in the form in which Devitt characterizes it, is indeed the received view among generative linguists, we should find the following positions to represent the majority view among the generative participants in the survey:
• Linguistic intuitive judgments are reliable as evidence because of their connection to competence.
• We should use acceptability judgments rather than grammaticality judgments.
• The speaker’s linguistic competence supplies the informational content of intuitive judgments.
• Speakers have direct, Cartesian access to truths about their language through linguistic intuitions (noise from performance factors aside).
• We should have a mentalist view of grammar.
• The structure rules of a speaker’s language are somehow implemented in that speaker’s mind (either by being represented or embodied).

5.3 The study

5.3.1 Materials and participants

For the questionnaire, seven questions were prepared, each covering one of the issues outlined in the previous section (these questions are listed in the appendix). The questionnaire also contained a background section with questions about demographic data and theoretical orientation, among other things, and a number of other questions related to the use of intuitive judgments in linguistics; these questions will not be reported here. I present the results in section 5.3.2.

The questionnaire was distributed through LinguistList, a widely read online linguistics bulletin board. Participants self-selected by responding to the call. At the end of the questionnaire, those who were interested could sign up for a lottery for a number of vouchers to an online store. In total, 192 participants filled in the questionnaire. Responses that were partially incomplete were discarded (n = 47), as were 11 responses that did not meet the participation criteria listed in the call for participants.⁹

In the background section of the questionnaire, participants were asked to identify their main theoretical orientation in linguistics (formal/generative, cognitive/functional, mixed, or other). Of the remaining 134 participants, 73 either identified as mainly formal/generative (n = 57) or pointed to formal/generative linguistics as one of their influences (n = 16). The answers from these 73 participants are included in the reported analyses.¹⁰ For a brief overview of the demographic background of these participants, see Table 5.1.

⁹ Of these 11, 4 participants were excluded for reporting thinking that intuitions cannot serve as evidence at all, 6 participants were excluded for reporting neither being a researcher holding a PhD nor being enrolled in a PhD program, and 1 participant was excluded for not reporting specializing in linguistics or in a closely related field.

Table 5.1. Demographic background of participants

Age range    ≤30: 12    31–40: 28    41–50: 17    51–60: 10    >60: 6
Gender       Female: 34    Male + Other: 39
Position     Researcher: 58    PhD student: 15
Location     USA, Canada: 26    Europe: 36    Rest of world: 11
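The participant figures above can be checked with a few lines of arithmetic. The sketch below simply re-adds the numbers reported in this section, in footnote 9, and in Table 5.1; the variable names are hypothetical, and the labels for the first and last age bins are an assumed reading of the printed table.

```python
# Figures as reported in section 5.3.1 and footnote 9.
total_responses = 192
incomplete = 47
excluded_by_criteria = {
    "reported that intuitions cannot serve as evidence": 4,
    "neither PhD holder nor PhD student": 6,
    "not specializing in linguistics or a related field": 1,
}
remaining = total_responses - incomplete - sum(excluded_by_criteria.values())

generative_group = {
    "mainly formal/generative": 57,
    "formal/generative as an influence": 16,
}
analysed = sum(generative_group.values())

# Table 5.1 margins; the "<=30" and ">60" labels are assumptions.
table_5_1 = {
    "age": {"<=30": 12, "31-40": 28, "41-50": 17, "51-60": 10, ">60": 6},
    "gender": {"female": 34, "male + other": 39},
    "position": {"researcher": 58, "PhD student": 15},
    "location": {"USA, Canada": 26, "Europe": 36, "rest of world": 11},
}
margins = {name: sum(groups.values()) for name, groups in table_5_1.items()}

print(remaining, analysed, margins)
```

Each demographic breakdown adds up to the 73 participants whose answers are analysed, and the exclusions leave the 134 participants mentioned in the text.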

5.3.2 Results

For each question, participants were presented with a sentence and asked to complete or continue it by choosing, from the options presented, the one that best expressed their opinion. In each case, one option invited them to indicate an alternative answer of their own (except for the fallibility question; see more details about that later in this section). For each question, a chi-square goodness-of-fit test was applied in order to test the null hypothesis that all answer options were picked equally frequently (i.e., that there was no significant majority for any of the options). If the null hypothesis was rejected, post hoc pairwise comparisons were carried out to find out whether one option was picked significantly more frequently than any other.¹¹ An alpha level of .05 was used for all statistical tests, excluding the post hoc tests, where Bonferroni corrections were applied.

In what follows I start by presenting the results for each of the questions, then I take a look at whether the overall picture supports the claim that VoC is the received view within generative linguistics. The phrasing of each question is presented in the appendix at the end of the chapter.

Competence or experience with reflecting on sentences?
Participants could indicate that syntactic intuitions are reliable as evidence, either because intuitions express speakers’ linguistic competence or because they express speakers’ reflections about language. Participants were also able to supply their own alternative answer.

¹⁰ For more details on the study and for the full results, see Brøcker (2019).
¹¹ Specifically, the frequency of each option was compared to the frequency of each of the other options. This was done, again, using chi-square goodness-of-fit tests with equal frequency as the null hypothesis.

        ?

79

Seventeen (23%) participants chose “reflections,” 40 (55%) participants chose “competence,” and 16 (22%) participants provided their own alternative. The chi-square goodness-of-fit test showed that the answers were not equally distributed: χ² (2, n = 74) = 14.30, p = .001. The post hoc pairwise comparisons showed that “competence” was chosen by significantly more participants than were the other two options (“reflections” and “other”). The majority view on this question is thus that syntactic intuitions make for reliable evidence because of their connection to the speaker’s competence. This is in line with VoC.

Acceptability or grammaticality?
Participants were first introduced to a short example of acceptability and grammaticality judgments designed to make the distinction salient in the context (see the appendix to this chapter). Participants could then choose to indicate either that only acceptability intuitions can serve as evidence for theories of grammar or that only grammaticality intuitions can perform this function. Participants were also able to supply their own alternative answer. The chi-square goodness-of-fit test showed that the null hypothesis of equal frequency could not be rejected: χ² (2, n = 74) = 4.08, p = .13. This could be due either to the true absence of a result or to the lack of power of the present experimental set-up.¹² In all, 31 (42%) participants answered “acceptability,” 17 (23%) participants answered “grammaticality,” and 26 (34%) participants provided their own alternative. Several of those participants who provided their own option answered that both acceptability and grammaticality intuitions can serve as evidence. According to VoC, we should use speakers’ intuitions of acceptability, so this result does not provide support for VoC’s being the received view. Interestingly, it also shows that a sizable group of participants must disagree with the view, often expressed in the generative literature (as already mentioned, p. 73 in this chapter), that either speakers do not have (direct) grammaticality intuitions or such intuitions (theoretical hunches) cannot serve as evidence: a relatively large group of participants thought that grammaticality judgments can in fact serve as evidence.

¹² With 73 participants, 2 degrees of freedom, the chosen alpha level of .05, and Cohen’s (1988) suggested conventional power level of .8, effect sizes of .36 and above should be detectable. While one should be careful with post hoc interpretations of the power level of tests, this gives an indication that, if the test missed a result that is truly present, it would be at the lower end of the effect size scale, which is not what we would have expected from the literature.

Data or content?
Participants were presented with a statement saying that intuitive judgments are sometimes said to be deduced from the speaker’s mental grammar. They could then indicate that they found this either (i) a poor description of how intuitions are likely to be formed, (ii) a good way to talk about intuitive judgments that should, however, not be taken too literally, or (iii) a good description of how intuitions are likely to be formed. Participants were also able to supply their own alternative answer. Thirteen (18%) participants chose “poor description,” 44 (60%) participants chose “not too literally,” 8 (11%) participants chose “good description,” and 8 (11%) participants provided their own alternative. The chi-square goodness-of-fit test showed that the answers were not equally distributed, χ² (3, n = 73) = 49.36, p < .001, and the post hoc tests showed that “not too literally” was significantly more frequent than the other options. This shows that a majority of participants did not take deduction talk to be a hypothesis about the actual etiology of linguistic intuitions. This result goes against the hypothesis of VoC as the received view. As mentioned in section 5.2, this question was intended to bear on the question of whether the competence supplies data or content for linguistic intuitions. It is phrased in terms of deduction, as this is the only specific proposal that I know of for how the competence could provide the informational content of intuitions with no input from the central processor. So, tentatively, I think we may assume that the received view in generative linguistics is that the speaker’s competence provides the data (not the content) for intuitive judgments.¹³

Fallibility/direct access?
Participants were presented with a statement saying that, even if one could abstract away from performance factors, a speaker’s syntactic intuitions could still be mistaken about the grammatical properties of sentences. Participants were asked to indicate to what extent they agreed with this statement (“strongly agree,” “somewhat agree,” “neither agree nor disagree,” “somewhat disagree,” “strongly disagree”). The categories “strongly agree” and “somewhat agree” were combined into one category (“agree”) for the analysis, as were the categories “strongly disagree” and “somewhat disagree” (“disagree”).
In this case, participants were not able to supply their own alternative answer. Eighteen (25%) participants chose one of the “agree” options, 12 (17%) participants chose “neither,” and 41 (58%) participants chose one of the “disagree” options.¹⁴ These answers were not equally distributed, χ² (2, n = 71) = 19.80, p < .001, and the post hoc tests showed that significantly more participants disagreed with the statement than agreed or chose “neither.” In other words, a majority of participants thought that, if performance factors could be filtered out, linguistic intuitions would give infallible information about grammatical phenomena. This result is in accordance with VoC being the received view.

¹³ One might worry that it might not be clear to participants what the answer options to this question entail. However, the relatively low number of participants choosing “other” suggests that participants themselves were not especially worried about the interpretation of this question (compare with the acceptability or grammaticality question above, where 26 participants chose that option).
¹⁴ Two participants used an optional comments box to ask to have their answers to this question disregarded. Their answers are not included in the analysis. This suggests that the question may have been difficult for participants to interpret. One potential issue is with the term “performance factors.” If it is interpreted loosely, to mean all irrelevant factors, then this would likely inflate the number of subjects who disagree with the statement. On the other hand, if the term is interpreted very strictly, so that only certain specific sources of performance factors are included (say, memory limitations but not attention span), this would likely lead to an inflation of the number of subjects who agree with the statement. Due to limitations of space, I will leave this issue here.

Mentalist view of grammar?
Participants could indicate that the ultimate aim of grammatical research is to understand the systematic patterns found in linguistic behavior, or alternatively that the ultimate aim is to understand the linguistic capacity of the mind. Participants were also able to supply their own alternative answer. Thirty-three (45%) participants answered “capacity,” 26 (36%) participants answered “patterns,” and the remaining 14 (19%) supplied their own alternative answer. The initial chi-square goodness-of-fit test showed that the answers were not equally distributed, χ² (2, n = 73) = 7.59, p = .02. However, the post hoc analysis showed that no one answer was significantly more frequent than either of the other two answers. Most importantly, we could not reject the null hypothesis of equal frequency between the “capacity” answers and the “patterns” answers. This lack of a clear majority for the mentalist answer is surprising, both on the hypothesis that VoC is the received view and in light of the mentalist commitments of generative linguistics in general. However, when looking at the alternative options that participants provided, 13 out of 14 thought that grammatical research should aim to uncover the mental capacities as well as the externally observable linguistic patterns. This suggests that the participants in question do not subscribe to an interpretation of linguistics and grammar on which the one and ultimate reason to investigate the patterns found in language is to say something about the underlying mental capacities.
Still, these participants’ conception of grammar at least includes a mentalist component, and so one could argue that, while there is no majority for a purely mentalist conception of grammar, there is at least a majority for a conception of grammar that includes a mentalist perspective.

Implemented structure rules?
Participants could choose between the following options: that structure rules are likely implemented in the minds of speakers, that we can only infer that the mind works as if it was following the structure rules we observe, or that the structure rules tell us nothing about how language is processed by the mind. Participants were also able to supply their own alternative answer. Twenty-one (29%) participants answered that it is a good hypothesis that structure rules are implemented in the mind, 44 (60%) participants answered that we can only infer that the mind works as if guided by structure rules, 5 (7%) participants answered that structure rules tell us nothing, and 3 (4%) participants gave their own alternative option. The answers were not equally distributed, χ² (3, n = 73) = 59.11, p < .001, and the post hoc tests revealed that the majority view on this issue is that we can only infer that the mind works as if it was following structure rules. This result contrasts with VoC as characterized by Devitt.

Another interesting result came from asking those participants who thought that structure rules are likely to be implemented in the minds of speakers (as well as those who submitted their own alternative options) whether implemented structure rules are likely to be explicitly represented in speakers’ minds or not. Participants were also able to supply their own alternative answer. Twenty-four participants from the generative group answered this question: 12 (50%) participants answered “represented,” 10 (42%) participants answered “implemented but not explicitly represented,” and 2 (8%) participants supplied an alternative answer. The post hoc tests did not show any one answer to be significantly more frequently chosen than the other two, even though the initial chi-square test showed that the answers were not equally distributed, χ² (2, n = 24) = 7.00, p = .03. This shows that, even among the participants who do think that structure rules are likely to be implemented in speakers’ minds, there is no agreement that rules are represented rather than embodied (the difference between the standard and non-standard versions of VoC, as described in section 5.2).
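The statistics reported in this section can be rechecked from the published counts. The sketch below, in plain Python, recomputes the omnibus goodness-of-fit test for the mentalist question, runs pairwise follow-up tests, and approximates the smallest effect size detectable under the settings given in footnote 12. Several details are assumptions rather than the author's documented procedure: each post hoc comparison is read as a two-category goodness-of-fit test, the Bonferroni divisor is taken to be the number of pairs, and the power calculation uses the standard noncentral chi-square approximation with noncentrality n·w². All function names are hypothetical.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        series = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * series
    series = sum(half ** (i + 0.5) / math.gamma(i + 1.5) for i in range((df - 1) // 2))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * series


def goodness_of_fit(counts):
    """Test the null hypothesis that all options are picked equally often."""
    expected = sum(counts) / len(counts)
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, chi2_sf(stat, len(counts) - 1)


def pairwise_posthoc(counts, alpha=0.05):
    """Two-category follow-up tests against a Bonferroni threshold (assumed)."""
    pairs = list(combinations(counts, 2))
    threshold = alpha / len(pairs)
    return {(a, b): goodness_of_fit([counts[a], counts[b]])[1] < threshold
            for a, b in pairs}


def noncentral_chi2_sf(x, df, lam, terms=80):
    """Upper tail of a noncentral chi-square as a Poisson mixture of central ones."""
    weight, total = math.exp(-lam / 2.0), 0.0
    for j in range(terms):
        total += weight * chi2_sf(x, df + 2 * j)
        weight *= (lam / 2.0) / (j + 1)
    return total


def detectable_w_df2(n, alpha=0.05, power=0.8):
    """Smallest Cohen's w that reaches the target power when df = 2."""
    critical = -2.0 * math.log(alpha)  # exact chi-square quantile for df = 2
    low, high = 0.0, 50.0
    for _ in range(60):  # bisection on the noncentrality parameter
        mid = (low + high) / 2.0
        if noncentral_chi2_sf(critical, 2, mid) < power:
            low = mid
        else:
            high = mid
    return math.sqrt(high / n)


mentalist = {"capacity": 33, "patterns": 26, "other": 14}
stat, p = goodness_of_fit(list(mentalist.values()))
posthoc = pairwise_posthoc(mentalist)
print(f"omnibus: chi2 = {stat:.2f}, p = {p:.2f}")
print(posthoc)
print(f"detectable w with n = 73: {detectable_w_df2(73):.2f}")
```

On these assumptions the omnibus values agree with those given above for the mentalist question (χ² ≈ 7.59, p ≈ .02), only the comparison between “capacity” and “other” passes the corrected threshold, so no option is significantly more frequent than both of the others, and the detectable effect size comes out at roughly .36.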

5.4 The received view assessed

Table 5.2 shows an overview of the results presented in the previous section, compared to what would be expected if VoC were the received view. On two of the seven identified sub-issues of VoC, this study supports the prediction that VoC is the received view among generative linguists. On another two issues, the answers found to be the majority views in this study go directly against the VoC prediction, and for two issues no clear majority was detected, which does not lend support to the VoC prediction.¹⁵

Table 5.2. Majority answers and their agreement with VoC

Issue                                      Majority answer            In agreement with VoC?
Competence or reflections on experience    Competence                 Yes
Acceptability or grammaticality            (No majority)              No
Data or content                            Only as if, not deduced    No
Direct access                              Infallible                 Yes
Mentalist view of grammar                  (No majority)              No
Implemented structure rules                Not implemented            No
Represented or embodied rules              (No majority)              -

¹⁵ As for the last issue, whether structure rules that are implemented in speakers’ minds are represented or embodied, a majority on either side would have been coherent with VoC.
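The tally described in the text can be read off Table 5.2 mechanically. The following sketch encodes the table rows and counts them in the way the paragraph above does; the encoding (using `None` for “(No majority)”) and the category labels are illustrative assumptions, not part of the study.

```python
from collections import Counter

# Rows of Table 5.2: (issue, majority answer, agreement with VoC).
# None stands for "(No majority)"; this encoding is an assumption.
TABLE_5_2 = [
    ("Competence or reflections on experience", "Competence", "Yes"),
    ("Acceptability or grammaticality", None, "No"),
    ("Data or content", "Only as if, not deduced", "No"),
    ("Direct access", "Infallible", "Yes"),
    ("Mentalist view of grammar", None, "No"),
    ("Implemented structure rules", "Not implemented", "No"),
    ("Represented or embodied rules", None, "-"),
]


def summarize(rows):
    """Count issues by how their majority answer bears on the VoC prediction."""
    tally = Counter()
    for _issue, majority, agrees in rows:
        if agrees == "Yes":
            tally["supports VoC"] += 1
        elif agrees == "No" and majority is not None:
            tally["majority against VoC"] += 1
        elif agrees == "No":
            tally["no majority, no support"] += 1
        else:
            tally["neutral"] += 1
    return dict(tally)


summary = summarize(TABLE_5_2)
print(summary)
```

The seven rows split into two that support the VoC prediction, two with a majority against it, two without a clear majority, and one (the represented/embodied issue of footnote 15) that is neutral either way.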

        ?

83

Importantly, a majority of participants did not believe that structure rules are implemented in the mind. Furthermore, participants rejected the idea that linguistic intuitions are in some literal sense deduced from the mental grammar of speakers. Another result that did not fit the hypothesis of VoC as the received view was that there was not a majority of participants answering that only acceptability intuitions can be used as evidence. Of those who provided their own answer, a number of participants wrote that they think that both acceptability and grammaticality intuitions can serve as evidence. Similarly, and perhaps even more surprisingly, there was not a majority of participants behind an exclusively mentalist view of grammar. So, while the quotations that Devitt presents do suggest that among generative linguists VoC is a prominent view on the justification question, this study, overall, offers evidence against the hypothesis that VoC is the received view in current generative linguistics. More specifically, there was not a majority of participants who thought that structure rules are in some way implemented in the minds or brains of speakers, and there was not a majority who thought that the content of intuitive judgments is deduced from the speaker’s mental grammar. This means that some of the central, but also cognitively highly immodest, assumptions of VoC do not seem to represent the majority view in generative linguistics. On two issues, however, the hypothesis of VoC as the received view was supported. The majority of participants did think that intuitions are reliable because of their connection to competence; and they did think that, noise from performance errors aside, intuitions give an infallible window on grammar. These results help explain why VoC might look like a good candidate description of the received view. However, together with the results mentioned above, they also raise further questions. 
How could intuitive judgments give infallible insights into grammar, noise from performance factors aside, if not via represented or embodied rules? And, if generative linguists do not believe in VoC, then what is the received view? To start with the latter question: how about a less demanding competence-based view, such as the view that Rey (2014b; forthcoming-b; this volume) defends? On two of the issues covered by the study, the majority view seems clearly in agreement with such a looser version of a competence-based view. Perhaps most importantly, a majority of participants answered that intuitions are reliable because of their connection to competence. Second, a majority of participants answered that intuitions are not somehow deduced from the speaker’s mental grammar. However, on other issues, the majority view is harder to reconcile with traditional generative, competence-based views. One example is the lack of a clear majority for an exclusively mentalist conception of grammar, although, as mentioned, the comments provided by participants suggested support for a conception of grammar that at least includes a mentalist component. This is surprising, given the strong mentalist focus in mainstream generative


literature and the fact that all the participants whose answers have been analyzed here see themselves as working within generative/formal linguistics. The lack of a clear majority for acceptability judgments over grammaticality judgments is also hard to reconcile with views often expressed in the general generative literature— namely that either there are no grammaticality judgments or grammaticality judgments cannot serve as evidence in linguistics. Thus the combined majority view that emerges from the results of this study is definitely competence-based, though it does not agree on all issues either with the strict VoC or with traditional generative literature. This leaves the question of whether the majority view as documented in this study can be combined into a coherent view and, most pressingly, how it is possible to explain that competence-based linguistic intuitions are infallible, performance factors aside, if this is not because competence delivers the informational content of the judgment. As such a view is not represented in the debate, and as the questionnaire was based closely on the views that have already been put forward, the results cannot tell us much about what generative linguists might think about this matter. It is possible that there simply are no widely shared beliefs about the underlying mechanisms. But at least the results from the questionnaire study point to these questions as interesting avenues for further investigation. Summing up, this study suggests that generative linguists do not believe in the VoC view as it is characterized by Devitt, with its heavy cognitive assumptions. Nor does the received view seem to match completely other proposed competence-based views, mainly through lack of a clear mentalist commitment. 
It is, however, clear that generative linguists subscribe to a view on which the speaker’s competence plays a prominent role and on which, performance factors aside, intuitive judgments give us access to truths about our language independently of the speaker’s theory of grammar.

5.5 Appendix

Below are the questions as they were phrased in the questionnaire. After each question, participants could optionally leave a comment with any further thoughts they might have on the issue.

Competence or experience with reflecting on sentences
When syntactic intuitions are reliable as evidence, this is mainly because . . .
. . . they are speakers’ reflections about language use, and speakers are to some degree reliable judges about this.
. . . they express speaker’s competence in their native language.
[Supply other answer]

Acceptability and grammaticality
Most native speakers of English would find the sentence “the dog that the woman that the man saw owned ran” to be an intuitively unnatural sentence in English. In the terminology of generative linguistics, the sentence is unacceptable to those speakers (this is not to be confused with whether or not the sentence is infelicitous to the speakers in some specific context). Most native speakers would also most likely find the sentence intuitively ungrammatical. However, some experts argue that it is, in fact, grammatical.
To the extent syntactic intuitions can serve as evidence for theories of grammar, only those syntactic intuitions can serve as evidence that are . . .
. . . acceptability intuitions.
. . . grammaticality intuitions.
[Supply other answer]

The role of mental grammar: Supplying data or content
Syntactic intuitions are sometimes said to be “deduced from the speaker’s mental grammar.”
This is probably a poor description of how intuitions are formed.
This is a good way to talk about how intuitions are formed but should probably not be taken too literally.
This is likely to be the actual process of how syntactic intuitions are formed in the mind.
[Supply other answer]

Fallibility/direct access
Imagine you could abstract away all performance factors that might influence a speaker’s syntactic intuitions. In that case, it would be possible for the resulting syntactic intuitions to be mistaken about the grammatical properties of the sentence.
Strongly agree / Somewhat agree / Neither agree nor disagree / Somewhat disagree / Strongly disagree

Mentalist view of grammar
When I study grammatical phenomena, I ultimately seek to understand . . .
. . . the systematic patterns found in linguistic behavior.
. . . the linguistic capacity of the mind.
[Supply other answer]

Implemented structure rules
Syntactic intuitions is one type of evidence used by linguists to characterise how languages are structured. Let’s call the rules that describe this the structure rules of particular languages. The structure rules that linguists describe are sometimes said to be “implemented in the minds of speakers.”
It is a good hypothesis that structure rules are actually implemented in the minds of speakers.
From the structure rules we observe, we can only infer that the mind works as if it was following those rules.
From the structure rules we observe, we cannot infer anything about how the mind processes language.
[Supply other answer]

The form of implementation of mental rules
There must be something in the mind that gives rise to what we call “rules of grammar.”
The rules of grammar are probably explicitly represented in the mind. If one could look into subjects’ minds, one could find explicit rules.
The rules of grammar are implemented in the mind, but they are probably not explicitly represented.
The rules of grammar are probably not implemented in the mind.
[Supply other answer]

Acknowledgments

This work was made possible by a grant from the Independent Research Fund Denmark (grant number: DFF 4180-00071). I want to thank Samuel Schindler, Anna Drożdżowicz, and Pierre Saint-Germier for extensive and very helpful discussions of the study design, as well as Samuel Schindler, Anna Drożdżowicz, and an anonymous referee for their valuable comments on earlier drafts of this chapter. I would also like to thank Michael Devitt for generously discussing his views with me on several occasions. Finally, I would like to thank the participants of the workshop “Linguistic Intuitions, Evidence, and Expertise” in Aarhus, October 2017, for their useful comments and suggestions on a talk that formed the basis for this chapter. Any remaining flaws or shortcomings of the chapter are, of course, entirely my responsibility.

6 Semantic and syntactic intuitions: Two sides of the same coin
John Collins

6.1 Introduction

The nature and evidential status of so-called grammatical or syntactic intuitions has been a focus for both linguists and philosophers. Considerably less attention has been paid to semantic intuitions. Such intuitions are ones that bear upon the semantic properties of sentences and phrases. What I shall suggest in the following is that syntactic and semantic intuitions are really aspects of the same basic phenomena, in the following sense. Syntactic intuitions reveal the conditions for a sentence to have an interpretation (or a contribution to an interpretation), and semantic intuitions reveal constraints on what can be said with a sentence with a fixed interpretation. So, most generally, speaker-hearers have intuitions about what can be said with a sentence, and it is the theorist’s job to figure out what such intuitions reveal about semantics and syntax. On this view, there are no direct intuitions about either syntax or semantics, at least not ones that serve as evidence for syntactic or semantic theories. We shall see how this perspective allows us to recognize the special evidential role of both sorts of intuition and why such a role follows from a general understanding of the nature of syntax and semantics.

The chapter is structured as follows. First, I shall succinctly spell out in broad terms what I take a syntactic and semantic theory to be, in a way that will not prejudge the standing of any of the substantive claims to be forwarded later. I intend the combined position to be acceptable to anyone who otherwise commends a generative perspective on syntax and a truth-conditional perspective on semantics. Second, I shall present some familiar reasons why syntactic and semantic intuitions are taken to be default-evidential and distinct from each other. Third, semantic intuitions will be accounted for as, in effect, reports that reflect or are informed by what can be said with strings. Fourth, a reciprocal account will be offered of syntactic intuitions as, in effect, reports that reflect constraints on what can be said. This phrasing is important, for in both cases I seek to reject a naïve view of intuitions that treats a speaker-hearer as being essentially correct in her judgments about her language, as if she were a fellow theorist; rather, the right view is to treat the fact that speaker-hearers have intuitions as a source of evidence for the nature of language, the underlying system. Fifth, I shall show how to accommodate cases where one appears to have interpretation without grammaticality and grammaticality without interpretation. Both cases are illusory, at least so described. In both cases, grammaticality (syntactic soundness) is aligned with interpretability. Sixth, I shall conclude by being explicit about the moral of the picture painted; in particular, it will be clear how the evidential status of intuitions of both kinds is preserved and supported by the general conception of linguistic theory presented in section 6.2.

John Collins, Semantic and syntactic intuitions: Two sides of the same coin. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © John Collins. DOI: 10.1093/oso/9780198840558.003.0006

6.2 Semantics and syntax: Linguistic phenomena Let LP schematically pick out the linguistic phenomena that are of primary concern to theorists in syntax and semantics: (LP) In context C, speaker-hearer A can use (produce/consume) S with interpretation I to say(/understand) P. Clearly the instances of LP fail to exhaust the linguistic phenomena. Theorists are interested in a multitude of phenomena that go beyond the basic interactions covered by LP (historical change, acquisition, deficit, evolution, etc.). Indeed, there are no doubt many cross-linguistic generalizations concerning the instances of LP that a theory should explain rather than merely register. Still, I take LP to be basic in the sense that it captures the contribution of competence (“knowledge of language”) to performance (using the knowledge to “do something”), and the interplay of morphophonemic, semantic, syntactic, and pragmatic properties and how they are both used and perceived (I shall largely take morphophonemic properties as given). If any theory has a handle on that, then it is in good standing and should constrain what explanations one is minded to offer elsewhere. To be concrete, consider a speaker forlornly remarking (1) to her friend, following the fall of raindrops on her umbrella on the beach: (1) It’s raining. Let us assume that the relevant syntax underlying the sentence type used is as in (2) (the precise details won’t matter): (2) [TP It [T’ is [ProgP ing [VP V+rain [N ]]]]] So a speaker employs (2) in the morphophonemic guise of (1). Let us further suppose that its interpretation is (3):

   


(3) (∃e)[raining(e) ∧ PRESENT(e)]¹ Sentence (3) fails to encode what the speaker says, however, for the speaker is talking of her location, not no location at all. In effect, therefore, the speaker asserts a content of the following form: (4) (∃e)(∃l)[raining(e) ∧ PRESENT(e) ∧ AT(e, l) ∧ HERE(l)] The point here is that, if (1) is interpreted as saying (3), then the speaker will have said something true so long as it is presently raining anywhere at all, even if it is not actually raining on the beach (the drops on the umbrella were coming from a boy’s water pistol). Thus the speaker is read as saying (4), whose truth conditions are location-sensitive. The phenomenon of someone saying something fractionates via an apportionment of explanatory responsibility. Syntax has responsibility for (2); semantics has responsibility for (3); and pragmatics has responsibility for (4), given (3). The respective theories interact, of course, just as the various aspects of linguistic phenomena occur together. In general, syntax is the form that is interpreted; semantics is the interpretation that imposes constraints upon what one can say with the form; and what is said is what is truth-evaluable (or satisfiable) and hence can be believed, doubted, asserted, and so on. Two clarificatory remarks are in order. First, the example of weather reports and the distinctions made above are controversial in various ways. One major point of contention concerns the locative variable in (4), namely whether it is part of the semantics proper, interpreting a syntactically realized covert item, or whether it is a pragmatically supplied element of interpretation, as I have assumed. Here is not the place to resolve this issue. Suffice it to say that everyone makes the kind of distinctions I have suggested; where theorists differ is on the nature of the explanation of phenomena: whether a locative construal of weather reports is pragmatically supplied or semantically determined, for example. 
What is agreed is that the fixation of the sayable is not solely a linguistic matter but, in general, a function of linguistic and extralinguistic factors. Nothing about to be argued in what follows presupposes any contentious position in this regard, although I do favor a thin semantics that is based on a minimal syntax, and so I take what is said to be richly pragmatically determined (cf. Recanati 2010; Collins 2020). One may disagree and still accept the position to be offered on intuitions.
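The difference in truth conditions between (3) and (4) can be made concrete with a small model. The following is only a minimal sketch under stated assumptions: the `Event` record, the function names, and the place name "Bergen" are hypothetical illustrations, not part of the analysis itself, and the sketch takes no stand on whether the locative element is semantically or pragmatically supplied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str       # e.g. "raining" (hypothetical label)
    present: bool   # whether the event holds at utterance time
    location: str   # where the event takes place


def bare_reading(events):
    """(3): there is an event that is a raining and is present."""
    return any(e.kind == "raining" and e.present for e in events)


def located_reading(events, here):
    """(4): there is a present raining event located at the speaker's 'here'."""
    return any(
        e.kind == "raining" and e.present and e.location == here
        for e in events
    )


# Water-pistol scenario: it is raining somewhere (hypothetically, Bergen),
# but the drops on the beach come from a water pistol.
world = [
    Event("raining", True, "Bergen"),
    Event("squirting", True, "beach"),
]

print(bare_reading(world))              # True
print(located_reading(world, "beach"))  # False
```

On this toy model the bare content (3) comes out true while the located content (4) comes out false, which is why the speaker on the beach is read as saying (4) rather than (3).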

¹ Tense can be handled in various ways, for example as a function of the state of affairs picked out by the structure it merges with (here, the ProgP), or as an event that encodes that state of affairs and is present/past.


Second, by “x constrains y” I mean that x limits the possible properties of y, and so specifies certain invariances on how y is realized. So, syntax constrains semantics, which constrains what is said. Take the syntax of a language (as realized in the mental states of a population of speakers-hearers) to be fixed, constrained by universal grammar as a biophysical property of the human brain. Semantics is constrained by syntax in the sense that what interpretation a sentence can have is limited by its syntax. Thus the syntax in (2) fixes (3), so that the interpretation specifies an event—one of raining that happens presently and involves no participants (it is pleonastic, and hence it does not refer to the environment; indeed, rain projects no arguments). These syntactically imposed conditions are invariances regarding the possible interpretation. At the next level, the interpretation imposes constraints, and so invariances regarding what can be said. Thus, every utterance of (1) must truth-conditionally involve a raining event that happens at the present. What is left open is the location of the event and its compass (how encompassing here is depends upon what one intends to say). So, a theory of syntax is a competence theory of constraints upon interpretation, that is, the structures made available for use via a cognitively realized combinatorial system. The structures massively outstrip usability conditions and are not explicable by interpretability conditions (syntax is its own system, one not designed for, or otherwise reducible to, semantics). These facts do not belie the general constraints view I am offering; they only indicate that syntax is an “autonomous system” per standard generative assumptions (see further in this chapter). A theory of semantics is a theory of interpretations, that is, the interpretations that syntactic structures can receive independently of communicative effects or intentions and context. 
So, an interpretation is an invariant contribution, from language alone, to what is said. We may also assume that interpretations are compositional (homomorphic) functions defined over syntax. So, given a combinatorial syntax that delivers Ss, there is a homomorphism h, such that for every syntactic object F(α₁, …, αₙ), there is an operation O such that h(F(α₁, …, αₙ)) = O(h(α₁), …, h(αₙ)). That is one way of saying that, at the syntax–semantics interface, the interpretation of the syntax should respect its constituent structure. What remains in question is whether h(F(α₁, …, αₙ)), qua O(h(α₁), …, h(αₙ)), as it were, is what someone literally says by uttering S with the form F(α₁, …, αₙ), or even if it is the kind of thing that can be said in the appropriate sense. So, the interpretation of syntax may or may not determine what is said. I assume it mostly does not. Suppose that all this is on the right lines. The standard evidential base for theories in both syntax and semantics is intuitions. As we shall now see, the received view in both fields is that the intuitions are different in kind.
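The homomorphism condition on interpretation can be illustrated with a toy fragment. This is a sketch under loudly stated assumptions: the two syntactic operations ("Pred" and "Neg"), the lexicon, and the set of sleepers are all hypothetical, and the fragment is far simpler than any serious syntax. The only point it is meant to show is that the interpretation of a complex object is computed from the interpretations of its immediate constituents by an operation paired with the syntactic operation that built it.

```python
# Hypothetical model: who sleeps in the toy world.
SLEEPERS = {"mary"}

# Hypothetical lexicon: h on atomic items.
LEXICON = {
    "Mary": "mary",
    "Sam": "sam",
    "sleeps": lambda x: x in SLEEPERS,
}

# Each syntactic operation F is paired with a semantic operation O.
OPERATIONS = {
    "Pred": lambda subject, predicate: predicate(subject),
    "Neg": lambda proposition: not proposition,
}


def h(node):
    """Interpret a syntactic object.

    Atoms are strings; complex objects are tuples (F, a1, ..., an).
    The recursive clause is the homomorphism condition:
    h(F(a1, ..., an)) = O(h(a1), ..., h(an)).
    """
    if isinstance(node, str):
        return LEXICON[node]
    label, *daughters = node
    return OPERATIONS[label](*(h(d) for d in daughters))


print(h(("Pred", "Mary", "sleeps")))          # True
print(h(("Neg", ("Pred", "Sam", "sleeps"))))  # True
```

Nothing in the sketch settles whether the value h delivers is what a speaker literally says; it only shows what it is for interpretation to respect constituent structure.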

   


6.3 The distinction between syntactic and semantic intuitions Chomsky (1957) noted that native English speakers can recognize the different status of the cases in (5), even though both sustain an interpretation in the sense in which a hearer would easily understand that what would be said by an utterance of (5b) is akin to what would be said by an utterance of (5a). (5) a The child seems to be sleeping b *The child seems sleeping Similarly, Chomsky noted that speakers recognize a difference between the cases in (6): (6) a Colorless green ideas sleep furiously b *Furiously sleep ideas green colorless On the face of it, both these latter strings lack an interpretation, at least a coherent one, but speaker-hearers still register one as being ill-formed word salad and the other one as being okay, albeit semantically deviant. A ready explanation for this is that speakers can recognize the form of (6a) to be okay, such as is instanced elsewhere, whereas the form of (6b) is simply out: (7) a Friendly young boys play nicely b *Nicely play boys young friendly One might think, therefore, that competent speaker-hearers have grammaticality intuitions that in effect divide the superset of strings made from a class of formatives (lexical items) into two sets, the well formed and the ill formed. This thought is mistaken on a number of fronts, as we shall see, but it seems at least that intuitions can and do target well-formedness independently of interpretation. Similarly, semantic intuitions target facts of interpretation that appear not to be underwritten by well-formedness conditions. Sentence (6a) offers a putative example where a structure is well formed according to intuition, but lacks an interpretation. Conversely, it looks as if interpretations are available even where the string is ill formed. Sentence (5b) is a case in point. 
This is not an ideal example of the putative phenomenon, however; for the interpretation is not systematic in the sense that (5b) is read by analogy with the complete (5a); (5b) merely has some words missing. As we shall see, island violations and vacuous quantification offer more significant cases. Independently of such cases, it seems that semantic facts are lexically dependent in a way syntactic facts are not. Suppose that competent speaker-hearers can recognize the following strings to have the properties mentioned in the brackets:


(8) a Black beetles are colored [analytic] b Dogs aren’t dogs [contradictory] c If everyone was at the party, then someone was at the party [valid/analytic inference] d That is a dog [true, when a dog is demonstrated] e Mary loves Joe/Joe loves Mary [distinct truth conditions] f Mary resembles Sam/Sam resembles Mary [same truth conditions] The point here is that the intuitions turn on the interpretation of particular lexical items in a way syntactic intuitions appear not to. Such a difference is predicted, of course, if the latter simply register whether a string is well formed rather than telling us what interpretation, if any, it might have. Semantic intuitions, to the contrary, are about what interpretations strings might have, and strings cannot possess any interpretation independently of the properties of the constituent lexical items. There is a lot of complexity here I have elided in order to highlight what does appear to be a striking distinction between syntactic and semantic facts. It is my business to undermine the idea that these different kinds of fact engender fundamentally different kinds of intuition qua evidence. I shall start on this task by thinking about the bare idea of a semantic intuition and what form it takes. I shall assume that we may distinguish between semantic intuitions that turn on language alone, which are our concern, and semantic intuitions that turn on some wider conceptuality. I do not assume that this is an easy task, but I do assume that it can be done, at least for a range of clear cases.

6.4 Semantic intuitions Generally the common philosophical view of intuitions is that they are propositional attitudes, much like belief or judgment, or dispositions towards such. What (if anything) distinguishes intuitions from other attitudes is simply that they are taken to be more or less non-demonstrative, non-discursive, typically automatic, and non-inferential (cf. Pust 2000; Fischer and Collins 2015). That is, they are attitudes the entertaining of which is not grounded or even associated with much cognitive processing. One finds oneself with intuitions rather than arriving at them. It makes little sense, for example, to say, “After considering the ins and outs of the case, I’ve arrived at the following intuition.” Suffice it to say that philosophical puzzles about intuitions in general are not my concern. Such puzzles mainly orbit around the question of how central intuitions are to various philosophical programs and what special epistemic status they might possess; instead, I want to focus on what intuitions are supposed to be, when taken as evidence in linguistic theory. Although there are many disputes, it is broadly accepted in the philosophical literature that the basic form is as follows:

(AV) A intuits that F(S),

where “intuits” is some non-inferential quasi-experiential propositional attitude, S is some linguistic material (+/- context), and F is some semantic property that is the target of inquiry (see e.g. Katz 1981; Soames 1984, 1985; Schütze 1996; Devitt 2006b). Here are some examples: (9) a A intuits that “Mad dogs and Englishmen . . .” has narrow- and wide-scope readings of the adjective. b A intuits that “He loves him” lacks a bound reading. c A intuits that “Polar bears are white” is true over exceptions. d A intuits that “Sam boiled the soup” entails “The soup boiled”. e A intuits that “It’s raining” uttered in c means that it’s raining in c. What the semanticist does, therefore, is infer the semantic properties of the language from data points of the kind exhibited, where an adequate theory is one that entails such data and meets whatever other desiderata are in play. Of course, a speaker-hearer does not volunteer such metalinguistic judgments. A theorist must know what interests her in order to ask the native speaker the right questions. Furthermore, most work in semantics is not based upon any empirical survey of intuitions, nor need it be. The idea here is that, were one to elicit the relevant judgments, they would cohere with one’s own. In this sense, the theorist’s own intuitions go proxy for a survey of the relevant population that would issue in the same judgments. I shall simply assume that this universalizing aspect of linguistic methodology is in good standing (cf. Sprouse and Almeida 2012b). As before, I shall assume here that one should distinguish between semantic intuitions that are due to language alone (syntax plus lexical content) and wider conceptuality. This propositional attitude view is fundamentally mistaken. The next section will explain why.

6.5 Why the attitude view is wrong Although I take the attitude view to be standard, various aspects of it have been questioned (see Fiengo 2003; Fitzgerald 2010; Ludlow 2011; Mayes and Gross 2013). Here I shall highlight two insuperable problems with the view. One pertains to the supposed content of the attitude itself; the other pertains to the metalinguistic conceptual application with which it credits competent speaker-hearers. Thus my complaints are directed not so much at the idea itself of having an attitude toward a sentence or toward what it might mean, as they are at the
peculiar properties that such an attitude and its content are supposed to possess. Let us consider the first complaint first. It is simply false that judgments or intuitions of the kind that concern theorists are required to be in any sense immediate, non-inferential, or quasi-experiential. Many of the relevant intuitions might have such properties, but this is a matter of no particular evidential import. If, however, intuition simply means that such properties are realized by some relevant attitude, then we can drop the word as misleading. I will shortly get to just what we want from intuitions and why the word is appropriate, albeit liable to mislead. Pro tem, I shall show why the relevant intuitions are not always obvious or immediate, and why obviousness doesn’t matter, anyway. As a first example, consider our intuitions about scope options. Where there are two scoping phrases, it is relatively easy, ceteris paribus, to identify the relevant ambiguity: (10) a Every teacher loved some pupil b Every train wasn’t late c You can fool some of the people most of the time Intuitions here, let us say, are quite immediate and robust. Any adequate theory must accommodate them, but no steer is provided, of course, on how they are to be accommodated, whether, that is to say, the explanation should primarily be syntactic, semantic, or pragmatic, or some mix of the factors. Be that as it may, the evidential status of intuitions becomes much less clear when three scoping phrases occur, including an indefinite (or “weak”) DP (a/an/some/three N). Taking a leaf from Ludlow (2011: 78–9), consider (11): (11) Each teacher overheard the rumor that a student of Bill’s had been caught cheating The acceptability of this sentence raises the familiar issue that indefinite DPs (e.g. a student of Bill’s) can scope out of islands (such as clausal complements of nominals) in a way definite and “strong” DPs cannot. Park this fact. 
The fact to which I wish to draw attention is simply that what readings (11) may support is unobvious, not non-inferentially immediate. Fodor and Sag (1982), for instance, claim that constructions like (11) lack an intermediate reading. Such a reading is where an indefinite DP (a student of Bill’s) that occurs lowest in a sequence of DPs is read as scoping intermediately between the two remaining DPs. Fodor and Sag suggested that indefinites scope either low or high, never intermediately. In intuitive terms, the hypothesis here is that (11) does not have the reading in (12): (12) For each teacher, there is a student of Bill’s about whom the teacher has overheard the rumor that they cheat.

   


This putative fact was taken to support the thesis that indefinite DPs are ambiguous between a quantifier reading, where they occur low and are dependent upon a higher DP, and a referential singular-term-like reading, which accounts for DPs scoping high outside islands and not taking intermediate positions—that is, a reading in which a referential term always scopes highest. It strikes me and most other native speakers, it seems, that the facts here are not obvious. One way of aiding intuition is to ask whether the truth of (11) is inconsistent with there being more than one student of Bill’s about whom there was a cheating rumor. Sentence (12) has a reading that maps students of Bill’s onto teachers, and so can be true if there is more than one student (each teacher hears a rumor about a different student). The question, then, is whether (11) has such a reading. Whatever the ultimate answer might be, I take it to be obvious that no clear, non-inferential intuition will provide the answer. If intuitions do substantially bear on the matter, they do so in ways that are independent of their putative immediate, non-inferential character. In the literature, the issue is not approached by table-thumping commendations of one set of intuitions over another; rather, appeal is made to a mix of theoretical generalizations and clearer cases (cf. Reinhart 1997; Kratzer 1998). Thus, if an indefinite DP ambiguity hypothesis is required to explain the semantic properties of (11), the ambiguity must be a general feature of indefinite DPs rather than a peculiar property of (11) for all indefinite DPs scope out of islands. It is straightforward, however, to find intermediate readings for indefinite DPs: (13) a Each student has to come up with three arguments which show that some/a condition proposed by Chomsky is wrong. b Most linguists have looked at every analysis that solves some/a problem. c Most producers admire every song that has been written by some/an artist. 
d Every literature professor dislikes every novel that some/an author wrote. In each of these cases, the lower indefinite DP can be read as dependent on the highest DP, but higher than the surface intermediate DP. Thus, perhaps the natural reading of (13a) is one where (13a) is true if and only if each of the students picks her own Chomsky condition and offers three arguments against it. Likewise, (13b) is naturally read as stating that linguists are thorough in narrow areas, most of them having a favored problem that they have exhaustively analyzed; mutatis mutandis, the same goes for (13c–d). What is a theorist to do? Other factors can be considered. So, it is pretty plain that the hypothesis that indefinite DPs are ambiguous is explanatorily inadequate, failing to predict, let alone explain, why the intermediate readings in (13) are okay. One might think, therefore, that ad hoc ambiguity hypotheses should be avoided;
instead, all scope positions should be available, and anomalies need to be explained as departures from such principles. In this light, the intuitions concerning (11), regardless of how immediate they might be, would appear to be simply wrong. This conclusion is problematic, as we shall see, for maybe intuitions are not to be considered right or wrong; at any rate, the immediacy of the intuitions counts for nothing when it comes to the general issue of the scope behavior of DPs. It is instructive here to note how intuitions routinely lead one astray. Consider: (14) a The shooting of the elephant shocked the children b The men asked their wives to talk to each other c Dogs must be carried on the subway These sentences appear to lack relevant readings, made explicit if we substitute soldiers for elephant, children for wives, and ID papers for dogs. The readings are available; they just result in absurdity given the nature of the world, or how we expect it to be, but presuppositions about the world are not constitutively semantic. In such cases, a speaker-hearer needs to filter out worldly conditions on interpretation in order to understand what is semantics proper. No such filtering can be achieved in an immediate, non-inferential manner, of course. As for (11) in particular, the confounding factor appears to be that a student of Bill’s is easily read as referring to a unique student, for the genitive carries a default or presupposed uniqueness construal, under which a student of Bill’s is read as Bill’s student. It thus becomes hard to read the DP as scoping under each teacher. Substituting a student for the DP—dropping the genitive, that is—makes the intermediate reading far more natural. The first problem for the attitude view just rehearsed focused on the supposed nature of the attitude of intuition: that it is supposed to relate the speaker-hearer to linguistic properties in an immediate, non-inferential, quasi-experiential manner. 
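To make the contrast between the candidate readings of (11) concrete, here is a minimal model-checking sketch. It rests on assumptions that are mine rather than the literature's: the teacher and student names are hypothetical, and the relation `overheard` simply records which teacher overheard a cheating rumor about which student of Bill's. The narrow-scope reading, on which the rumor itself has existential content, is not modeled.

```python
def wide_scope(teachers, students, overheard):
    """Indefinite scopes highest: one student such that every teacher
    overheard the rumor about that student."""
    return any(
        all((t, s) in overheard for t in teachers)
        for s in students
    )


def intermediate_scope(teachers, students, overheard):
    """Reading (12): for each teacher there is some student such that the
    teacher overheard the rumor about that student."""
    return all(
        any((t, s) in overheard for s in students)
        for t in teachers
    )


# Hypothetical scenario: each teacher heard a rumor about a different student.
teachers = {"Ann", "Bo"}
students = {"Cy", "Di"}
overheard = {("Ann", "Cy"), ("Bo", "Di")}

print(wide_scope(teachers, students, overheard))          # False
print(intermediate_scope(teachers, students, overheard))  # True
```

The scenario is exactly the one used above to aid intuition: there is more than one student of Bill's about whom there was a rumor. If (11) can be judged true in it, the intermediate reading is available; the sketch only fixes what is being asked, and cannot itself deliver the judgment.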
The second fundamental problem with the attitude view relates to the supposed content of the attitude. Recall that the idea is that speaker-hearers have intuitions that linguistic material (under various presentations) possesses or fails to possess linguistic properties relating to scope, ambiguity, binding, and so on. In effect the speaker-hearer has metalinguistic intuitions and we are interested in the true ones, which thus tell us the truth about the language. It is as if the informant were a fellow theorist, albeit implicitly. This picture is wrong-headed in at least two respects. First, speaker-hearers typically do not have the concepts that the attitude view attributes to them. This is obviously trivial when the concepts are theoretical inventions, but it applies generally in the following sense. The phenomena upon which intuitions can evidentially bear are not at all restricted to what the intuitions are about. As theorists, we are most often concerned with fairly complex properties of the language, not with what native speakers think about such properties (if they do). Take something relatively simple such as ambiguity, of which most speaker-hearers, let us suppose, have some sense. We are not interested in their folksy conceptual competence in this regard, however. For example, it takes me some time and patience to get quite sophisticated students to recognize that the cases in (14) are relevantly ambiguous. I have no interest in their pre-theoretical notion, which is a conflation of different factors. One can still garner ambiguity intuitions, of course, but these are ones that bear on ambiguity, not ones where the informant wields the concept of ambiguity that concerns the theorist, that is, an ambiguity predicated upon linguistic rather than worldly properties. Second, the attitude view is mistaken in its presumption that the theorist is concerned with true intuitions. If intuitions were metalinguistic judgments ascribing linguistic properties to linguistic material, then, of course, we would be interested in true intuitions, insofar as we want to know what linguistic material has what properties and we are treating the intuitions as evidence. The basic problem with this position is that it treats the informant, the actual object of study, as if she were a fellow theorist. Intuitions are to be explained, though, whether right or wrong in relation to some theoretical perspective. For example, speaker-hearers’ intuitions about intermediate indefinite scope and cases in (14) are “wrong,” but such intuitions require explanation as linguistic phenomena. To be sure, we do not take the intuitions to be revelatory of core linguistic competence, but this is precisely because we can explain them as being sensitive to extralinguistic factors such as expectations about the nature of the world (elephants don’t use firearms). 
Similarly, consider the familiar contrast: (15) a The horse raced past the barn fell b The onions fried in the pan burnt c The paint daubed on the wall stank Speaker-hearers reliably find (15a) unacceptable, but judge (15b–c) to be okay. If the theorist were only concerned with true intuitions, then the intuitions concerning (15a) would be dismissed. The intuitions, however, are highly significant. The subject of (15a) is treated as a thematic agent, whereas the subjects of (15b–c) are not. There is a further division. The onions fried in the pan creates a “temporary ambiguity,” just like (15a), between the onions being the subject of ergative fried and fried in the pan being a relative clause. The verb at the end does not produce unacceptability, however, precisely because the onions is not treated as the agent of the first verb. Sentence (15c) is even more clear, for the paint daubed on the wall is not even a sentence, that is, daub is non-ergative and requires an agentive subject (in the absence of an adverb such as easily forming a “middle”). We learn a lot about verb classes thanks to the pattern of intuition, including the “wrong” intuition.


6.6 The right view of semantic intuitions The right view of semantic intuitions simply involves ditching the metalinguistic condition and the immediacy condition of the attitude view. Thus, speaker-hearers’ semantic intuitions are not so much judgments about the properties of sentences as they are judgments about what one can say with a sentence in a context. This is not a property of a sentence, but a property of a speaker. It is the theorists’ task, not the informants’, to tease apart the factors contributory to the propositional contents available to be said, that is, to determine the invariance of interpretation that is the peculiar contribution of semantics proper. Thus, this peculiar invariance can be unobvious, for one is not immediately aware of an invariance. The theorist must root out generalizations that affect what can be said without being explicit in what is said. It might appear highly odd to think of intuitions as being about properties of speakers. The point, however, is that the intuitions concern what a speaker can say with a sentence rather than what the properties of the sentence are, independently of anyone saying something with it. Since sentences only concern us as potentially used to say something, the difference between the sentence and what we can say with it may seem academic. At any rate, there is no harm in continuing to talk of intuitions as concerning sentences, not speakers, so long as one is thinking of the sentences as being used by a generic subject to say something. By way of analogy, think of a hammer for knocking in a certain kind of nail. One readily speaks of the hammer in this functional manner, but the hammer does not do anything; it is merely an instrument for our knocking in nails. Likewise, a sentence distinguished from us does not mean anything, but is rather reflective of a range of things we can say that is constrained by the sentence’s syntax, much as the design of a hammer constrains what can be done with it. 
I can use a sentence to clear my throat or to test a microphone, much as I can use a hammer to keep a door open. Neither role depends on or tells us about the respective instrument qua the instrument it is. The unobviousness of the invariance of what can be said, which is the peculiar contribution of semantics, bears emphasis. Consider the ambiguity of (16): (16) Mary shot the elephant from Africa The adjunct may modify the DP the elephant or the VP shot the elephant. The ambiguity thus turns on whether we think of the adjunct as being an immediate constituent of the DP or an immediate constituent of the VP; the surface properties cannot help us, because the adjunct is in an adjacent position to both the DP and the VP. What we do know, however, is that the adjunct cannot modify the DP subject— (16) is not true, if Mary, who is from Africa, shot an Indian elephant in Kansas. All this is pretty clear, but what we are after is the scope of the adjunct. Now consider (17): (17) The doctor examined the patient naked

   


On the face of it, the secondary predicate naked can modify the object DP or the subject DP, which appears to contradict the apparent insight from (16), namely that naked is too far from the doctor to modify it. Yet now we need to notice that (17) is not true if the doctor, who is presently naked, examined the patient yesterday fully dressed. In other words naked cannot modify the subject in itself, but only as an agentive participant in the examination event. In this light, we now see that (16) does have a reading where from Africa may modify Mary, only not as a relative clause, but more as a property agents may possess as participants in events. Origins are not such properties. Even a simple parade example such as (17) throws up a good deal of complexity. The semanticist is interested in the constraints on what can be said as abstract conditions on truth, which are not clear either in any given case or in how any one case relates to another. Still, the reason why intuitions about what can be said with a sentence count as good evidence for semantics is that semantics just is what shapes what can be said as an invariant abstract condition, and so the condition indirectly shapes the very intuitions. Equally, since what can be said is an interaction effect of linguistic and extralinguistic conditions, we expect and find all kinds of interference between the semantic properties and what is said that is open to reflection. Indeed, if the general picture offered in section 6.2 is correct, the interference is necessary, since the semantic properties determine less than a content to be entertained.

6.7 Syntactic intuitions A reciprocal naïve attitude view of syntactic intuitions is out for the same reason as the semantic view is: many relevant intuitions are non-immediate and speaker-hearers lack the relevant concepts. For example, both cases in (18) appear to be unacceptable, although traditionally (18a) is deemed less acceptable than (18b), the difference turning on whether the wh-item is an adjunct (where) or an argument (who): (18) a *Where did Mary wonder whether Sam visited? b ?Who did Mary wonder whether Sam visited? Both are so-called “weak island” violations, where the wh-item moves from a wh-island. Yet (18a) appears to contravene another principle that (18b) satisfies, namely the empty category principle, which, for our purposes, says that the site of a moved item (a trace or a copy) must be the sister of a head, such as a verb, or be “close” to the moved item or another copy of it. The precise details do not matter; the basic point is that there is a difference between the cases, and this is due to the less degraded case contravening fewer principles than the more degraded case (cf. Chomsky 1981; Rizzi 1990).


The interest of this case for us is that the matter is somewhat subtle; the ordinary speaker does not have the right concepts readily to describe the difference in acceptability. This becomes clear when we note that the difference between the cases in (18) is not absolute, but relative to the reading of the adjunct. Construed downstairs, as asking after the location of the visit, (18a) is entirely out. Construed upstairs, as asking after the location of the wondering, the sentence is fine, albeit a bit odd in its content. Sentence (18b) does not admit any such contrast, because wonder has no unsaturated argument position for who to fill in place of the argument position of visit. What we want from speaker-hearers, therefore, is for (18a) to be ruled out on a downstairs reading but okay on an upstairs reading, whereas (18b) only has one reading and is marginal on that reading when compared to the out reading of (18a). The same issues arise with so-called “late merge” effects: (19) a Whose accusation that Billᵢ accepted did heᵢ later reject b *Whose accusation that Billᵢ stole the car did heᵢ later reject Under reconstruction, both cases should be Principle C violations, Bill being bound by the pronoun in the pre-movement object position of reject; but only (19b) appears bad (Lebeaux 1988). The phenomenon requires a good deal of attention from anyone whose intuitions might be relevant, and the explanation remains unclear. The same applies to copy raising constructions: (20) a Sam looks as if he works out b Sam looks as if his wife is a good cook c ?Sam looks as if every semi-stable elliptical curve is a modular form The issue here is whether acceptability turns on the presence of an experiential condition on the subordinate clause such that one could garner evidence from the appearance of Sam that the clause is true. In this regard, (20c) looks out. 
Imagine, however, that Sam is a mathematician known for trying to prove the conjecture that every semi-stable elliptical curve is a modular form. Her appearance might, indeed, context willing, lead one to think that the conjecture is true.

After Chomsky’s (1965) introduction of the notion, it is now widely recognized that speaker-hearers have acceptability intuitions, not grammatical ones. Speaker-hearers are not fellow theorists who make theoretical judgments about their language, but informants who are the very object of study and lack the relevant linguistic concepts to make grammaticality judgments. At best, they can report on how they find the presented strings.² Yet even the notion of acceptability is too blunt. As we have seen, many of the relevant intuitions are not of the “yes-acceptable–no-unacceptable” variety, as if an informant were a Roman emperor in the Colosseum. The intuitions, rather, concern the possible pairings of interpretations and sentences. An intuition of x being acceptable, therefore, corresponds to the speaker-hearer’s being able to assign some interpretation, whereas an intuition of x being unacceptable corresponds to the speaker-hearer’s not being able to assign any systematic interpretation. In general, then, we may think of speaker-hearers as having intuitions about what is said, which is partly shaped by semantic interpretation, which in turn is partly shaped by syntax. The theorist, therefore, is looking for structural invariances in interpretation showing up in the judgments of competent speaker-hearers as to what is possible to say with a given linguistic form. At no point do speaker-hearers have direct intuitions about syntax, just as they have no real direct access to semantics. The only direct access is to what they could say. The theorist must sieve such judgments to arrive at the linguistic gold. In the remainder of the chapter I shall defend this position in the light of two phenomena that appear to refute it.

² Likewise, construing intuitions as judgments about whether a string belongs to a language or not is far too crude. First, speaker-hearers have no need of a concept of the language to which strings either do or do not belong. The import of an intuition is how the speaker finds the sentence, not her assessment of how other speakers of the language might find it. In this sense, appeal to a language appears to be a misleading abstraction at best. Second, the notion of judging a sentence as belonging to a language or not is far too coarse-grained to be of much interest. We are interested in how a string might be construed. At the least, then, we are interested in intuitions that inform us of sentences having this or that interpretation dependent upon this or that structure. No sentence could belong to a language without consideration of how it might be interpreted.

6.8 Grammaticality without interpretation Consider again the famous

(21) Colorless green ideas sleep furiously.

A natural way of describing this case is as one of grammaticality without interpretation. In other words, speaker-hearers have the intuition that (21) is syntactically okay, but gibberish. If so, then this goes against my claim that semantic and syntactic intuitions are in fact intuitions about what can be said; for, if I am right, then a competent speaker-hearer must find (21) interpretable in order to recognize that it has a syntactic form. This is exactly so. The problem with (21) is not that it is meaningless, but that it is contradictory or necessarily false. It could be neither, if it lacked an interpretation. Thus we can treat (21) as simply a type whose tokens are unfit to express a truth, but it remains perfectly meaningful, for it can be denied:

(22) a It is false that colorless green ideas sleep furiously, for there are no green ideas
b Colorless green ideas don’t sleep furiously
c There are no colorless green ideas that sleep furiously
d Nothing is a colorless green idea or sleeps furiously

The relevant intuition for (21), then, is not that it is well formed but has no interpretation, but that it has no true interpretation, or is contradictory, which is a content supported by the very syntactic form. It is worth emphasizing that Chomsky’s original intention in formulating this sentence was to show that probability of occurrence was no guide to either semantics or syntax. The special feature of (21) is that its transitional probabilities over adjacent word pairs is practically zero. Still, it is okay. Thus, if one thought that interpretation was dependent upon use or likelihood of use, then (21) would prima facie refute one’s position. I have no interest in defending a use theory of any sort. Equally, any autonomy or independence of grammar thesis worth defending is consistent with the position I am endorsing. I am not claiming that semantics explains syntax; on the contrary, I think that syntax constrains possible semantic interpretations. My claim is only that the evidence for syntactic hypotheses via intuitions goes through a speaker-hearer’s assessment of what can be said with the linguistic form. A deeper concern is raised by genuine nonsense of the kind Lewis Carroll enjoyed (cf. Higginbotham 1989): (23) All mimsy were the borogoves We cannot reckon that (23) has an interpretation, because it features nonsense words, and so it cannot be false either. Yet it still appears to be well formed in a way (24) is not: (24) Borogoves the were mimsy all What we have to say here is that (23) has a partial interpretation thanks to its functional closed-class items being interpreted; only the open-class categories are nonsense in Carroll. Of course, such partiality is not ad hoc. Sentence (23) would elicit no semblance of well-formedness if its functional items were replaced by open-class expressions; its appearance of well-formedness is exactly proportional to its functional items being interpreted. 
Thus (23) has a perfectly good, albeit schematic, interpretation. Speakers cannot assert a definite content with it, but it is as if we know the constraints the semantics and syntax impose: a speaker must be saying that some property was universally possessed by every member of some class of thing at a past time, its being left open whether the property is a stage- or an individual-level one.

   


There might be other phenomena relevant to the possibility of grammaticality without interpretation. As far as I can see, however, all cases will fall under the two options discussed: an extant interpretation, albeit a contradictory one, or a partial interpretation due to the preservation of the interpretation of the functional items in an appropriate grammatical position. Grammaticality in the face of out-and-out gibberish strikes me as utterly out, and I can not conceive of what it might look like.

6.9 Interpretation without grammaticality A more interesting phenomenon, although much less discussed, is the apparent possibility of interpretation without grammaticality. If actual, this phenomenon would belie my central claim from the opposite direction to the above problem, namely that there could be linguistic interpretations that are not relevantly constrained by syntax, or so it would appear. I shall look at two phenomena: islands and vacuous quantification. Islands, as briefly discussed above, are syntactic environments that block the movement of wh-items and the scoping out of strong DPs. For our purposes, we may sideline the many complexities islands pose, for my use of the constructions rests upon their most general properties.³ Consider:

(25) a *Who does Bill love Mary and?
b *Who does Mary doubt the fact that Sally loves?

The apparent syntactic interdiction here is that a wh-item cannot move from a coordinate argument position or from within “a complex NP” (here a DP whose nominal takes a clause). Note that these cases are different from ones such as The child seems sleeping, where words are simply elided, or from loose talk and slips of the tongue. We can find interpretations for many kinds of mistaken or distorted structures because it is clear what structure was intended, which serves as the template, as it were, for understanding the deviant case. The peculiarity of islands is that they would have a precise and perfectly coherent interpretation determined by the syntax, if only a grammatical condition didn’t apply. Thus the respective interpretations for (25) are easily specified:

(26) a (Wh x)[Bill loves Mary and x]
b “Which person is such that Bill loves Mary and her?”

(27) a (Wh x)[Mary doubts the fact that Sally loves x].
b “Which person is such that Mary doubts the fact that Sally loves her?”

It will be noted that the pronouns in (26b–27b) are not resumptive, and such pronouns are not possible in (25b), even though they are typically (but not always) in complementary distribution with traces or copies. The general point, though, is that we know what the interpretation would be, were the grammar to match the properties of logicese or always allow resumptives to be spelled out for traces or copies (cf. Kayne 1981). The crucial point about islands, however, assuming that they are genuine linguistic phenomena, is that the interpretations are not linguistically licensed, no matter how specific the interpretations would be, were they to be licensed. We can work out the interpretation, not so much because it is otherwise coherent and we know the relevant syntax, but because we can imagine a different syntax, where any “variable” can be bound by a prefixed operator. Such is first-order logic and lambda calculus, which lack the peculiar constraints of natural language. So, islands are a case of interpretation without grammaticality, but the interpretation is not one that is determined by the properties of the language; for, ex hypothesi, such properties do not determine any interpretation.

³ I assume, in particular, that islands are syntactic phenomena rather than, say, effects of processing (for an extensive discussion, see Sprouse and Hornstein 2013). If this proves to be wrong, then my position is not harmed, for such an outcome would simply render otiose my attempt to defuse the problem that islands as syntactic phenomena pose to my position on the indispensable role of interpretation in syntactic intuition.

Vacuous quantification offers a similar case. Such quantification occurs where an operator or quantifier prefixes a formula that contains no variables over which it may scope. In standard systems of first-order logic and lambda calculus, vacuous quantification is either allowed or disallowed, according to choice; for, while its admission streamlines the syntax somewhat, it is logically redundant. For example, the following equivalence holds, where the left-hand side is vacuous:

(28) (∀x)[p] ↔ (∀x)[x = x → p]

So, if there is at least one object (a non-empty domain), then “(∀x)[p]” is true; if there are no objects (the domain is empty), then “(∀x)[p]” is false. Such arbitrariness might seem silly, but it is perfectly coherent. Consider (29):

(29) a (∀x)[2 + 2 = 4]
b “Everything is such that 2 + 2 = 4”

Since 2 + 2 = 4, then, of course, everything is such that 2 + 2 = 4. Vacuous quantification here is simply a weird way of thinking or talking: instead of just making the assertion one endorses, one states that everything has the property of that assertion holding. Vacuous quantification, however, is not realized in natural language. Sentence (29b), for instance, has a vacuous interpretation, but does not contain a quantifier

   


with no variable (trace/copy) to bind, not any more than Everything is red does.⁴ More explicitly, wh-items are quantifiers in the sense that they scope out of argument position. Therefore, if natural language admitted vacuous quantification, we should expect to find vacuous wh-interpretation.

(28) *Who does Bill love Sam?

The reason why such cases are out follows directly from wh-scope’s being a function of movement (where the wh-item is not in situ): there is no argument position from which these wh-items could have moved. At any rate, whatever the explanation, the cases are utterly unacceptable. Yet, like islands, they have interpretations:

(29) a (Wh x)[Bill loves Sam]
b “Which person is such that Bill loves Sam?”

Suppose that Bill loves Sam. In this case, (29b) would be truthfully answered by one’s citing any person whatsoever. Suppose that Bill doesn’t love Sam. In this case, (29b) could not be truthfully answered; citing anyone would result in falsehood. As in the other cases, the moral to draw here is not that there is interpretation without grammaticality, but that the structures have no linguistically determined interpretation. We can find an interpretation that appears to fit the syntax, but only via an association with another language, such as first-order logic, that can feature scope-taking items without variables to bind.

⁴ If we take the quantifier DP to move obligatorily, then it will move to an adjoined position (via QR) and bind its trace/copy in SPEC-TP.

6.10 Conclusion My claim has been that intuitions are never direct evidence for some relevant linguistic hypothesis, either syntactic or semantic. Clearly, such a view is not inimical in any way to the truism that we have intuitions in the sense of having ready judgments about what we say and can say in this or that context with some linguistic material. It is the theorist’s task to tease out how syntax and semantics contribute to what is sayable by way of constraining what we judge to be sayable. The speaker-hearer has no peculiar authority or epistemic access to some inner linguistic realm. We are, in effect, after the principled causes of why a speaker-hearer finds P acceptable, but not Q, as due to linguistic properties rather than extralinguistic ones. The mere fact that this is possible to some extent is a striking feature of our general linguistic competence. It is strictly superfluous, though. In principle, a linguist could garner all the information she requires about the language from an elicited production. The eschewal of intuitions, however, would introduce great complexity without any apparent benefit. After all, it is no mean feat to design experiments in such a way as to produce the construction one is after, or for its absence to be predictable. Instead one may happily consult, or reflect on, one’s own competence. We need a reason to think that we will be bound to be led astray not to help ourselves to the evidential convenience of our being conscious of what we can say. Ultimately, this fairly prosaic fact is the justification for the use of intuition in both syntactic and semantic theory.

Acknowledgments My thanks go to Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker for organizing the “Linguistic Intuitions, Evidence, and Expertise” conference and for many helpful remarks on the chapter. I also thank Georges Rey, Steve Gross, and Michael Devitt. A reviewer for OUP also provided valuable comments.

7 Intuitions about meaning, experience, and reliability Anna Drożdżowicz

7.1 Introduction Speakers’ intuitive judgments about meaning are commonly taken to provide important data for many debates in philosophy of language and pragmatics. Michael Devitt (2012, 2013) calls this “the received view.” The armchair practice of appealing to intuitive judgments in philosophy has been recently criticized (Weinberg 2009; Machery 2017) and, as part of this criticism, the extensive use of speakers’ judgments about meaning has not gone unquestioned either (Machery et al. 2004; Devitt 2012, 2013). Can the evidential use of such judgments be defended and justified? And, if so, how? In this chapter I discuss two strategies that aim to explain and justify the evidential role and methodological utility of speakers’ intuitive judgments about meaning. The first strategy is inspired by what is known as the perceptual view on intuitions, a view that emphasizes the experience-like nature of intuitions and has recently gained some prominence in debates concerning philosophical methodology (Bengson 2015; Koksvik 2017).¹ It proposes that speakers’ intuitive judgments are based on perceptual-like, conscious experiences of understanding that language users typically undergo when listening to utterances in a familiar language. In recent debates on the epistemology of language understanding, such experiences have been argued to provide prima facie, albeit defeasible, evidence for corresponding beliefs (and judgments) about the meanings of utterances, in a similar way to the one in which perceptual experience has been argued to provide evidence for perceptual beliefs (and judgments) (Fricker 2003; Brogaard 2018). The second strategy is a reliabilist one. It derives the evidential utility of speakers’ judgments about meaning from the reliability of the psychological mechanisms that underlie their production. Reliable judgments must be formed

¹ Related observations have also been made about the experience-like nature of syntactic intuitions (Textor 2009; Fiengo 2003; Pietroski 2008).

Anna Drożdżowicz, Intuitions about meaning, experience, and reliability. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Anna Drożdżowicz. DOI: 10.1093/oso/9780198840558.003.0007


via a normally operating, cognitive process, the function of which is to reliably produce true judgments about the retrieved meanings of utterances. This strategy is an instance of proper function reliabilism (for discussion of the proper function reliabilism in general, see Plantinga 1993; Goldman 2011). By being a product of such processes, speakers’ judgments can be argued to correspond to or covary with external facts about the meanings of utterances retrieved in the course of utterance comprehension. In a similar manner, the evidential uses of syntactic (Rey 2014b, this volume) and epistemic judgments (Nagel 2012) have been defended. One way to develop this strategy is to argue that intuitive judgments about meanings of utterances are generally reliable owing to monitoring processes intrinsically involved in speech production and comprehension (Drożdżowicz 2018; for a different account, see Cohnitz and Haukioja 2015). What are the merits of the two strategies for justifying the evidential status of speakers’ intuitive judgments about meaning? Is one of them more fundamental than the other? In this chapter I argue that we have strong reasons to prefer the reliabilist view. I support my claim by providing three parameters on which the reliabilist strategy fares better than the experience-based one. First, the reliabilist strategy can better capture the practice of appealing to such judgments (i.e. it has descriptive adequacy). Second, it can respond to recent criticisms from experimental philosophy concerning the diversity of such judgments (i.e. it can address recent criticisms). Third, the reliabilist strategy requires fewer epistemological commitments, while the experience-based strategy seems to rely on methodologically redundant and possibly controversial assumptions (i.e. it requires fewer epistemological assumptions). 
The upshot of my arguments will be this: although experiences of language understanding may sometimes be involved in arriving at intuitive judgments about meaning, it is the reliability of processes involved in the production of such judgments that can provide the justification required for grounding the methodological practice of relying on them. On the basis of these observations I will suggest that the reliability of intuitive judgments about meaning is a necessary condition for their use to be justified. By developing and comparing the two strategies just described, this chapter draws a novel connection between recent debates in methodology (broadly construed), epistemology of language understanding, and philosophy of linguistics. The chapter is structured as follows. In section 7.2 I explain what speakers’ intuitive judgments about meaning are and how they are used in philosophy of language and in some strands of linguistics. The experience-based and the reliabilist strategy are presented in sections 7.3 and 7.4, respectively. The two strategies are compared in section 7.5, where I argue that the reliabilist strategy, if successfully construed, could on its own vindicate the methodological practice of appealing to intuitions, irrespectively of corresponding experiences. In section 7.6 I reply to one objection. Section 7.7 concludes the chapter.

  , ,  

111

7.2 Intuitive judgments about meaning and their use Philosophers of language and linguists appeal extensively to intuitions about meaning as evidence for theories of language. Typically, these intuitions take the form of judgments about various aspects of the meanings of utterances— judgments that may come from philosophers themselves (as competent speakers of a language) or from other speaker-informants. Such judgments can be best understood as reports and answers that speakers provide to questions about various aspects of meanings of utterances. For example, judgments about the meaning of the word “bank” in the utterance “John went to the bank,” presented in a particular context, are common data for theories of disambiguation or theories of the truth-conditional content of utterances. Speakers’ intuitive judgments about utterance meaning have been argued to provide important data for many debates in philosophy of language and linguistics, for example contextualism versus relativism in semantics, “faultless” disagreement, the limits of truth-conditional semantics, vagueness, and the status of figurative utterances. Not only can this reliance on speakers’ judgments be observed in actual practice; it has also been explicitly endorsed by many philosophers of language. According to Stephen Neale, the intuitive judgments that speakers make about what an utterance meant, said, or implied, as well as about an utterance’s truth or falsity, constitute the primary data for our theory of interpretation (Neale 2004: 79). Jason Stanley and Zoltan Szabó (2000: 240) maintain that accounting for speakers’ ordinary intuitions about truth conditions is “the central role of semantics.” François Recanati (2013: 1–3) argues that speakers’ judgments about utterance meaning are direct intuitions about truthconditional content and provide data that should be accounted for by theories of meaning. 
It is typically agreed among the users of such judgments that the role they play is evidential. For example, if one puts forward the claim that the disambiguation process in a natural language contributes to the truth conditions of an utterance, one needs a good example to illustrate this claim, and speakers’ intuitions deliver some material that can support (or disconfirm) it. Observations about how we would interpret various utterances constrain our theoretical claims, and in that sense they are evidence. Speakers’ judgments about utterance meaning can directly support or disconfirm a hypothesis. Take the common intuition among speakers of English that the utterance “My friend went for a drive with the king of France last week” is false and the intuition of feeling merely “squeamish” when asked to assign a truth value to an utterance such as “The king of France is bald.” This contrast between the two intuitive judgments has often been taken to show that the first sentence carries no presupposition of existence (of a king of France), and hence it is simply false, while the second sentence has no truth value, because its presupposition


(that there is a king of France) is not satisfied. Whether this kind of pattern in speakers’ judgments about utterances with empty definite descriptions supports this further claim or not, it is certainly evidence to consider.² The armchair method of appealing to intuitions in philosophical debates has recently come under attack. Several studies conducted by experimental philosophers suggest that intuitive judgments made in prominent philosophical cases and used in various domains of philosophy tend to vary with demographic and presentational factors.³ On that basis it has been argued that intuitive judgments of various sorts are unreliable and should not be trusted (e.g. Weinberg 2009; Machery 2017). The extensive use of speakers’ judgments about utterance meaning has been criticized as well (Machery et al. 2004; Devitt 2012, 2013). Michael Devitt, for instance, argues against using them as evidence for theories of language and meaning. According to Devitt, speakers’ intuitions are fallible empirical judgments about language that merely track speakers’ folk theories of meaning rather than meaning itself. This is why he takes them to be essentially “metalinguistic” and argues that they are not a reliable guide to meanings but can only, at best, provide indirect evidence about linguistic reality (Devitt 2012: 562). In his view, we should not rely on folk metalinguistic intuitions, as many philosophers of language do, but on the intuitions of better trained experts, that is, those of philosophers and linguists themselves (2013: 87).⁴ How can the use of speakers’ intuitive judgments about meaning be defended and justified in the light of these criticisms? In what follows I present two strategies for explaining the evidential role and methodological utility of such intuitions—the experience-based strategy and the reliabilist strategy—and argue that the latter is better suited to justify the methodological practice of appealing to such judgments.

7.3 Intuitions and the phenomenology of language understanding: The experience-based strategy The first strategy for explaining the nature and evidential status of speakers’ intuitive judgments about the meanings of utterances is based on internalist

² Von Fintel (2004) rejects this further claim and discusses the origin of this particular pattern of speakers’ intuitive judgments. For experimental evidence about this pattern of judgments, see Abrusán and Szendröi (2013). ³ The effects have been reported for factors such as: cultural background (e.g. Machery et al. 2004; cf. Lam 2010), gender (Buckwalter and Stich 2014), order of presentation (e.g. Swain, Alexander, and Weinberg 2008) or emotional state (Schnall et al. 2008; Wheatley and Haidt 2005). ⁴ In a separate paper, I provide a detailed response to Devitt’s criticism (Drożdżowicz 2018). In brief, I argue that the involvement of metalinguistic abilities does not preclude the evidential role of such judgments given their causal connection to mechanisms responsible for arriving at meanings. For a different criticism of Devitt’s view, see Cohnitz and Haukioja (2015).

  , ,  

113

views within the epistemology of language understanding⁵ and on one recent popular approach in philosophical methodology, already introduced here as the perceptual view. It has been argued that, when listening to an utterance in a familiar language, a hearer typically becomes consciously aware of that utterance’s meaning. In typical cases of linguistic communication, competent language users undergo experiences of language understanding (e.g. Hunter 1998; Fricker 2003; Brogaard 2018). Such experiences can be made vivid in the following situation. Imagine hearing someone uttering a sentence to you in your native language, for instance “John went to the bank.” Now imagine that everything is the same, except that this person is talking in a language unfamiliar to you and expresses the same proposition to the effect that John went to the bank, but this time you have no idea what you have been told. It will strike you that the experience of listening to an utterance in a familiar language differs from the experience of listening to the one in a language you do not know. In the first case you have an experience of understanding the utterance, while in the second case you do not. This observation is often taken as evidence for the claim that experiences of language understanding exist (Fricker 2003; Brogaard 2018).⁶ Following this observation, it could be argued that (1) speakers’ intuitive judgments about meaning are based on such experiences of understanding, and (2) the judgments in question are prima facie justified in virtue of language users’ corresponding experiences. On this strategy, when one judges that an utterance has meaning x, one does so on the basis of one’s experience of understanding that utterance as having meaning x. For example, when one has an experience of understanding the utterance “John went to the bank,” one can judge that “bank” in the utterance refers to a certain kind of financial institution. 
The experience of understanding an utterance provides one with a prima facie justification for a belief and a corresponding judgment, although in some cases this justification may be overridden.

⁵ It is argued that experiences of understanding provide essential justification for the hearer’s knowledge (Fricker 2003) or justified belief about what a speaker said (Brogaard 2018). The reliability of belief forming process may also be required, but is not sufficient to secure knowledge or justified belief.

⁶ Other evidence comes from the sinewave speech phenomenon (Remez et al. 1981, cf. O’Callaghan 2011).

I now clarify the basic tenets and commitments of this account. The experience-based strategy is a straightforward application of several influential ideas from two recent debates. On the one hand, it draws on the results from internalist views in the debate about the epistemology of language understanding (e.g. Hunter 1998; Fricker 2003; Brogaard 2018). On the other hand, it can be seen as the application of a version of the perceptual view of intuitions. According to a recent version of this view (Bengson 2015; Chudnoff 2013a; Koksvik 2017), intuitions are a separate class of mental states with a presentational phenomenology, a feature they share with perceptual mental states. It is argued that, owing to their presentational phenomenology, intuitions have an evidential role: they provide a prima facie (albeit defeasible) justification for beliefs that assert their contents. The perceptual view draws on the idea that a subject can be justified in forming a belief and in expressing a corresponding judgment by having the relevant experience with a certain phenomenology.⁷ Explained in this way, the experience-based strategy is a prima facie live option, worth investigating in the debate about speakers’ intuitive judgments. The experience-based strategy draws on the analogy with perception. Experiences of understanding are taken to provide a speaker with evidence about utterance meaning, in a way similar to the one in which perceptual experiences are often claimed to provide subjects with evidence for their perceptual beliefs and judgments (Siegel 2010). On this account, the methodological practice of appealing to such judgments in the armchair could be vindicated as follows. If a theorist, be it a philosopher or a linguist, is a competent speaker of a language, she can make (in the armchair) all kinds of intuitive judgments about the meanings of utterances in real or stipulated contexts, just as any other speaker would do, so long as she has relevant experiences of understanding. Typically, language users have no direct control over whether in listening to (or reading) an utterance they experience understanding that utterance or not; experiences of utterance understanding are not volitional. The involuntary, automatic nature of such experiences suggests that they are probably produced by encapsulated mechanisms and that we cannot be aware of the details of the linguistic processes that yield them. These features have been taken to suggest that experiences of understanding are interestingly similar to paradigmatic perceptual experiences (Hunter 1998; Fricker 2003; Siegel 2006). Elisabeth Fricker (2003: 329) calls them quasi-perceptual.
The exact nature of the similarity between experiences of utterance understanding and typical perceptual experiences (such as visual or auditory experiences) is still debated.⁸ Settling this matter is not, however, required for developing the experience-based strategy. The relevant feature is what most participants in this debate seem to agree on, namely that experiences of utterance understanding can, in principle, provide defeasible justification for speakers’ beliefs about utterance meaning (e.g. Hunter 1998; Fricker 2003; Brogaard 2018). Without adopting the claim that experiences of utterance understanding are perceptual in a robust sense (Siegel 2006, cf. O’Callaghan 2011), one can develop the analogy between the epistemology of perceptual experience and that of experiences of language understanding. The analogy rests on the idea that having a perceptual or quasi-perceptual experience as of x can provide a prima facie justification for a belief or judgment about x. In visual perception, having the perceptual experience as of a red tomato has been argued to provide prima facie justification (leaving aside relevant defeaters) for my belief that there is a red tomato before me, a view labeled dogmatism about the evidential role of perceptual experience (e.g. Tucker 2013; Brogaard 2013). However, having an experience as of seeing a red tomato may not always be good evidence for the belief that there is a red tomato before me. For example, in cases of hallucination, an experience as of seeing a red tomato will not match how things are in the world. In a similar vein, having an experience as of understanding an utterance could be argued to provide prima facie, albeit defeasible, justification for the subject’s beliefs about the meaning of that utterance (Fricker 2003; Brogaard 2018); yet the evidence gained through such an experience may not always be good evidence. For example, a subject may mishear and in effect form an incorrect belief about the meaning of an utterance. If the analogy holds, then one could argue that the nature of intuitive evidence in the case of speakers’ intuitive judgments about meaning is analogous to the nature of evidence for perceptual beliefs and for corresponding perceptual judgments. Perceptual beliefs (and judgments) are formed on the basis of perceptual experiences, while speakers’ beliefs and corresponding intuitive judgments about utterance meaning are formed on the basis of experiences of utterance understanding. Speakers’ intuitive judgments about meaning are reports of what speakers take to be utterance meanings on the basis of their experiences, just as perceptual judgments are reports of subjects’ perceptual experience. According to the experience-based strategy, it is in virtue of having a relevant experience with a certain phenomenology that a subject is claimed to be justified in forming a belief and in expressing a corresponding judgment. The justification in question is within the hearer’s first-person perspective. Arguably, the main appeal of this strategy is that it tries to capture the nature and epistemic status of intuitions about meaning by drawing on fairly uncontroversial observations about the typical phenomenology of intuitions and language understanding.⁹

⁷ One important difference between the experience-based strategy and the perceptual view is that the former draws a connection between experiences and judgments, while the latter focuses on identifying intuitions as experiences and does not account for the role of intuitive judgments as such.
⁸ For discussion, see Siegel 2006; Bayne 2009; O’Callaghan 2011; see (minor revisions) and the related debate about cognitive phenomenology (Bayne and Montague 2011; Prinz 2011; Dodd 2014).

⁹ Somewhat related remarks have been made about the potential role of phenomenology in explaining the nature of syntactic intuitions (e.g. Textor 2009; Fiengo 2003; Pietroski 2008). Syntactic intuitions are sometimes argued to have a special phenomenology, in that they involve an experience of unacceptability of an utterance, that is, a negatively valenced experience of its “badness” or “yuckiness” (Textor 2009; Luka 2005; Pietroski 2008).

7.4 Intuitive judgments and the monitoring of speech comprehension: The reliabilist strategy

Following externalist views in epistemology, one can look for factors other than conscious experience to explain the evidential status of intuitive judgments about meaning and to ground the methodological practice of appealing to them.
Given that intuitive judgments should be truth-conducive (insofar as they track facts about meaning), one can appeal to the reliability of mechanisms that are required for their production and for their connection with such facts. Following proper function reliabilism (Plantinga 1993; Goldman 2011), one could try to develop an account according to which speakers’ judgments about meaning, by being a product of reliably operating processes, correspond to or covary with external facts about the meanings of utterances retrieved in the course of utterance comprehension. In what follows I will assume that such facts exist independently of intuitions; they are not constituted by them (for the opposite view, see Cohnitz and Haukioja 2015). For various reasons, such facts may be hard to define. First, it is debated how they are typically determined: whether they are fixed roughly by the speaker’s intention (Grice 1989; Wilson and Sperber 2012), by linguistic conventions, or by some other worldly factors (Williamson 2007a; Dorr and Hawthorne 2014). I will remain largely neutral on that matter, but will assume that the properly functioning language system (broadly construed) can reliably track such facts when arriving at interpretations of utterances. Arguably, if intuitive judgments can reliably track the outputs of the language system, then they can also indirectly track the relevant facts about meaning. In a separate paper I develop and defend a reliabilist position in this debate (Drożdżowicz 2018). Here I briefly summarize the main points of this proposal in order to illustrate one version of the reliabilist strategy and compare it with the experience-based strategy. The purpose of this account is to postulate a generally reliable mechanism that generates intuitive judgments. The mechanism is based on a suggested connection between intuitive judgments about meanings and the output of the language system. 
The proposal is explicated by looking at psychological and linguistic processes that are typically involved in arriving at a judgment about what an utterance communicated and by assessing their average reliability in terms of the monitoring functions operating in language comprehension (Drożdżowicz 2018). First, the procedure of coming up with a judgment typically involves confronting a subject with an utterance. In order to comprehend its meaning, a competent language user has to go through several processes: phonological decoding, parsing, and drawing on contextual information in the course of assigning a certain contextually relevant meaning. The underlying psychological mechanism delivers as its output an interpretation assigned to an utterance. It involves all stages of linguistic comprehension (grammar, and perhaps a semantic module, if there is one) as well as a variety of performance and pragmatic systems, yet the process is quick and automatic.¹⁰ The function of this mechanism is to deliver accurate

¹⁰ Speech comprehension is a matter of milliseconds. Late semantic and pragmatic processing is traced in ERP components from around 400 to 600 milliseconds after the sentence is presented (Coulson 2004).

information about the communicated meanings of utterances. Furthermore, the mechanism has to make its output, that is, the information it delivers, available to what is sometimes called the central system, where the central system can be characterized as a system responsible for using and integrating the outputs of specialized subpersonal systems, for example to form beliefs, make conscious decisions, and so on. In benefiting from linguistic communication, the central system can fulfill several essential life functions such as guiding the subject’s action, forming beliefs, and acquiring knowledge. On this account, the output (interpretation) is made available through constant monitoring. Extensive evidence suggests that monitoring is an intrinsic element of speech comprehension and production (Silbert et al. 2014; Pickering and Garrod 2013, 2014). One way to develop this idea has been in terms of what is called prediction models, where the predicted production and comprehension representations of the linguistic features of utterances (phonological, syntactic, semantic and pragmatic) are constructed beforehand and compared with the implemented representations as soon as the latter become available (Pickering and Garrod 2014). For example, when you hear me saying “John went to the b . . . ” as in the earlier example, you are likely to construct a prediction model such that the predicted representation of the utterance ends with the word “bank.” This prediction model will be immediately compared with the actual representation of my utterance and may or may not match it (“bank” with a particular meaning). This kind of monitoring in both production and comprehension is a rapid process, typically implemented on the fly and without conscious reflection (Hickok 2012).¹¹ The monitoring of the interpretation has to be (generally) reliable in order to make the output available to the central system. This feature is crucial for judgment making. 

¹¹ See Brouwer et al. (2017) for a recent overview of neuropsychological literature on monitoring in language comprehension that summarizes results concerning N400 and P600 components.

When a competent language user is asked to make a judgment about what a speaker intended to communicate with an utterance, for example by using the word “bank,” the interpretation of that utterance is already available to the central system and serves as an input to the further task of forming a judgment. Judgments about utterance meanings can be expected to be generally reliable, thanks to the constant monitoring of interpretations involved in speech production and comprehension. Several observations about the importance of reliable linguistic communication for the transmission of information in humans seem to support this claim. We commonly rely on what we take others to say, via testimony, and we commonly fall back on the information passed through linguistic communication (Lackey 2008; Goldberg 2010). Often we need to justify, to others or to ourselves, our beliefs and decisions made on the basis of communication. In order to do that, we need to have conscious or reflective access to at least some aspects of the output of the comprehension system, that is, of interpretation.
We need to be able to make a generally reliable judgment about what an utterance meant. The monitoring mechanisms of speech comprehension allow us to retrieve such information and to draw on it when making explicit judgments. In the light of this evidence, judgments about utterance meaning can be seen as generally reliable thanks to the involvement of monitoring, which is essential for speech comprehension and production and for our communicative practices. The scope of this reliability claim is of course limited. First, the reliability of a judgment is a function of the reliability of the language system (broadly construed). If for some reason a hearer has misunderstood an utterance, she is unlikely to provide a correct judgment. Second, as with any intuition, belief, or judgment, several factors may influence one’s abilities to judge what the utterance meant, even when one’s language system correctly tracked the facts about that meaning. For this reason, judgments about meaning can be seen as generally reliable only in the absence of interference from other systems (for a similar remark concerning syntactic intuitions, see Rey 2014b). The connection sketched above between assigned interpretations and judgments explains why the latter can be used as generally reliable evidence about the former and can provide a gateway to facts about meanings that in normal cases should be tracked by the interpretations assigned by the language system. I call the above proposal “the voice of performance view,” because the reliability of intuitive judgments is grounded in monitoring mechanisms that constantly operate to secure a safe uptake of the information during speakers’ linguistic performance.¹²

¹² The reliabilist strategy can take different forms. For a different account, see Cohnitz and Haukioja (2015: 629), who argue that referential intuitions are direct outputs of speakers’ “competence with proper names in general.” Their view can thus be seen as an instance of the so-called voice of competence view, widely debated in the case of syntactic intuitions (Devitt, this volume; Rey, this volume). I discuss their proposal and compare it to the one sketched above in Drożdżowicz (2018).

7.5 Comparing the two strategies

The experience-based strategy for justifying the use of speakers’ intuitive judgments about meanings derives its evidential status from the prima facie justification provided by corresponding experiences of understanding utterances. The reliabilist strategy derives the evidential utility of such judgments from the expected average reliability of the psychological mechanisms that underlie their production and secure their connection with the output of psychological mechanisms responsible for language understanding. What are the merits of the two strategies? Is one of them more fundamental than the other? I will argue that we have strong prima facie reasons to favor the reliabilist strategy. In order to address this matter, I will propose three parameters that I take to be indicative of successfully justifying the evidential use of intuitive judgments about meaning: descriptive adequacy; addressing recent criticisms; and epistemological assumptions. I will then argue that the reliabilist strategy fares better than the experience-based one on all three of them.

7.5.1 Descriptive adequacy

Which of the two approaches best captures the practice of appealing to intuitive judgments about meanings of utterances? The two approaches seem to provide very different descriptions of the practice. The experience-based strategy focuses on the armchair practice that language users and theorists engage in when coming up with judgments. Since the justification is claimed to come from having experience with a particular phenomenology, the first strategy captures the process of forming a judgment about utterance meaning from a first-person perspective. In many cases such a description may seem accurate. However, forming a judgment about the meaning of an utterance need not require acknowledging the presence of such experiences. It has been argued that, when the process of language understanding goes smoothly, language users rarely reflect on the process and need not pay attention to corresponding experiences (Carston 2010; see also Koriat 2007). Assuming that the cases of smooth unproblematic communication are predominant, language users will often be in a position to issue judgments about various aspects of utterance meaning without paying extra attention to the experiences that underlie such judgments. To make such decisions, they only need to accurately draw on the outputs of the interpretation processes. On the other hand, one could argue that in some cases, when grasping the meaning of an utterance is more complicated and requires some reflection (Carston 2010: 295) and when a hearer is explicitly asked to justify and ground her judgment (Hunter 1998), it may be necessary to take the route back and reflectively acknowledge the presence of a relevant experience of understanding. The reliabilist strategy is prima facie compatible with the existence of such experiences, but it has a different focus.
It characterizes the practice from a third-person (or an observer’s) perspective and provides a justification in terms of the underlying mechanisms. On this approach, speakers’ judgments about meanings are treated first and foremost as data, and as such they can be systematically investigated in the construction and testing of hypotheses.¹³ The characterization provided by the reliabilist strategy is therefore compatible with several versions of the explicit description or endorsement of the practice of using such judgments, in

¹³ Note that this practice need not always require the reliability of the content of such judgments. The reliability of data, i.e. of subjects’ responses, may in some paradigms be enough to apply this methodology. See Gross (this volume) for a discussion about the distinction between these two types of reliability.


philosophy of language and in various strands of linguistics (e.g. Stanley and Szabó 2000; Recanati 2013; Devitt 2012, 2013). On the reliabilist account, the relevance of judgments as data for hypotheses is grounded in their being reliably connected to the output of the psychological mechanisms responsible for language understanding. Moreover, the reliabilist strategy seems more descriptively adequate than the experience-based alternative. The former can capture not only the armchair practice of appealing to one’s own judgments about meaning, but also the systematic use of such judgments in experimental pragmatics as well as in experimental semantics and psycholinguistics (Phelan 2014; Drożdżowicz 2016). In these domains of inquiry, speakers’ judgments are probed in order to confirm or disconfirm hypotheses about speakers’ preferences for one particular interpretation as opposed to another. The hope is that, by using an experimental approach to probe patterns of intuitive judgments, theorists can obtain crucial evidence for deciding between alternative theories. Sometimes two theories will predict different interpretations for the same utterance. Sometimes, while predicting the same interpretations, they will differ in their predictions about the cognitive mechanisms that produce these interpretations (Noveck and Sperber 2007).¹⁴ If having an experience of understanding an utterance in the armchair is enough to ground the methodological practice of using intuitive judgments about meaning, then why go to the trouble of testing them experimentally? The experience-based approach, by focusing only on the armchair practice, does not provide a convincing explanation either for the widespread practice of treating intuitive judgments as data or for the experimental approach to probing them. It is therefore less compatible with the commonly accepted descriptions of the practice of appealing to judgments about meaning (e.g. Recanati 2004; Noveck and Sperber 2007). 
Because of this, I argue, the experience-based strategy seems overall less descriptively adequate than the reliabilist strategy.

7.5.2 Addressing recent criticisms

Another way to compare the two strategies is to see how they fare in terms of responding to recent criticisms of appealing to intuitions in philosophy and of appealing to speakers’ intuitions about meaning in particular. Experimental philosophers have argued against the practice of appealing to intuitions in

¹⁴ Several tasks have been used to probe speakers’ intuitive judgments about utterance meaning, e.g. the verification task or the truth-value judgment task; see Geurts and Pouscoulous (2009), Benz and Gotzner (2014), and Degen and Goodman (2014). Like any experimental approach, the systematic probing of speakers’ judgments about utterance meanings requires various methodological decisions. For a detailed overview of recent methods and related challenges, see Phelan (2014).

philosophy by showing that intuitions are subject to various distortions and confounding factors, such as cultural background (e.g. Machery et al. 2004; cf. Machery et al. 2017), gender (Buckwalter and Stich 2014; cf. Adleberg et al. 2014), order of presentation (e.g. Swain et al. 2008), or emotional state (Schnall et al. 2008; Wheatley and Haidt 2005).¹⁵ The criticism targets most domains of philosophy, including philosophy of language and the reliance on intuitive judgments about meaning, broadly construed. Machery et al.’s (2004) pioneering study was argued to show that intuitive judgments about reference vary across cultures (see also Sytsma et al. 2015). Relatedly, Michael Devitt (2012, 2013) has argued that the use of speakers’ intuitive judgments about meaning is problematic, because they are based on people’s folk theories of meaning (2012: 561–3). Devitt calls the reliance on such intuitions a common mistake made by philosophers of language and linguists (2013: 87). Which of the two accounts could better respond to the above charges? Once we acknowledge the nature of criticisms raised by experimental philosophers and Devitt, the case seems to speak for the reliabilist account. The reason is quite simple: the charge is typically phrased in terms of intuitions’ unreliability. For example, Machery (2017) defends the view that the main problem of the armchair method of cases used by philosophers is that judgments made in philosophical cases are unreliable due to several problematic features that philosophical examples typically have. Machery (2017: 96) defines reliability as a dispositional property of a process producing a particular type of judgment, such that the process tends to produce a large proportion of true judgments, assuming that its inputs are true. 
He then argues that studies presented by experimental philosophers provide an inductive basis for the claim about the unreliability of various types of intuitive judgments used in philosophy, including judgments about reference used by philosophers of language. Devitt’s (2012, 2013) criticism, too, can be naturally phrased in terms of reliability, since, according to him, the connection between meanings themselves and the judgments that language users make about them is not direct enough for the latter to constitute evidence about the former, given the interfering folk theories and the confounding metalinguistic factors. Not surprisingly, the experience-based and the reliabilist strategies differ substantially in terms of what they can offer in response to these criticisms. Given that the experience-based strategy derives the evidential utility of intuitions from the phenomenology of experiences that correspond to beliefs and judgments, it has little to say about the reliability of the underlying process of forming such experiences and the corresponding judgments. In this respect, its explanatory

¹⁵ Some of these effects were not replicated by subsequent studies (e.g. Lam 2010; Seyedsayamdost 2015a, 2015b), which may suggest that the initial skepticism towards folk intuitions is not well founded, or that the instability effects are rather elusive.

power seems to be entirely dependent on the success of the reliabilist view. After all, if the proponents of the experience-based strategy were to find out that the processes of forming such experiences and the corresponding beliefs and judgments are unreliable, this would constitute an undercutting defeater for the warrant that experiences of understanding are taken to deliver.¹⁶ The reliability of processes that produce experiences and corresponding intuitive judgments is a requirement if one wishes to defend the practice of appealing to the latter. The experience-based view remains silent on that matter, and in consequence it seems ill suited to address these criticisms, unless it already presupposes the reliability of an underlying process. If we agree that a good justification for the use of intuitive judgments about meaning should be responsive to recent criticisms phrased in terms of reliability, then we have some reasons to prefer the second strategy, given its structure and focus on reliability. However, it is still an open question to what extent the reliabilist strategy can successfully address the criticism that comes from studies such as those of Machery et al. (2004, 2017). I return to this problem in section 7.6, where I argue that the reliabilist view has multiple resources to account for such results.

¹⁶ For a related point in the context of rational intuition, see also Machery (2017).

7.5.3 Epistemological assumptions

The third parameter in comparing the two strategies concerns the epistemological assumptions they require. I will argue that the experience-based strategy comes with non-trivial assumptions that are, strictly speaking, redundant for the task of providing a convincing justification for the use of intuitive judgments about meaning. The prima facie justifying role of experience has been widely defended in epistemology (BonJour 1980; Tucker 2013; Brogaard 2013; Fricker 2003). In particular, internalist views in the debate on belief and knowledge are typically supported by claims about a first-person rationale or justification that experiences can deliver. A famous example, often used to demonstrate the purported role of such rationale, is the case of Norman (BonJour 1980). Norman is a completely reliable clairvoyant and has no evidence or reasons of any kind for or against possessing this kind of cognitive power. One day Norman comes to believe that the president is in New York. Although his belief-forming capacity is fully reliable, Norman has no evidence for or against his belief. With this example internalists argue that Norman’s belief is not justified because Norman lacks an internally available rationale for it. Drawing on this observation, they argue that in typical cases of belief formation the mere reliability of belief formation, as in the case of Norman, is not sufficient for justified belief or knowledge. People, claim the internalists, are unlike Norman in that, for their beliefs to be justified or to amount to knowledge, they need to have internally available reasons. Perceptual experiences are taken to be a good example of such reasons. Although intuitive judgments about meanings can be reasonably construed as being based on perceptual-like experiences of understanding, examples such as that of Norman seem to have little argumentative appeal in the case of justifying the methodological role of such judgments, or so I will argue. The experience-based strategy comes with an assumption that experience delivers a kind of internalist justification that is crucial for a corresponding judgment about what an utterance meant. I will now argue that we have grounds to reject this assumption when justifying the methodological practice discussed here. Consider the following imaginary scenario, which is analogous to perceptual blindsight. In blindsight, subjects who are cortically blind and lack visual experiences from one side of their visual field can nevertheless respond reliably to visual stimuli that they do not consciously see (Overgaard 2011). In the parallel case that I propose here, a language user who has to make a “yes–no” judgment about the meaning of an utterance may reliably provide correct answers to various questions, although she lacks a corresponding experience of understanding. Just as perceptual blindsighters lack introspective access to their perceptual experiences, language-understanding blindsighters cannot access or consciously draw on relevant experiences of language understanding. We can call the latter group language-understanding zombies.¹⁷ Let us now compare typical vision blindsighters with language-understanding zombies.

¹⁷ Although this example may seem outlandish, its variants have been discussed in recent philosophical literature. Fricker’s (2003: 337) case of Ida provides one example of such hypothetical language users. The case is discussed in the context of first-person knowledge about what was said. Bengson (2015: 742) introduces a parallel case of intuitive blindsighters and claims that subjects who lack relevant presentations will also lack the corresponding justification for their beliefs.

Suppose that, for some reason, perceptual blindsighters become the only population available to vision scientists. Scientists know that these subjects do not have experiences that correspond to what is in one side of their visual field, but can reliably respond and show sensitivity to various features of those unexperienced objects. Moreover, they can draw on this unrepresented information in their actions, judgments, and decisions. In this case, judgments of perceptual blindsighters become a crucial source of data for hypotheses about human vision mechanisms. Should vision scientists discard these data only because blindsighters do not experience objects in one side of their visual field? It seems plausible that the answer to this question is “no.” Scientists would probably choose to be pragmatic and, knowing that the judgments of blindsighters reliably track what their visual system unconsciously represents, they would accept these judgments as prima facie good data for their research. A similar case can be made, I think, for acceptability judgments that are commonly used in linguistics by syntacticians (Maynes and Gross 2013; Rey 2014b). If for some reason linguists had access only to a group of competent speakers who reported no feelings about whether a string of words is acceptable or not, and who nevertheless responded reliably and correctly in a forced-choice task, then linguists would, I think, accept the judgments of this group of speakers as relevant data for theories, on the assumption that they are reliable. Suppose now that most language users turn out to be language-understanding zombies. They can reliably form beliefs and make judgments about what various utterances of their language mean, but they do not have conscious quasi-perceptual experiences of understanding those utterances. In order to build the analogy with perceptual blindsight, assume that such language-understanding zombies are capable of coming up with reliable and correct judgments, and that theorists who draw on those judgments have prima facie good theories about their reliability. We can now ask an analogous question: should philosophers discard these judgments as data only because language-understanding zombies do not have corresponding experiences? The answer to that question is, I think, “no.” If it could be established that they are reliable in their judgments, subjects without corresponding experiences of understanding would provide good data about the meanings of utterances just as much as subjects with their phenomenology left intact. Their reliable but not experience-based judgments would still provide evidence for theoretical claims. This hypothetical scenario suggests quite convincingly that experiences of understanding, although they may play a role in the first-person process of forming a belief and coming up with a judgment, are, in the methodological context, explanatorily inert.
Since the role of experience seems to be redundant in this context, the experience-based strategy comes with an unnecessary epistemological assumption. The reliabilist strategy, on the other hand, provides one kind of justification that, if proved successful, would be good enough to ground the practice. This provides another reason to prefer the reliabilist strategy. The arguments adduced here seem to suggest that the overall reliability of intuitive judgments about meaning is a necessary condition for justifying the methodological practice of appealing to them in theoretical debates in philosophy and linguistics.¹⁸

¹⁸ The fact that in this theoretical and methodological context reliability is a crucial concern is compatible with acknowledging that in the context of everyday communication, when speakers are interacting with each other, such experiences may have an important epistemic role. In particular, it has been argued that such experiences are a necessary part of the justification required for knowing what a speaker said (Fricker 2003) and that they provide a prima facie justification for beliefs about what a speaker said (Brogaard 2018). The above proposal leaves these claims untouched. It is also compatible with the possibility that theorists who treat themselves as subjects and have reasons to believe that they are reliable may thereby also satisfy the internalist demand for an accessible justification, albeit in a different manner. I owe this last observation to Steven Gross.

  , ,  

125

7.6 Objections and replies

I argued that we have three reasons to prefer the reliabilist strategy of justifying the use of speakers’ intuitive judgments about the meanings of utterances to the experience-based strategy. One objection that could be raised to my proposal is that it is not clear whether the reliabilist strategy, as sketched here, can address the charges raised by results from experimental studies such as Machery et al. (2004) or Sytsma et al. (2015), which suggest that speakers’ referential judgments vary with culture. Providing a detailed description of the underlying psychological mechanisms and ascribing them some level of expected reliability in the production of intuitive judgments seems orthogonal to the observed effects of the cultural variation of such judgments. If skepticism about the methodological use of speakers’ judgments is at least partly fueled by such results, then it is far from clear whether the reliabilist accounts, so construed, have resources to address the charge and to justify the contentious methodological practice.¹⁹ The objection seems to pose a genuine worry for accounts like mine or the one by Cohnitz and Haukioja (2015). In reply I will show that the reliabilist strategy has multiple resources for accommodating the results from this and similar experimental studies. First, an important qualification of this strategy is that the expected reliability of processes delivering intuitive judgments about meanings of utterances is conditional (Drożdżowicz 2018: 190; Cohnitz and Haukioja 2015: 628). It is determined by two types of factors. One type is factors related to interference effects on the underlying psychological mechanisms. It is part of the reliabilist account that, if such interferences were to take place occasionally, the reliability of the underlying process and the epistemic value of judgments would be in such cases compromised.
Their overall relative reliability allows for cases of occasional unreliability that are due to such interferences (see Rey 2014b for an analogous notion of reliability developed for syntactic judgments). Another type is that of factors related to the quality of probing methods. Probing may have a great impact on the quality of speakers’ judgments about meaning. In some cases using questions with semi-technical metalinguistic terms may be problematic, because participants may understand them differently and draw on their folk theories (e.g. Devitt 2013). In other cases, some more direct experimental techniques may be more suitable for probing judgments than the traditional armchair methods (Noveck and Sperber 2007; Phelan 2014). The occasional unreliability of judgments may therefore occur due to unsuitable probing techniques. The fact that this type of conditional reliability is built into the reliabilist accounts leaves open the possibility that the cases of unreliability presented here may occur. When

¹⁹ I thank Michael Devitt for raising this issue and for helpful discussion on the topic.


they do, the reliabilist strategy would be a starting point in explaining where the unreliability comes from by pointing to conditions that were not fulfilled and by identifying the factors that possibly interfered with the underlying psychological mechanisms and could be responsible for the observed effects. Following these observations, there are at least two ways in which the reliabilist strategy could address the results of Machery et al.’s (2004) study. One would be to question some aspects of the probing method used in the study. Such criticisms have been made and discussed in the literature (Martí 2008; cf. Machery et al. 2009). Even when responding to parallel results from a revised design (Machery et al. 2009; Sytsma et al. 2015), one could argue that the vignette method with Gödel-style cases is not suitable to capture speakers’ referential intuitions. Instead, one could argue that direct measures of linguistic performance (e.g. eye-tracking, reaction time studies) could be better suited for that task (e.g. Wendt et al. 2014; Rabagliati and Robertson 2017). A more radical way of addressing studies that show the effects of cultural variation for referential intuitions has been suggested by Cohnitz and Haukioja (2015). On the one hand, they argue that, if the cultural variation in such intuitive judgments were highly systematic, then we should conclude that theorists need to propose different theories of reference for different cultures. On the other hand, if the variation were found to be highly unsystematic and not to be explained in terms of design or confounding factors, then, they argue, we should be prepared to accept that “there simply isn’t a phenomenon of semantic reference that could be studied at the level of shared languages and that could play a substantial part in the systematic explanation of events of successful communication” (Cohnitz and Haukioja 2015: 640). 
Radical as they may seem, both conclusions would result in a substantial redefining of the research subject in this debate and in clarifying its underlying assumptions. Such revisions need not be detrimental to the discipline; rather, they may be crucial for redirecting its goals and for making progress. The discussion in this section suggests that there are several ways in which the reliabilist strategy can accommodate the reported results of cultural variation in referential judgments. Which of the replies listed here should we prefer? Research on referential intuitive judgments is an ongoing project (e.g. Devitt and Porot 2018). It remains to be seen whether, with the growing sophistication of experimental methods, a fine-tuned and generally reliable armchair method of probing such judgments may still be attainable.

7.7 Concluding remarks

Is the use of speakers’ intuitive judgments about meaning justified? I have presented here two strategies that address this question—the experience-based and the reliabilist one—and argued that we have three reasons for favoring the latter.

  , ,  

127

I have questioned the theoretical utility of the experience-based strategy for addressing this problem and suggested instead that the reliability of processes that produce such judgments seems to be a necessary condition for justifying the methodological practice of appealing to them. The chapter shows how several strands of research in the epistemology and methodology of language can be used to address important questions around the common practice of appealing to intuitions in our research.

Acknowledgments

I would like to thank Nick Allott, Karen Brøcker, Andreas Brekke Carlsson, Michael Devitt, and Steven Gross as well as an anonymous reviewer for helpful comments and discussions on this chapter.

8 How we can make good use of linguistic intuitions, even if they are not good evidence

Carlos Santana

[M]any of the philosophical ripostes to generative linguistics misfire because they fail to incorporate Chomsky’s fundamental methodological precept that linguistics is a science and not an a priori discipline.
John Collins (2008a: 24)

Linguistics is a science and not an a priori discipline. This presents us with a trilemma: (1) If a discipline relies on intuition as its primary source of evidence, it is an a priori discipline and not a science. (2) Some subfields of linguistics—notably generative syntax and formal semantics—rely on intuition as their primary source of evidence. (3) Linguistics is a science and not an a priori discipline. We could escape the trilemma by rejecting (3), but this would be to commit the philosopher’s mistake that Collins admonishes us to avoid. A response friendlier to linguistics is to reject (1), on the grounds that some intuitions count as empirical, scientific evidence. Linguistic intuitions, this argument goes, are distinctive among intuitions in that they possess this empirical quality.¹ Thus at least one discipline, linguistics, can rely on intuitions as evidence and yet not be an a priori discipline.

¹ If you think that philosophy relies on intuition evidence and that some philosophical intuitions are a posteriori but not linguistic, then you could argue that philosophical intuitions provide a second example. Linguistic intuitions would still be distinctive by being empirical evidence, but not quite unique.

Carlos Santana, How we can make good use of linguistic intuitions, even if they are not good evidence. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Carlos Santana. DOI: 10.1093/oso/9780198840558.003.0008


In this chapter I present the leading accounts of how linguistic intuitions might constitute empirical, scientific evidence and I provide reason to reject them. Given the failure of these accounts, I conclude that linguistic intuitions are not scientific evidence. This fact requires an alternative escape from the trilemma, and I argue for rejecting (2). Contrary to appearances, intuitions do not necessarily function as evidence in syntax and semantics, thus those fields can be safely counted among the empirical sciences.

8.1 The evidential role of linguistic intuitions

Prominent linguists sometimes claim that intuitions do in fact constitute confirmatory evidence in linguistics. “In actual practice,” Chomsky (1986b: 36) writes, “linguistics as a discipline is characterized by attention to certain kinds of evidence that are, for the moment, readily accessible and informative: largely, the judgments of native speakers.” In an especially hyperbolic mood, syntacticians have even claimed that, when it comes to grammaticality, “[a]ll the linguist has to go by . . . is the native speaker’s intuitions about language” (Haegeman 1994: 8). Claims of this sort are exaggerated, since it is clear that other sorts of data are accepted and commonly used in generative linguistics. But even many linguists who acknowledge the existence of other sorts of relevant evidence see intuitions as central to linguistic confirmation. Psycholinguists Lila and Henry Gleitman, for instance, argue that “[t]he mental events that yield judgments are as relevant to the psychology of language, perhaps, as speech events themselves” (Gleitman and Gleitman 1979: 105). The attitude that linguistic intuitions are an important sort of evidence for the science of language is widely shared and informs scientific practice.

This brings us back to our motivating puzzle. Intuitions are not generally counted as empirical evidence in modern science; so, insofar as linguistics does rely on intuitions as evidence, it seems out of place in the constellation of sciences. That this is a real concern is demonstrated by the fact that, throughout the history of linguistic science, critics from within linguistics have expressed reservations about the practice of using intuitions as evidence (e.g. Bloomfield 1933; Harris 1954; Labov 1996; Wasow and Arnold 2005).
These reservations require a response, and the aim of the response should be to demonstrate that intuitions about language, in contrast to intuitions about, say, chemistry or economics, have a distinctive feature or features that enable them to function as scientific evidence. Linguists and philosophers have suggested three such features that distinguish intuitions about language. First, intuitions about language, as opposed to nonscientific intuitions elsewhere, lead to fruitful scientific discourse. Second, linguistic intuitions have a close causal relationship with language, whereas intuitions in other disciplines do not typically have a close causal relationship with their subject

        


matter. Finally, intuitions in linguistics are reliable, unlike intuitions in other disciplines. Each of these distinctive features—fruitfulness, apt etiology, and reliability—would explain why intuitions in linguistics could have the status of scientific evidence, and thus would allow us to reject (1) from our original trilemma. We shall scrutinize each one in turn.

8.2 Fruitfulness

One means of justifying the scientific status of linguistic intuitions is to appeal to their role in producing fruitful scientific research programs. Gross and Culbertson (2011: 654) provide a recent example of using this justification rather than an etiological one, acknowledging that “the relation between linguistic competence and the cognitive capacities recruited in meta-linguistic judgments remains obscure.” They argue instead that one main justification for using intuitions as evidence comes from “the continued success and fruitfulness of the theories based upon them.” The argument, I take it, is straightforward: if a scientific methodology leads to fruitful theories, it is justified; syntax has produced fruitful theories, and its methodology involves intuition-evidence; therefore intuition-evidence is justified. I do not want to deny that syntax has produced fruitful theories. I reject, however, the premise that if a scientific methodology leads to fruitful theories it is justified. Fruitfulness is a weak, defeasible justification; more rigorously epistemic considerations, such as accuracy and consistency, have priority. Agreement or disagreement with evidence known to be reliable will usually trump appeals to fruitfulness when it comes to deciding what sorts of evidence to use. In large part, this is because fruitfulness suggests but hardly entails reliance on good evidence. Consider your least favorite major religion. It is a safe bet that theologians of that religion have been making fruitful inquiries for centuries. They publish prodigiously, engage in persuasive argumentation, look back on the history of their discipline and feel like they have made progress, and have reached consensus on a number of the most central issues in theology.
From the theologian’s perspective, their field constantly appears to “disclose new phenomena or previously unnoted relationships among those already known,” to use Kuhn’s definition of scientific fruitfulness (Kuhn 1977: 321). All this fruitfulness, and their sources of evidence are not even evidence for anything real! The fruitfulness of their theological science is hardly a convincing reason to take their scriptures or revelations as good scientific evidence.² Fruitfulness is thus a theoretical virtue of lesser

² An anonymous reviewer suggests that, while the fruitfulness of theological science might not justify their religious claims, theologians of the major religions have been successful in advancing views in related fields such as metaphysics and logic. Shouldn’t this justify their methods? Certainly; but note


importance, more telling through its absence than its presence. A lack of fruitfulness might be damning for a particular scientific methodology, but its presence guarantees little. That generative syntax and formal semantics have been fruitful is thus not a particularly good reason to take the use of intuition-based evidence to be a successful methodology. It might, perhaps, add a little justificatory weight to the methods of the science, when considered in tandem with other reasons for thinking that those methods are sound—for example their reliability. But the bulk of the justification rests on those other scientific virtues, and those weightier justifications are the topic on which I will expend my efforts.

8.3 Apt etiology and the ontology of language: The social account

Linguistic intuitions, the second argument runs, are fairly direct causal products of language itself, so they carry empirical information about language. Most other intuitions about scientific objects, however, do not have an origin in the objects themselves; intuitions about electrons are not principally the product of electrons, so we cannot study subatomic particles by using intuition-evidence. Linguistics’ anomalous use of intuition-evidence is thus explained by the unique way in which linguistic intuitions are related to language. As I discuss elsewhere (Santana 2016), however, there are multiple valid conceptions of language, and this means that there can be multiple etiological defenses of language. Most contemporary defenders of the evidential status of linguistic intuitions take the primary subject matter of linguistics to be a psychological entity, and their argument from etiology accordingly focuses on the causal connection between intuitions and the psychological faculty of language. This is the case largely because linguistic intuitions are primarily used as evidence in generative linguistics, and generative linguists are typically committed to a mentalistic ontology of language. However, at least one notable figure, Devitt, defends intuitions on the basis, in part, of their causal connection to the social object that is language. Devitt’s approach to the etiology of linguistic intuitions is part of a larger account, ultimately meant to establish the reliability of intuitions; and, like my account, it tries to balance existing practice against a healthy skepticism toward intuitions. Nevertheless, it is useful to focus on the etiology Devitt proposes because it is a compelling alternative to the mainstream mentalistic view.

that it is not appeals to scripture and revelation that produce successful philosophical advances within theology, but just good old philosophical reasoning, which is independently justified. The success of theologians in developing logic and metaphysics does not justify the methods that distinguish theology as a science: appeals to revelatory authority.

        


Devitt appeals to the expertise held by all of us, who live in a language-saturated world. On his picture (Devitt 2006c: 497–8), a competent speaker “is surrounded by tokens that may . . . be grammatical, be ambiguous, corefer with a specific noun phrase and so on.” Experience with these tokens gives her the material to form a sort of folk theory of language if she is reflective about her linguistic experiences. Internalizing this folk theory allows her to “judge in a fairly immediate and unreflective way that a token is grammatical, is ambiguous, does corefer.” Devitt’s argument is not sound. Suppose that the relationship between learned theories and intuitions is straightforward: the speaker possesses a theory of grammatical correctness, unconsciously applies that theory to a sample sentence, and unconsciously produces a judgment of whether that sentence is grammatical according to the theory. Should the speaker’s judgment in this case count as scientific evidence for a syntactic theory? Clearly not. In this case intuitions are merely going to confirm the theory the speaker already possesses. If the speaker is a non-linguist, this means that the intuitions are at best evidence we can use to reconstruct that speaker’s folk theory of language. To take the layperson’s intuitions as evidence for the structure of language or of the language faculty would thus be to put more trust in folk linguistics than is warranted. What if the intuitions are those of a trained linguist? If anything, the situation is worse, since, if the linguist already accepts the theory she is trying to provide evidence for, the theory itself will shape her intuitions. Taking her intuitions as evidence will thus lead to an especially pure kind of confirmation bias, in which the theory under consideration causes her intuitions, which are in turn used to support the same theory. I am far from being the first to note this problem. 
Dahl writes: “It is well known among linguists that intuitions about the acceptability of utterances tend to be vague and inconsistent, depending on what you had for breakfast and which judgment would best suit your own pet theory” (Dahl 1979, quoted in Schütze 1996: 113). Note that this sort of researcher bias is different from the many unproblematic ways in which theory can inform how we interpret evidence. Interpreting data does require background theory, and that is all well and good, but in the case of this etiology of intuitions the theory plays a key role in producing the data rather than just revealing their evidential relevance. So, pace Devitt, in cases where a straightforward social etiology of linguistic judgments is true, intuitions—especially those of linguists—should not be used as evidence to confirm or disconfirm theories. A possible way out of this problem is to argue that intuitions are not direct applications of the linguist’s own theory, but the output of unconscious deductions from that theory. Consider an analogy with mathematical reasoning. The mathematician is aware of and has internalized a set of axioms and theorems. In a reflexive, unconscious process, her brain deduces that another theorem, θ, logically follows from those she already knows. The output of this process is a hunch that θ, and the mathematician conjectures that θ. It seems implausible that


linguistic theories have deductive properties of the same sort as mathematical theories, but for the sake of argument let’s suppose they do. Linguists have the best theories, and their intuitions may be reflexive, unconscious deductions from already accepted theorems to new implications of those best theories. This account involves no confirmation bias. Even supposing that this account accurately described any actual metalinguistic intuitions, we would not accept those intuitions as evidence. Imagine what would happen if our fictional mathematician tried to publish a paper arguing for θ, using the fact that she had a hunch that θ as her primary evidence. In mathematics hunches play an important role in guiding inquiry, but the proof is, well . . . in the proof. If linguistic intuitions were analogous to mathematical hunches, the same standards of evidence would apply. Intuitions would be used to create hypotheses; but, if the intuited facts were actual deductions from accepted theory, the linguist would be expected to demonstrate the deduction. So appealing to unconscious deduction does not justify the use of intuitions as evidence. Perhaps Devitt sees subjects not so much as theorists, and more as recording devices. A language user soaks up, like a sponge, facts about how language is used around him. In fact, he soaks up social facts of all sorts: how people dress, what music the in-crowd listens to, the means different people use to avoid colliding on the sidewalk, and so on. When we elicit his intuitions about a social fact, including a fact about language, we are not so much accessing his worked-out theories as we are just giving the sponge a squeeze and letting the absorbed data drip out. Humans, as social animals, reliably track the social facts around them, and linguistic intuitions access the language-specific subset of these facts.
Unfortunately, the human brain, although rather spongey in a literal sense, does not metaphorically soak up the social facts well enough. If it did, we could do not only linguistics but all sorts of social science from the armchair. But we cannot. Sociologists have to gather data and use sophisticated statistical techniques precisely because the human brain’s passive data gathering is not extensive enough, reliable enough, or statistically sophisticated enough to serve as a basis for scientific theorizing about social facts. So, no matter how we cash out a social etiology for linguistic intuitions, they will not serve as the type of reliable data we would accept as scientific evidence. Devitt’s etiological defense, while intriguing, is thus on poor footing.

8.4 The etiological defense: Mentalistic

A more widely accepted account of the etiology of linguistic intuitions attributes their production in large part to the psychological faculty of language, or to linguistic competence. Devitt (2006b; 2006c) dubs this account “Voice of Competence,” and I will follow his nomenclature.

        


Voice of Competence (VoC): Metalinguistic judgments are principally the product of the mental mechanisms responsible for accurate linguistic performance.

VoC has its origins in Chomsky, especially in his (1986b), but philosophers who have articulated versions of the argument more recently include Fitzgerald (2010), Maynes (2012), and Rey (2014b). In these philosophers’ view, linguists are interested in studying a particular cognitive object, the linguistic competence. Linguistic intuitions emerge from a process that involves the competence: the competent speaker considers a sentence, which serves as input to the language faculty. The language faculty then processes the sentence and in a fast, unconscious operation determines whether that sentence is grammatical (or its truth conditions, etc.). This determination is signaled to conscious cognitive systems, yielding an intuition about the sentence. Given the close causal relationship between intuitions and the workings of the language faculty, intuitions must carry information about linguistic competence. Intuitions thus provide evidence for the structure of the faculty of language. Proponents of VoC acknowledge that the information carried by intuition is noisy. In Knowledge of Language, Chomsky (1986b: 36) points out that, “[i]n general, informant judgments do not reflect the structure of the language directly; judgments of acceptability, for example, may fail to provide direct evidence as to grammatical status because of the intrusion of numerous other factors.” Although this noise interferes with the quality of the evidence, noisy evidence is the norm in science—not an insurmountable problem for VoC. If true, the VoC view gives compelling reason to accept intuitions as evidence in linguistics. We have good reason, however, to question whether the language faculty is the primary origin of linguistic intuitions. Defenders of VoC typically appeal to an inference to the best explanation (e.g. Maynes and Gross 2013).
Does it not make sense that the etiology of linguistic judgments is to be found in linguistic competence? But it is unclear why the language faculty would be in the business of issuing metalinguistic judgments. The job of linguistic competence is to produce and interpret utterances, and making metalinguistic details about these utterances accessible to consciousness is neither necessary nor helpful in accomplishing this task. Maynes and Gross (2013: 717) admit that “the capacity for linguistic intuitions is a further, indeed dissociable, capacity that goes beyond the capacity for language production and comprehension.” Devitt agrees, noting that “the data provided by competence are linguistic expressions (and the experiences of using them) not any observational reports about those expressions” (2010a: 836). In other words, if VoC were true, it would be a substantive and surprising fact—the kind of fact we require good empirical evidence to substantiate.


Good empirical evidence of the causal connection between linguistic competence and linguistic intuitions has yet to be produced, as even commentators sympathetic to VoC admit. Fitzgerald (2010: 144),³ for instance, writes that “[w]e do not know how conscious judgments are derived, or the mechanics of the role the linguistic systems play in issuing in these judgments.” And Maynes and Gross (2013: 720) acknowledge that “little is currently known about the causal etiology of linguistic intuitions.” There are plausible speculations, such as Rey’s (2006; 2014b) suggestions that reports of metalinguistic judgments are analogous to reports of the output of perceptual systems, or Gross’ suggestion (this volume) that intuitions of unacceptability are a product of error signals. While these are intriguing possible explanations of the phenomena of linguistic judgment, direct evidence is hard to come by and these possibilities remain speculative. Gross (this volume), for instance, aims not to demonstrate that error signals are the VoC, but merely to “render it a plausible hypothesis worth exploring.” And, while he does so admirably, the role of error signals in linguistic judgments requires more empirical investigation, given that theories of linguistic error monitoring (including many of those referenced by Gross) differ widely on the extent to which error signals are available to conscious introspection (Postma 2000). Similarly, Rey (this volume) acknowledges that he “make[s] no claim that the [perceptual] model is true,” since his aim is merely to present a respectable “how possibly” account of how VoC might work. I shall grant that we should take seriously Rey’s suggestions that reports of, say, unacceptability are analogous to reports of, say, visual content. But the evidence Rey cites, such as garden path effects and structural priming, is weak.
Rey’s account may be more consistent with these phenomena than his reading of Devitt’s account of intuitions, but explaining these phenomena hardly requires the perceptual account, nor is the perceptual account clearly the best explanation. Moreover, I suspect that the perceptual account would have difficulty explaining the effect of literacy on judgment, which I will detail shortly. Let us take stock. The highest number of points in favor of VoC would come from direct empirical evidence that it accurately describes the causal origins of linguistic judgments, but even its defenders acknowledge that we have almost none. No points for VoC from that domain, then, but no points for any alternative, either. VoC could also have scored by a good argument from function, but it turns out that we have little reason to think that the function of the competence includes issuing metalinguistic data. As far as I can tell, that leaves VoC with only one other chance to score: some sort of indirect evidence, such as an inference to

³ Fitzgerald denies that he (and most linguists) subscribes to VoC, but this denial seems to result from misunderstanding what Devitt means by VoC (see Devitt 2010a). He clearly accepts a position that would fall under the broad characterization of VoC I have given here.

        


the best explanation. What could be the principal cause of linguistic intuitions, if not linguistic competence? My answer to this question is that linguistic intuitions could result from what I will call learned theories of language.

Learned Theories of Language (LTL): Metalinguistic intuitions are primarily shaped by knowledge of theories of language.

In presenting this alternative, I mean to undermine the inference to the best explanation in favor of VoC. I use “theory” loosely, and by “learned theory of language” I mean all the purported facts and generalizations about language that an individual learns, except those comprising linguistic competence. The sources of these theories will be diverse. Individuals pick up and internalize purported facts about language from magazine articles and high school grammar textbooks as readily as (and more frequently than) they do from linguistics courses. Much of LTL will derive from an individual’s own experiences of using the language as well as from casual discussions about language, just as we tend to develop our own personal sociological and psychological theories on the basis of our experiences of interacting with other humans. Metalinguistic judgments might then be the result of the application of these learned theories of language to linguistic samples. In this respect, LTL is close to Devitt’s account of linguistic intuitions, but two differences are worth highlighting. First, Devitt emphasizes (1) exposure to language in use and (2) the learning of scientific theories as sources of individual theories of language. LTL, by contrast, sees exposure to usage as only one among a diverse set of sources of folk theories. The empirical evidence cited in what follows, for instance, shows that people with similar exposure to the same language in use make divergent metalinguistic judgments because of some explicitly taught theory. Second, LTL does not regard intuitions as scientific evidence for primary linguistic facts.
The first difference drives the second to some extent, since scientific theories and data of language in use are seemingly trustworthy foundations to build theory on, whereas the broader set of factors emphasized by LTL are not. This is because knowledge of theories of language is distinct from knowledge of language. The former is rarely involved in production and comprehension, but the latter always is. Additionally, knowledge of incorrect theories can lead to false beliefs about language, but knowledge of language itself is factive with regard to language. LTL claims that intuitions stem from knowledge of theories of language; VoC claims that intuitions stem from knowledge of language itself. At first glance, LTL seems at least as plausible a candidate explanation for the etiology of linguistic intuitions as VoC, thus undermining the inference to the best explanation, which might have justified VoC. More importantly, the connection between LTL and metalinguistic judgments is supported by a decent body of


empirical evidence, which gives it an edge over VoC. The primary sort of evidence in favor of LTL is that subjects’ metalinguistic judgments vary in accordance with the sort of learned theories of language they have been exposed to. To see this, we need data that contrast subjects who share competence in the same language but have different learned theories of language. The most ready-to-hand case of contrast—literate versus illiterate adults—is a striking example of how learning a theory of language (as opposed to merely learning a language) shapes linguistic intuitions. Experimental work overwhelmingly shows that literacy, which involves explicitly learning theory about language, significantly shapes linguistic judgments.⁴ A series of studies have found, for instance, that illiterate adults segment words into phonemes differently from literate adults. An illiterate English speaker might not agree with most literate speakers of English that the word “cat” is composed of three sound units. This effect has been documented, to give a few examples, among illiterates in Portuguese (Morais et al. 1979), Brazilian Portuguese (Bertelson et al. 1989), Spanish (Adrian et al. 1995), and Serbo-Croatian (Lukatela et al. 1995). In the last study mentioned here, less literate speakers of Serbo-Croatian agreed with the nearly unanimous judgment of literate speakers only 39.3 percent of the time, which is particularly striking given that there are only a few plausible options in a phoneme counting task. The best explanation for this effect seems to be that learning to read involves acquiring a specific LTL, which in turn drives metalinguistic intuitions. Not only do illiterates segment words into phonemes differently, but they also differ from literates in how they segment sentences into words. 
⁴ Kurvers et al. (2006) give a good review of this literature, one that includes several of the examples I make use of here.

Gombert (1994) found that, when given sentences to analyze, adult illiterates did not identify the same word boundaries as literates (literates get the right answer 83 percent of the time; illiterates only 25 percent). He also found evidence that syntactic rather than phonological factors drove the difference. Tellingly, the illiterates could be brought to perform the task in exactly the same way as literates with task-specific training: explicit teaching of a theory of language nullified the metalinguistic difference. Kolinsky et al. (1987) found a similar effect of literacy on judgments of relative word length among less literate speakers of French whose performance on the task was at chance level. Again, the theory of language that literates pick up while learning to read seems to shape their metalinguistic judgments. Of course, what we really need in order to undermine VoC is evidence that grammaticality judgments and semantic intuitions are driven by LTL. Fewer studies on these types of metalinguistic task have been published, but the ones that have support LTL better than VoC. Luria’s (1976) famous studies on syllogistic reasoning among illiterates, for example, demonstrate that learning the theory of language involved in literacy changes speakers’ intuitions about

        


semantic entailment relationships between sentences. More recently, Kurvers et al. (2006) show, among speakers of several North African languages, a literacy effect on semantic judgments. For a word-referent discrimination task, literates performed at 43 percent accuracy, while less literate participants were at 17 percent. For a syllogistic inference task, literates were at 67 percent, by comparison with the 18.4 percent found among the less literate adults. As for grammaticality judgments, Karanth and Suchitra (1993) show that, among adult speakers of Hindi, literates and illiterates reach significantly different ones. Since literate and illiterate speakers of a language are equally competent in that language, the etiology of their different judgments cannot reside in competence. It must be LTL that drives the difference. Research from developmental psychology provides evidence for the same fact from a different domain. De Villiers and De Villiers (1972, 1974) show that children become grammatically competent well before they make accurate (or at least literate adult-seeming) acceptability judgments, which suggests that acceptability judgments do not develop in tandem with grammatical competence itself. In fact, the same effects of the theory of language acquired while learning to read are observable in studies that compare preliterate with newly literate children. These studies cover the same sort of task, so I will not discuss them in detail; examples of the ongoing research program include Hakes (1980), Ryan and Ledger (1984), Adams (1990), Sulzby and Teale (1991), Tolchinsky (2004), and Ramachandra and Karanth (2007). The literatures on adult illiterates and on preliterate children are explicitly brought together in Kurvers et al. (2006).
Drawing on adult and child subjects competent in a variety of languages (including Moroccan Arabic, Rif Berber, Turkish, and Somali), Kurvers’ team replicated the effects found in the aforementioned experiments on phonological, lexical, and semantic intuitions. They document significant differences between adult literates and adult illiterates on eight different types of metalinguistic judgments, agreement between adult illiterates and preliterate children on five of the eight, and disagreement between literates and preliterate children on the same five. This result shows two things. First, the well-documented difference between the metalinguistic intuitions of preliterate and newly literate children cannot be accounted for by appealing to the natural development of linguistic competence, since it mirrors the effects of literacy on adults. Second, it shows that the LTL acquired while learning to read has a significant effect across multiple domains of linguistic intuition. The size of this effect is such that we cannot write it off as mere “noise” overlaying a VoC. In short, decades of research on the relationship between literacy and metalinguistic judgment undermine the claim that linguistic intuition is the VoC rather than the voice of LTL. We thus have an empirical basis for trusting LTL over VoC. Culbertson and Gross (2009), however, provide data that challenge LTL. They find that lay and expert linguistic intuitions differ, which seems to support LTL; but there is a hitch.


LTL would seem to predict that the dividing line between lay and expert is exposure to syntactic theory, but Culbertson and Gross identify exposure to any sort of cognitive science as the major dividing line. Their experiment involved four groups (syntacticians, subjects who had taken at least one syntax course, subjects who had taken at least one cognitive science course, and subjects with no cognitive science experience), and the last group shows traits significantly different from the first three. From this datum, they drew the conclusion that differences in intuition are differences “between subjects with and without minimally sufficient task-specific knowledge” (Culbertson and Gross 2009: 722). In other words, the factor producing different responses to sample sentences might not be the possession of different learned theories of language, but the fact that lay subjects do not understand how psychological experimentation works, and hence perform the task incorrectly (or at least differently). If this alternative explanation for the data is correct, it undermines the empirical support for LTL and thus weakens the challenge that LTL poses to VoC. A closer look at Culbertson and Gross’ data, however, belies the threat to LTL. We can accept that some of the variability in intuitions is explained as a result of subjects’ misapprehending the task. In particular, misapprehension might partially explain why, of the four groups tested, only the one with no exposure to cognitive science shows large within-group variability. Of course, this within-group variability could also be explained by the possession of different folk theories of language, and it is consistent with LTL either way. The other primary datum driving Culbertson and Gross’ interpretation is that the three groups with exposure to cognitive science correlated more closely with each other than with the lay group.
LTL is consistent with this datum as well, since the inter-group correlations, while strong, are not perfect, and intra-group correlations are also strong. If LTL is true, we would expect to see this. Since members of each group possess similar learned theories of language, we would expect strong intra-group correlations, and since the theories of language taught in syntax and introductory cognitive science courses are among those held by syntacticians, we would expect strong but not near perfect inter-group correlations. We would also expect that the cognitive science group would correlate more closely with the lay group than the syntax-specific groups would, and this is the case in Culbertson and Gross’ data. So, on close examination, Culbertson and Gross (2009) supports the claim that the theories of language a speaker knows affect her linguistic intuitions. Let me take stock. I have juxtaposed two hypotheses about the etiology of metalinguistic intuitions. The VoC account claims that intuitions are produced by the same mental processes that produce linguistic performance. The LTL account, on the other hand, holds that intuitions are unconsciously shaped by knowledge of theories of language, not by knowledge of language itself. We have noted the absence of direct empirical evidence for VoC, and adduced two types of empirical evidence in favor of LTL: learning to read and learning linguistic theory both

        


change a speaker’s metalinguistic judgments. This evidence undermines the inference to the best explanation in favor of VoC, by showing a disconnect between competence and judgment. Additionally, it highlights how LTL can explain judgment data that VoC writes off as noise. Since writing off a phenomenon as noise is to place it beyond the explanatory scope of the theory, this means that LTL explains the data better. Therefore, given our current state of knowledge, LTL is the more plausible explanation of where linguistic intuitions come from. The evidential worth of linguistic intuitions thus cannot be vindicated on the basis of their causal connection to the language faculty.

8.5 Reliability

Grant me, for the sake of argument, that fruitfulness is not a sufficient guarantor of epistemic quality. Grant me as well that what we know about the etiology of linguistic intuitions fails to assure us of their scientific relevance. Neither concession matters if it can be shown that intuitions tend to be right. In other words, if intuition-data are informative and highly reliable, then they are good evidence despite worries about their origins. To show that linguistic intuitions are reliable, we need to show that they tend to agree with sources of data known to be reliable on etiological grounds, such as usage data and processing data. Obviously, experimental and observational methods of gathering usage and processing data can be applied poorly, but when applied correctly these methods are non-controversial sources of reliable evidence. We can determine whether intuitions mesh with the accepted evidence by exploring whether metalinguistic behavior (intuition) meshes with the linguistic behavior analyzed through experiment and observation. The mesh, I claim, is relatively poor. Metalinguistic judgments frequently entail predictions contradicted by experiments or usage data. This is necessarily so, given the wide variability in linguistic intuitions among competent speakers of the same language. If LTL is true, and we have seen good reason to think that it is, we would not expect intuitions to coincide with usage.⁵ Learning to read does not substantially change speakers’ phonology, but it does change their meta-phonological judgments. Learning some syntactic theory does not substantially change the structure of speakers’ utterances, but it does change their grammaticality judgments. It follows that usage and metalinguistic data from a population will not coincide.
⁵ VoC also predicts that usage and judgment will diverge on account of “noise,” but the type of divergence predicted is lesser in degree (if it were not, the noise would become signal) and less consistent in kind. Additionally, failures of coincidence between usage and judgment tend to confirm LTL more than VoC, because LTL offers an explanation of those failures, while VoC relegates them to the black box of “noise.”

Which theories of language a speaker has internalized is not the only cause of between-subject metalinguistic variation, of course. Schütze (1996: Ch. 4) reviews many contributing factors, but the short version is that metalinguistic variation outstrips variation in linguistic performance nearly every time the issue is studied. No one has attempted the probably impossible task of pinning down exactly how frequently metalinguistic judgments disagree with patterns of usage, but linguists who go looking for examples have no trouble finding them, which suggests that divergence is common. Labov (1996) runs through a plethora of examples, as do Wasow and Arnold (2005) and Gibson and Fedorenko (2013). A handful of examples, even if they are broadly representative, do not prove that linguistic intuitions are always unreliable. They do exemplify a couple of important points, however. First, intuitions can frequently be unreliable, and we cannot predict beforehand which intuitions are good evidence and which are not. Second, if intuitions and usage data conflict, we accept the usage data and discard the intuition. So metalinguistic judgments are at least frequently unreliable (much more unreliable than usage data), but this is consistent with their being evidence for language, albeit weak evidence. But weak evidence frequently fails to meet the general standards of scientific evidence. Consider, for example, anecdotes from sources we believe to be honest. Suppose that a pharmaceutical researcher is informed by a trusted relative that a certain folk remedy is effective. Testimony from this relative is probably evidence in some strict sense, since it is slightly more likely that the remedy is effective given that testimony than otherwise. Nevertheless, the researcher would not even consider adding her relative’s testimony to a scientific article, grant submission, or application for FDA approval. Anecdotal evidence is not accepted as scientific evidence.
Similarly, appeals to common sense or popular belief, while sometimes evidence in the strict sense, are not acceptable scientific evidence. Nearly half of Iceland inclines toward belief in elves (Sontag 2007), and it is probably the case that elves are more likely to exist given this popular belief than otherwise, but serious science does not accept such popular beliefs as evidence. Intuitions strike me as of a kind with anecdotes and popular belief; so, even if they are a weak sort of evidence, they probably belong to the class of evidence that is proscribed in scientific practice.⁶

8.6 The non-evidential role of linguistic intuitions

My argument so far might seem excessively radical. After all, I appear to reject the methodology that is standard in much of linguistics. But this is not quite the upshot

⁶ Santana (2018) offers a more detailed defense of why we should exclude weak evidence from scientific reasoning.

        


of my argument, since, even if we reject the idea that intuitions are evidence, we can accept that they perform a useful and methodologically sound function. Consider the following sentences, which are merely the first sample sentences from a few of the authors sampled in Sprouse et al. (2013):

(1) *Sarah saw pictures of.
(2) a. *Was kissed John.
    b. John was kissed.
(3) a. *John tried himself to win.
    b. John tried to win.

My contention is that we are not really using our intuitions as evidence that (1) is unacceptable or that (3b) is acceptable, because we do not need any evidence for either claim. That constructions such as (1) are unacceptable is not in dispute. The background theory shared by syntacticians already entails that “Sarah saw pictures of” is not a grammatical sentence in English. The linguist is not trying to prove that “John was kissed” is better English than “Was kissed John.” No proof is needed. Instead, the authors are simply using the sentences to appeal to a part of the shared background theory, which they will build on in further argumentation. Many, perhaps most, supposedly evidential uses of intuitions in linguistics are actually just appeals to some common ground of the background theory; they do not intend to prove that a sentence is acceptable or true in a situation. The uncontested claims of a background theory need not necessarily be true, only shared. In order for a field to make scientific progress, scientists must delimit a subset of questions as the questions under examination, and assume a set of fixed background commitments. If we did not do this, we could never test a hypothesis against the data, because we could explain away any disconfirming result by rejecting a connected claim elsewhere in the web of scientific claims (Duhem 1954; Quine 1951). Over time, a question under examination may become settled and move to the realm of shared background theory, and occasionally the reverse may occur.
Early Newtonian physicists did not treat the question of whether physical space is Euclidean as a question under examination. They assumed the Euclidean nature of space as a given. Nineteenth- and early twentieth-century physicists, however, found reason to raise the question again, and found the claim, which had previously been part of the background theory, to be wrong. As examples like this show, elements of the scientific common ground need to be broadly shared, but not necessarily because we have good reason to believe them. The common background allows for progress in science despite much of it being false. So, for linguistic intuitions to play the role of identifying the common ground, they need only to be shared, for whatever reason.


Most linguistic intuitions, at least in syntax, are actually appeals to the sort of facts that either we all agree on or are already entailed by existing shared theory.⁷ This has several implications both for my broader argument and for the use of intuitions in linguistics. Most importantly, it means that we can reject the evidential status of linguistic intuitions without also rejecting wholesale existing work in syntax. The grammaticality of sample sentences like (2b) is not generally in question, even when they appear as part of an argument in generative syntax; the point of the syntax paper is not to prove that (2b) is grammatical, but to prove a theoretical claim for which the grammaticality of (2b) is merely a premise. A practicing syntactician usually argues for an extension or revision to existing theory. In doing so, she is licensed to take any of the generally accepted contents of the theory for granted. Appeals to obviously grammatical or ungrammatical sentences can be understood as appeals to accepted theory. If a syntactician uses (2b) as an example of a grammatical sentence, she is licensed to do so, because according to accepted theory (2b) is grammatical. Her intuition is no more than a fast, reflexive application of shared background theory to the sentence. So, insofar as work in syntax generally uses intuitions in this benign, non-evidential way, the rejection of intuitions as evidence gives us no reason to call into question existing syntactic theory. Likewise for semantics, though my suspicion is that intuitions are less frequently innocuous in semantics than in syntax. On similar grounds, if linguistic intuitions about clear cases such as (2b) are not really playing an evidential role, then we can avoid making the absurdly onerous recommendation that linguists check every sample sentence they use against a corpus analysis or a psycholinguistic experiment. 
We might worry that any calls for a revision of scientific practice must be moderate, lest the costs of revision be too burdensome. Calling for a group of linguists to abandon their practice of using sample sentences backed by intuitions alone might be too burdensome by these lights. But, insofar as the grammaticality of those sentences is not really in question, my account does not call for such a radical revision to practice. Additionally, even if we proscribe the use of intuitions as evidence, intuitions could still play fruitful roles in hypothesis generation and speculative theory building, and much of what generative syntacticians and formal semanticists do arguably falls under one or the other of these activities. And the fact that many linguists, even those sympathetic to generative syntax and formal semantics, already employ various sorts of usage and processing data as their primary sources of evidence further mitigates the methodological impact of my position. Taken together, these considerations show that accepting my conclusions would not place undue burdens on the discipline.

⁷ Maynes (2017) makes a similar argument about “intuitions” in philosophy, suggesting that they are appeals to shared assumptions in the common ground. My claim about linguistic intuitions is slightly narrower, since shared background theory might not encompass all shared assumptions. In fact, identifying problematic shared assumptions that are not part of shared theory is an important part of scientific progress.

        


Finally, this account of the non-evidential role of intuitions has an implication for when the use of intuitions is sound practice. Intuitions, I have suggested, are often mere appeals to the common ground of the background theory. Although in order to be part of the background theory in this sense a claim need not be true, it does need to be shared. If the scientists in a field do not agree that a claim is true, or is at least a good working assumption, then treating it as an element in the common ground of some background theory will lead to misunderstandings and will cause researchers to talk past each other. In cases where there is substantive disagreement in the field, we should therefore try to avoid treating a claim as a part of background theory, and assume falsely that it requires no argument or evidential support. Appeals to intuition, then, are justified only when there is broad agreement on the acceptability or truth conditions of the sentence(s) in question. It is likely that, although there is broad agreement on simple sentences containing common constructions, some of the more complex and unusual example sentences constructed by linguists do not actually elicit identical judgments from all of them, and in these cases it would be inappropriate to take the author’s judgment to be accessing the common ground of background beliefs.

8.7 Conclusion

We began with a trilemma:

(1) If a discipline relies on intuition as a primary source of evidence, it is an a priori discipline and not a science.
(2) Some subfields of linguistics, notably formal syntax and formal semantics, rely on intuition as a primary source of evidence.
(3) Linguistics is a science and not an a priori discipline.

At first glance, linguistic intuitions seem to provide some hope for rejecting (1). If our linguistic intuitions are no more than reflexively accessed reports of gathered data, or if they are internal reports of the workings of the language faculty, then they are empirical data about language. Unfortunately, in neither of these cases do we have good evidence that intuitions have an empirical relationship to language to the degree necessary for them to function as scientific evidence. Given this, to escape the trilemma, we must reject (2). Intuitions frequently play a non-evidential role in linguistics, functioning as shorthand for an appeal to shared assumptions or an established theory. While this allows us to preserve the status of linguistics as an empirical science without radically revising its methodology, it should encourage us to exercise caution in relying on complex or unusual example sentences, judgments about which might not belong in the shared scientific background.

PART II

EXPERIMENTS IN SYNTAX

9 The relevance of introspective data

Frederick J. Newmeyer

9.1 Introduction

Overwhelmingly, generative grammarians have used introspective data in formulating their theories. This is the case for the reasons outlined in Schütze (1996): these data provide material not obtainable from spontaneous speech or the recorded corpora; they supply negative information, that is, information about what is not possible for the speaker; they allow for the easy removal of irrelevant data, such as slips of the tongue, false starts, and so on; and, more controversially, they permit the researcher to turn away from the communicative function of language and thereby to focus on the mental grammar as a structural system. Furthermore, it is sometimes claimed that the factors affecting judgments tend to be less mysterious than those affecting use and that, by providing an alternative path to the grammar from language use, we ensure that “we have a basis on which to search for the common core that underlies both kinds of behaviour” (Schütze 1996: 180). There have been many different critiques of introspective data and it is not my intention to address them all. The most pervasive is that an introspection is an experiment, though a totally uncontrolled one, and therefore subject to the (normally unconscious) theoretical biases of the investigator. Therefore, the argument goes, introspective judgments are too unreliable to serve as a database for linguistic theorizing. I do not wish to deny that exclusive reliance on introspective judgments has had negative effects, though how often theoretical proposals have been led astray by the use of such judgments is an open question (see Cowart 1997 for discussion). The two major (albeit not mutually exclusive) alternatives to introspection are experimental evidence, which is favored mainly by psychologically oriented linguists, and corpus-based evidence, which is favored mainly by functionally oriented, variation-oriented, and natural language processing- (NLP-)oriented linguists.
In a nutshell, the latter alternative holds that we speak in order to communicate, and so it follows that only communication-based corpora are valid as evidence when probing the nature of grammar. As Michael Tomasello put it,

Frederick J. Newmeyer, The relevance of introspective data. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Frederick J. Newmeyer. DOI: 10.1093/oso/9780198840558.003.0009


The data in generative grammar analyses are almost always disembodied sentences that analysts have made up ad hoc . . . rather than utterances produced by real people in real discourse situations . . . [Only the focus on] naturally occurring discourse [has the potential to lead to] descriptions and explanations of linguistic phenomena that are psychologically plausible. (Tomasello 1998: xiii).

In other words, nothing good theoretically can come out of a database that is essentially rotten. What critics such as Tomasello advocate is a near exclusive focus on natural conversation. Since, at least as they see things, conversation is the principal function of language, it stands to reason, they would say, that the properties of grammars should reflect the properties of conversation. On this view, the complex abstract structures of generative grammar are an artifact of the appeal to introspectively derived example sentences that nobody would ever actually use. The conclusion that these critics draw (see again Tomasello 1998) is that, if one focuses on naturally occurring conversation, grammar will turn out to be a collection of stock memorized phrases (“formulas” or “fragments,” as they are often called) and very simple constructions. At the very best, comments like those of Tomasello are non-sequiturs. A much older parallel criticism, by Clark and Haviland, questions the use of introspective judgments as data on the grounds that “[w]e do not speak in order to be grammatical; we speak in order to convey meaning” (Clark and Haviland 1974: 116). But Wexler and Culicover (1980) replied that, even if this is true, no logical relation can be established between the purpose of our speech and the kind of data on which to base linguistic theories.

Function can also be established in biology. For example, the function of certain molecules is associated with genetic transmission. But this function once again does not dictate choice of data. We do not say that the biologist’s enterprise is odd because he uses X-ray photographs although it is not the purpose or function of the relevant molecules to provide photographs. (Wexler and Culicover 1980: 395)

Turning to experimental data, I have little to say about them, except to point out that corpus-oriented linguists tend to reject them as well:

The constructed sentences used in many controlled psycholinguistic experiments are themselves highly artificial, lacking discourse cohesion and subject to assumptions about default referents. (Bresnan 2007: 297)

By and large, generative grammarians have simply ignored the critique of introspective judgments that comes from usage-oriented linguists. And when we have talked about data, it has been more a matter of defending the reliability and consistency of introspective judgments than a matter of

    


challenging the idea that a focus on conversation as opposed to introspection leads to a simplified view of grammar. In this chapter, however, I argue that introspective data do not lead to grammars that are markedly different from those whose database is naturally occurring conversation. It follows, then, that introspective data are no less relevant than conversational data to the construction of an adequate grammatical theory. For all the existing hostility to introspective data, not that many linguists have actually taken the trouble to argue that conversational data lead to wildly different grammars from those based on introspective data. Among the few who have, at least as far as syntax is concerned, are Sandra Thompson (with her past and current students) and Paul Hopper. Hence, in the following section, I discuss and critique a paper jointly written by Thompson and Hopper (2001) that argues that conversational data refute a key hypothesis of generative grammar that was arrived at by using introspective data. The following sections take on some general issues about the value of introspection vis-à-vis conversational corpora. My database is the Fisher English Training Transcripts, to which I will refer as the “Fisher corpora.” The transcripts comprise a 170 MB corpus of over 11,000 complete telephone conversations, each lasting up to ten minutes; in total, the corpus contains over 6,700,000 words. All the examples in this chapter are drawn from the Fisher corpora, unless noted otherwise.

9.2 A critique of “Transitivity, clause structure, and argument structure: Evidence from conversation”

One construct that is central to virtually every formal approach is “argument structure,” roughly the sorts of dependants that can occur with particular heads. In fact, in most formal theories, syntactic structure, in one way or another, is a projection of argument structure. Therefore it is not surprising that Thompson and Hopper would want to zero in on argument structure as part of their attack on formal linguistics. They begin by criticizing the use of introspective judgments about verbs such as spray, load, cover, pour, and so on in the argument structure literature (see Levin 1993 and the references cited there). They do not deny that spray and so on have pretty well-defined complement types. But their feeling is that, since these verbs are so rare in conversational speech, they are simply irrelevant to deep grammatical analysis:

[T]he apparent importance of [argument structure] may be an artifact of working with idealized data. Discussions for argument structure have to date been based on fabricated examples rather than on corpora of ordinary everyday talk. (Thompson and Hopper 2001: 40)


Their position is that, to fully understand how grammar works, one needs to focus on grammatical behavior of the most common verbs, which, they write, leads to the conclusion that argument structure is a theoretically useless concept. Why? Because, according to Thompson and Hopper (2001: 49), “the more frequent a verb is, the less likely it is to have any fixed number of ‘argument structures.’ ” If that were true, then we might have good reason to rethink the importance of argument structure, as well as the introspective data that led us to regard argument structure as important. Space limitations do not permit a discussion of Thompson and Hopper’s entire paper. I limit my focus to one key part of it, namely their discussion and analysis of the verb get. Get is the most frequently used verb in English conversation, after have and be. They write: “Get is a prime example of a verb with no easily imagined argument structures, precisely because it is used in so many lexicalized ‘dispersed’ predicates and specific constructions” (Thompson and Hopper 2001: 49). An examination of the Fisher corpora shows nothing like that to be true. I looked at over 1,000 instances of get, which I divided up proportionally to the frequency of their morphological variants (get, gets, getting, got, and gotten). My conclusion was that, pace Thompson and Hopper (2001: 49), get does indeed have “easily imagined argument structures.” Get certainly has more argument structure possibilities than the average verb, but there is nothing difficult to imagine about them. The breakdown is in (1): (1)

Subcategorization frames for the verb get:

Frame           Instances      %
___NP                 511   49.1
___AP                 329   31.6
___PP                  88    8.5
___Past Part           44    4.2
___NP XP               42    4.0
___to VP               19    1.8
___other                7    0.8
TOTAL                1040
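The coverage figures that the text draws from (1) can be rechecked from the raw counts. The following is a minimal sketch: the counts are copied from (1), while the dictionary keys and the helper name `share` are illustrative choices of this sketch and not part of the original analysis.

```python
# Raw counts copied from table (1); the keys are plain-text stand-ins
# for the subcategorization frames.
counts = {
    "NP": 511,
    "AP": 329,
    "PP": 88,
    "Past Part": 44,
    "NP XP": 42,
    "to VP": 19,
    "other": 7,
}

total = sum(counts.values())


def share(*frames):
    """Return the percentage of all tokens covered by the given frames."""
    return 100 * sum(counts[f] for f in frames) / total


# Share of get tokens followed by a bare NP or a bare AP.
np_ap_share = share("NP", "AP")

# Share covered by the six named frames (everything except "other").
named_share = share(*(f for f in counts if f != "other"))

print(f"total tokens: {total}")
print(f"bare NP or AP: {np_ap_share:.1f}%")
print(f"six named frames: {named_share:.1f}%")
```

Run on these counts, the sketch should bear out the two thresholds cited in the discussion of (1): more than 80 percent of tokens before a bare NP or AP, and more than 95 percent within the named frames.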

As can be seen, over 95 percent of the instances of get occurred in only seven subcategorization frames, and over 80 percent of its occurrences were before a bare NP or a bare AP. The seven leftover cases were hard for me to classify. For example, get occurs before home and here: (2)

a ___home   4
b ___here   1

The categorial status of home or here is not obvious. There was also an example of get with a quantifier phrase (QP) and one before a wh-complement:

     (3)

153

a B: how old how old um how much did you get when you start b B: i had gotten where i was taking a i think it was uh some some brand

In (3a), the QP how much is probably best analyzed as an AP, though that is open to question. Example (3b) was my only example of get before a subordinate clause. What is interesting is where one never gets get. There is nothing in my corpus like the subcategorizations in (4):¹ (4)

a ___#                 (*I got)
b ___ADV               (*I got easily; *tickets for the concert get easily)
c ___(that) S          (*I got (that) he finally believed me)
d ___for NP to VP      (*I got for him to believe me)
e ___NP’s Ving         (*I got Mary’s helping)

In other words, get occurs in more argument structures than spray or load, but that is hardly an interesting fact, in my opinion. Finally, it is not the case that the uses of get are more construction-specific than those of other verbs. Get occurred before a past participle forty-four times in my sample. Twenty-four different participles were employed: accepted, affected, arrested, asked, called, called for, carded, cashed, exposed, hit, ignored, interrupted, married, paid, past, plagued, raided, reassigned, rejected, set up, stationed, stored, stuck, treated. Given the space, I could provide similar arguments for the next most common verbs in conversation: say, go, know, think, see, come, want, and mean. There is nothing about their behavior that challenges traditional views of argument structure based on introspective judgments.
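The variety of participles after get can be made concrete with a simple type count. This is only an illustrative sketch: the list is copied from the text and the token count of forty-four is the figure reported there, but the types-per-token ratio is not a measure the chapter itself computes, and the variable names are assumptions of this sketch.

```python
# Participles listed in the text as occurring after get.
participles = [
    "accepted", "affected", "arrested", "asked", "called", "called for",
    "carded", "cashed", "exposed", "hit", "ignored", "interrupted",
    "married", "paid", "past", "plagued", "raided", "reassigned",
    "rejected", "set up", "stationed", "stored", "stuck", "treated",
]

# Number of get + past participle tokens reported for the sample.
tokens = 44

# Number of distinct participle types.
types = len(set(participles))

# Types per token: values near 1 mean almost every token is a new
# participle; values near 0 mean a few fixed expressions dominate.
types_per_token = types / tokens

print(f"{types} types over {tokens} tokens ({types_per_token:.2f} types per token)")
```

A ratio above one half is what one would expect if get combines freely with participles, as opposed to being confined to a handful of lexicalized expressions.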

9.3 The grammatical complexity of everyday speech

Conversational speech reveals a depth of syntactic knowledge that is absolutely stunning and that supports the standard introspective judgments we find in the literature. Consider, for example, a few cases of long-distance wh-movement: (5)

a B: so what do you think that um we should do um as far as w- we’re standing right now with our position
b B: getting back to this subject where do you want to go with it
c A: when do you expect to get together

¹ An anonymous referee has found the following sentences in the Corpus of Contemporary American English (COCA) database: I get that he wants me to be happy; I get that he has other things going on; and I get that he has a big portfolio. All these appear to illustrate (4c). They all involve get in the sense of “understand” or “appreciate,” though they do indeed illustrate that get can take a that-complementizer.


Along the same lines, speakers and hearers can link embedded gaps in relative clause constructions to their antecedents: (6)

a B: you know when i move away and get the things that i want to have and retire early and enjoy you know what i mean
b A: actually following the rules that they need to be following they are doing things that they shouldn’t be doing
c B: that right if i had time to cook the things that i like to cook then it would be in home

To produce and comprehend utterances like these, the language user has to hold in mental storage a position for an unexpressed direct object in a different clause and to link a fronted wh-element or lexical antecedent to that position. It is not a matter here of “fragments” or “formulas,” but rather of a sophisticated engine representing grammatical knowledge. Anaphoric relations are among the most difficult grammatical phenomena to extract from corpora, given the difficulty of formulating the right search criteria. But persistence provides some interesting (and, I would say, surprising) results. For example, one often hears that cataphors (i.e. backwards anaphors) only occur in linguists’ introspective judgments or possibly in educated speech or writing. But in fact they are found in conversation, and both in pronominal and elliptical form: (7)

a A: when their sons die with with money he rewards the parents and and the parents are quite happy about it
b A: um overseas we i don’t know why we don’t but everybody has flags here we have huge flags on the street

There are also examples of both forward and backward sluicing in conversation: (8)

a A: i know i know i’m going to get married some time but i don’t know when
b B: we’re supposed to just give one another’s opinion about uh if you like eating at home or if you like eating out more and i guess why

(9)

a A: i just i don’t know why but i don’t usually get sick in the winter time
b B: i don’t know why but for whatever reason every night the cat comes and like meow outside our door

After decades of research, we still do not know what the conditions are for appropriate cataphors and sluices. Nevertheless, speakers handle the relevant structures without effort.

    


The question is why so many corpus-oriented linguists consistently underappreciate the value of introspective data in probing the resources of ordinary speakers. I suspect that a big part of the problem is simply an effect of the small size of many of the corpora that are used. For example, Thompson and Hopper (2001) is based on only 446 clauses from three face-to-face multiparty conversations. And the problem is not confined to this study. One of the major book-length studies of conversation, Miller and Weinert’s (1998) Spontaneous Spoken Language, limits itself to an English corpus of only 50,000 words. It is not surprising, then, that all the constructions in (10) are absent from its corpus: (10)

Constructions missing from Miller and Weinert (1998):
a adverbial clauses of concession introduced by although
b adverbial clauses of reason introduced by since
c gapping
d conditional clauses signaled by subject–auxiliary inversion
e accusative–infinitive sequences (“exceptional case marking”)
f gerunds with possessive subjects
g gerunds with an auxiliary
h initial participial clauses preceding a main clause
i infinitives in subject position
j infinitives with auxiliaries

Yet all these occur in the Fisher corpora: (11)

a [adverbial clauses of concession introduced by although] B: although they may not agree with war then they are going to support the u._s. government and they’re going to support the u._s. soldiers
b [adverbial clauses of reason introduced by since] A: since i’ve never been much of a power grabber myself i don’t really understand people that that are
c [gapping] A: but at the same time you might not have not being in that situation might have had gave you a different outlook on the world on the world and life and such and and me the same
d [conditional clauses signaled by subject–auxiliary inversion] A: had i known then what i know now
e [accusative–infinitive sequences (“exceptional case marking”)] A: um i consider myself to be a pretty open minded person
f [gerunds with possessive subjects] A: you know going back to his firing of his economic advisors
g [gerunds with an auxiliary] B: i was kinda surprised they’d i could i could fit in because of my having been born in england i i i thought it would just be americans
h [initial participial clauses preceding a main clause] A: hoping i never get that far i just wanna make sure that i don’t end up on every committee
i [infinitives in subject position] A: to yeah to to get to where they need to do so sunday would kinda be like the first day of the festivities
j [infinitives with auxiliaries] A: yeah you know i wouldn’t have wanted to to have brought -em up in a in a Christian controlled

The absence of all these ordinary English constructions from Miller and Weinert’s small database would be inconsequential, if it had not led them to the inevitable conclusions about the bankruptcy of formal linguistic theory. They write that “[t]he properties and constraints established over the past thirty years by Chomskyans [are based on sentences that] occur neither in speech nor in writing [or only] occur in writing” (Miller and Weinert 1998: 379). And, on the basis of this mistake, they go on to question whether English grammar could be anything like what formal linguists propose. But when one considers that the average speaker utters about 16,000 words per day (Mehl et al. 2007), it is clear that one cannot conclude very much at all about grammatical knowledge from a corpus of 50,000 words. As we have seen, introspection supposedly leads to sentences that are confined to a large degree to literary genres. But the differences between spontaneous conversation and what is found in literature are almost always quantitative rather than qualitative. Consider Douglas Biber’s (1988) Variation across Speech and Writing. Biber takes sixty-seven grammatical features of English, some of them pretty exotic, and calculates their frequency in twenty-three different genres, some spoken and some written. Only three of these features occurred in face-to-face conversations at a frequency of less than 0.1 times per thousand words: (12)

Rare features in the Biber (1988) corpus:
a present participial clauses (e.g. stuffing his mouth with cookies, Joe ran out the door)
b past participial clauses (e.g. built in a single week, the house would stand for fifty years)
c split infinitives (e.g. he wants to convincingly prove that)

It is hard for me to imagine that any of these sentence types would be rejected by English speakers’ introspective judgments. And all three features are rare in academic prose as well: 1.3, 0.4, and 0.0 times per thousand words respectively in that genre. Nevertheless, it was not difficult to find examples of all three in the Fisher corpora:

    

Table 9.1. The most frequent grammatical features in two English genres

RANK   FACE-TO-FACE CONVERSATIONS   ACADEMIC PROSE
1      nouns                        nouns
2      present tense                prepositions
3      adverbs                      attributive adjectives
4      prepositions                 present tense
5      first-person pronouns        adverbs
6      contractions                 type–token ratio
7      type–token ratio             nominalizations
8      attributive adjectives       BE as main verb
9      BE as main verb              past tense
10     past tense                   agentless passive

Note: type–token ratio = the number of different lexical items in a text, as a percentage.

(13)
a B: having angst i don’t have any like firsthand experience with separations or anything cause i mean
b A: but compared to the comedies now it it’s tame
c B: right and they tried they tried to really make it so people wouldn’t get along

Table 9.1 gives the ten most frequent grammatical features in the two genres. The only features that made the top ten in face-to-face conversations, but not in academic prose, were (unsurprisingly) first-person pronouns and contractions. Facts like these suggest that the gulf between introspective data and data drawn from corpora, whether conversational or scholarly, is insignificant.
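The comparison of the two columns of Table 9.1 amounts to a set difference, which can be sketched as follows. The two lists are transcribed from Table 9.1 in rank order; the variable names are illustrative assumptions of this sketch.

```python
# Top-ten features per genre, in rank order, transcribed from Table 9.1.
conversation = [
    "nouns", "present tense", "adverbs", "prepositions",
    "first-person pronouns", "contractions", "type-token ratio",
    "attributive adjectives", "BE as main verb", "past tense",
]
academic = [
    "nouns", "prepositions", "attributive adjectives", "present tense",
    "adverbs", "type-token ratio", "nominalizations", "BE as main verb",
    "past tense", "agentless passive",
]

# Features in one top ten but not the other (rank order preserved).
only_conversation = [f for f in conversation if f not in academic]
only_academic = [f for f in academic if f not in conversation]

# Features shared by both top-ten lists.
shared = [f for f in conversation if f in academic]

print("only in conversation:", only_conversation)
print("only in academic prose:", only_academic)
print("shared:", len(shared), "of 10")
```

On these lists, eight of the ten features are shared, which is the overlap the paragraph above relies on.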

9.4 Some general issues regarding conversational corpora

The remainder of this chapter is devoted to some general issues regarding conversational corpora. Section 9.4.1 stresses the ways in which they are of great value, while Section 9.4.2 points to some of their limitations.

9.4.1 Some positive features of conversational corpora

There is clearly no substitute for conversational corpora, if one’s interest is the study of the structure of conversations and broader discourses. This point should be uncontroversial. But such corpora are also quite useful to grammatical theorists. We have just seen how they can be used to rebut extravagant claims about what is supposedly not found in ordinary usage. And that works both ways. Syntacticians using only introspective data tend to be much too quick to label a


sentence type “ungrammatical” when it is easy to find evidence that that sentence type is commonly used. Consider an example from Bresnan, Cueni, Nikitina, and Baayen (2007). Most treatments of the dative alternation say that the verb give appears with a prepositional object only if there is movement to a goal. So supposedly the (a) sentences of (14–15) are grammatical and the (b) sentences are ungrammatical: (14)

a The movie gave me the creeps.
b (*)The movie gave the creeps to me.

(15)
a The lighting here gives me a headache.
b (*)The lighting here gives a headache to me.

But, as Bresnan et al. demonstrate on the basis of conversational evidence, sentences like those in (b) are not infrequent: (16)

a This life-sized prop will give the creeps to just about anyone!
b That smell would give a headache to the most athletic constitution.

9.4.2 Some negative features of conversational corpora

The following subsections outline briefly the limitations of conversational corpora.

The first limitation of conversational corpora: They do not provide ungrammatical sentences

No corpus can provide sentences that do not occur. Yet ungrammatical sentences have played a key role in the development of grammatical theory. Even the absence of a construction type from a conversational corpus of millions of words is no guarantee that it does not form part of the linguistic competence of a native speaker. Ironically, even Thompson and Hopper (2001: 40) appeal to ungrammatical sentences to help underscore their points. In other words, if for no other reason, there will always be a place for introspective data.

Consider an important related point. Corpus-oriented linguists focus almost exclusively on language production and tend to ignore comprehension completely. After all, how might one reliably extract comprehension data from a corpus? No doubt carefully designed experiments are possible, but introspective judgments are still our best hope for deciding whether a sentence is comprehensible or not. Along the same lines, linguists such as Hopper and Thompson never discuss our ability to make judgments about sentences that we have never heard. Interpreting novel strings and making judgments of well-formedness require computational

    


ability, that is, they require a grammar, not just a memorized stock of formulas and simple constructions.

The second limitation of conversational corpora: They conflate the grammars of speakers of different varieties of the same language

The second limitation is based on the fact that nothing can necessarily be concluded about the linguistic competence of an individual speaker on the basis of corpora that contain utterances from various speakers who are not all members of the same speech community. The Fisher transcripts go out of their way to include American English speakers from different walks of life, different regions, and different income levels. This is a very good thing if one wants to get a feel for the diversity of the language spoken across the country. But it is a disaster if one is probing the grammatical competence of an individual speaker. And this, after all, is what generative grammatical theory is all about. Psychologically speaking, there is the I-language of an individual and there are the universals common to all grammars, but really nothing in between. In grammatical theory there is no such thing as the concept of a “pan-American English,” which can be motivated by pooling the output of a large number of speakers.

Take a concrete example. Many speakers of American English produce what are called “positive anymore” sentences. In fact these occur in the Fisher corpora: (17)

a B: most of my time is leisure anymore so and
b A: it’s a fact of life anymore
c A: well you know the ones that are on t._v. anymore are getting pretty racy

I had never heard a “positive anymore” sentence until I was away at university; and the first time I heard one I could not parse it. I am still not sure how to use it appropriately. How could my grammar conceivably be the same as that of an English speaker who uses the construction natively? In fact, Labov (1972) has used positive anymore sentences as an argument against pan-dialectal grammars, given that most people who do not use the construction do not know what it means. By the way, despite what is often said, it is not a simple synonym of “nowadays.” So (18a) (a constructed example) is possible, but not (18b): (18)

a I was dealt good hands when we started playing bridge an hour ago, but anymore they’re really crappy.
b * I was dealt good hands when we started playing bridge an hour ago, but nowadays they’re really crappy.


Consider, too, the nonstandard usages by one speaker in the Fisher corpora. Speaker A in (19) uses the constructions outlined in (19a–f): (19)

a. The invariant be construction: A: aw i don’t have a best friend my best friend is god so he be the one to give it to me
b. Coordinated object pronouns in subject position: A: yeah you know me and you must be in the same situation
c. Been as a simple past: A: my best friend but uh my best friend is acting up and he been acting up and i’m tired of him acting up so i think i’m going to go about my business so
d. Go to as a synonym for start: A: because people go to acting funny when they get money
e. Ain’t got to as a synonym for don’t have to and negative concord: A: yeah well i pay i pay for it where i ain’t got to worry about no mortgage
f. Uninflected third person singulars: A: and you should see my my matter of fact i was just in my bathroom and my bathroom look like a million dollars

I understand all these sentences, but there is no reason to think that they are generated by my grammar. And, without question, there are construction types licensed by my grammar that would be totally foreign to the grammar of Speaker A. In brief, there is no way in which one can draw conclusions about the grammar of an individual from usage facts about communities, particularly communities from which the individual receives no speech input. There are many non sequiturs in the literature that arise from ignoring this simple fact. So, Manning (2002) observes that Pollard and Sag (1994) consider sentence (20a) grammatical, but that they put an asterisk in front of sentence (20b): (20)

a We consider Kim to be an acceptable candidate.
b *We consider Kim as an acceptable candidate.

Manning then produces examples from the New York Times of sentences like (20b). Perhaps (20b) is generated by Pollard’s grammar and perhaps it is not. Perhaps (20b) is generated by Sag’s grammar and perhaps it is not. But we will never find out by reading the New York Times. The point is that we do not have “group minds.” No input data that an individual did not experience can be relevant to the nature of his or her grammar. Now one might object that in this chapter I have been as guilty as Manning (2002). After all, I have been appealing to the Fisher corpora to make claims about grammatical competence in English in general. But I am simply trying to meet

    


Thompson and Hopper on their own terms. What we need is decent-sized corpora of the linguistic behavior of particular individuals, or at least of individuals in a particular speech community, narrowly defined. Do they exist? I do not think so. Until they do, for this reason alone, introspective judgments are irreplaceable.

The third limitation of conversational corpora: They lead linguists to exaggerate the importance of text frequency to the shaping of grammar

Frequency of use, as calculated on the basis of evidence from conversational (and other) corpora, is uncontroversially an important factor in directing grammatical change. Frequency drives the grammaticalization of locative nouns to adpositions, of pronouns to person markers, of auxiliaries to tense and aspect particles, and much, much more. But a word of caution is necessary here. Joan Bybee, for one, has often pointed to the effect of frequent use on constituent structure. For example, Bybee and Scheibman (1999) give some pretty good evidence that in frequent phrases like I don’t know, the subject and the auxiliary form a surface constituent, not the auxiliary and the verb. So consider the two bracketings in (21a–b): (21)

a [I] [don’t know]      [“Classical” analysis]
b [I don’t] [know]      [Bybee and Scheibman analysis]

Bybee and Scheibman appeal to evidence that supports (21b) in order to dismiss traditional generative constituent analysis. But other tests—binding relations for example—support the traditional analysis. What is going on, then? As far as I can see, what we have here is another example of a “bracketing paradox,” that is, a situation where a single string requires different analyses at different levels of grammar. Example (22) presents two well-known cases of bracketing paradoxes: (22)

a transformational grammarian (lexically [transformational] [grammarian], but semantically [transformational grammar] [ian])
b this is the cat that ate the rat (syntactically [this is] [the cat that ate the rat], but phonologically [this is the cat] [that ate the rat])

I assume that I don’t know should be handled in more or less the same way. I certainly do not see anything there that would challenge standard models of grammar. Another important point is that the frequent use of a construction type in one language is not necessarily a reliable guide to what occurs cross-linguistically. For example, most English speakers control both “preposition stranding” (23a) and “pied-piped” PPs (23b):


(23)

a B: this is joe pinatouski who am i speaking to
b A: to whom am i speaking

But stranding is used vastly more often than pied-piping. In the Fisher corpora, the PP to whom occurs only eight times, while the full sentences Who am I speaking to? and Who am I talking to? occur twenty-four times and twenty-six times respectively. One might predict on this basis that stranding would be more common than pied-piping cross-linguistically. Such is not the case, however. Stranding is attested only in English and in the Scandinavian languages (and, marginally, in Dutch and in French).

Here is one more example of how frequency fails to predict typological distribution. Keenan and Comrie (1977) showed that, if a language can form relative clauses at all, then it can form them on subjects. One might predict, then, that subject relatives would be used more often than object or oblique relatives. Apparently this is not consistently the case. Fox and Thompson (1990) found that, with nonhuman referents and when the head NP is a matrix subject, 77 percent of English relative clauses are object relatives.

In sum, frequency is an important factor leading to the shaping and reshaping of grammar. But appeals to frequency should never be used as a substitute for careful grammatical analysis. Frequency generalizations derived from conversational corpora do not challenge theories constructed on the basis of introspective judgments.

The fourth limitation of conversational corpora: They are chaotic

The fourth limitation is probably the most serious one. The fact is that conversation is unbelievably messy. What we say is constrained in large part by our grammars, of course, but also by so much more. When we talk we get distracted, interrupted, and we often change our minds about what we want to say. And all this happens in mid-stream. Consider a typical exchange from the Fisher corpora: (24)

B: A: A: B: B: A:

do you have that problem well i i you know i think i’m not sure yeah flu in your lungs you’re saying not just your uh yeah right just kind of the up you know your chest and your throat and your uh nasal cavities or whatever the heck it is and all that stuff B: i get it all in my head and my throat but very very seldom ever any chest problems or any anything that makes me you know nauseous that’s not very

A: right
A: right
A: mm
B: common for me

In this discourse, we superficially have predicates without subjects, three complement-taking verbs stacked up one after the other with no complements, say used seemingly intransitively, and up used with a definite article. Only the most dyed-in-the-wool empiricist would argue that English grammar licenses sentences like these. There is such a gulf between the “syntax of conversation” (if one would want to call it that) and our mentally stored grammatical competence that it is plainly dangerous to draw too many conclusions about the latter from the nature of the former. Perhaps there is some way of filtering out the dysfluencies that does not involve introspection, but I am not sure how that might be done.

9.5 Concluding remarks

The bottom line is that no one form of data is theoretically privileged with respect to any other. Introspective data, conversational data, experimental data, data from ancient manuscripts, and so on all have their place and their pitfalls. Generative grammarians have undeniably tended to appeal to introspective data. But that has been more a function of convenience than of anything else. It is interesting that Chomsky, at least in the early years, was very critical of introspective data. In Syntactic Structures he wrote:

It is also quite clear that the major goal of grammatical theory is to replace this obscure reliance on intuition by some rigorous and objective approach. (Chomsky 1957: 94)

And eight years later in Aspects of the Theory of Syntax he wrote: Perhaps the day will come when the kinds of data that we now can obtain in abundance will be insufficient to resolve deeper questions concerning the structure of language. (Chomsky 1965: 21)

Has that day come? Quite possibly. Very few generativists would argue that introspective data are “sufficient” to resolve the deeper questions. To conclude briefly, when introspective data go head-to-head with data drawn from corpora of conversation, there is little reason to think that the theory of


grammar derived from one would differ greatly from a theory derived from the other. Hence one cannot dispute the relevance of introspective data.

Acknowledgments

I would like to thank Karen Brøcker and an anonymous referee for their helpful comments on an earlier version of this chapter.

10 Can we build a grammar on the basis of judgments?

Sam Featherston

10.1 Introduction

This chapter is intended as a contribution to the debate on introspective judgments as an appropriate basis for a grammar. I think traditional practice within the field of grammatical theory has been, and in its unreformed form still is, inadequate, and that more efforts are consequently needed to enable syntax to develop and make advances. Proposals for new analyses need to be more than speculation; we need to have a clear cycle of hypothesis generation, testing, and rejection for progress to be made. This is not new, and I am far from the first to make the point, but combating dataphobia in the field of syntax is like losing weight or dealing with climate change: it is not something that happens instantaneously, it is a long-term process. The pressure needs to be kept up, or else people will sink into complacency.

It is partly for this reason that I perceive with dismay a sort of countercurrent of data quality skepticism going around. Some linguists feel themselves encouraged to pursue grammatical study with little regard to the quality of the evidential base because some papers have been published which seem to allow this or encourage it. The aim of this chapter is to respond to this trend. One of the ways in which I will do this is by discussing a couple of papers which are frequently referred to on this topic, namely Sprouse and Almeida (2012c) and a widely circulated early form of Sprouse, Schütze, and Almeida (2013), henceforth SA12 and SSA13. These are excellent articles, which have been conscientiously composed and which constitute very real contributions to the field. I have no argument with their content, but I do wish to take issue with some aspects of their interpretation. For the most part my conflict is with conclusions that are drawn from these papers but not made within the papers, though sometimes the texts do seem to invite strong claims.
I will argue that these papers do not in fact legitimize data-light theory building by taking a closer look at what the studies reported in these papers actually show. My view is that these two papers do not, on

Sam Featherston, Can we build a grammar on the basis of judgments? In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Sam Featherston. DOI: 10.1093/oso/9780198840558.003.0010


 

closer inspection, justify armchair linguistics, as they are sometimes taken to do.¹ In fact I shall argue that they yield fairly strong evidence that the single judgments of the individual linguist do not form an adequate basis for further theory construction. There is something that I need to clarify before we get under way. The title of this article asks whether we can build “a grammar” on the basis of introspection, but in reality that is just a short form of the real question; what is fundamentally at stake here is whether we can build a good grammar, a better grammar than we have had up until now. Perhaps we should talk about the grammar, the grammar which is uniquely specified by the data. This is one way in which this chapter asks a rather different question from those addressed in SA12 and SSA13. While those texts test whether the judgments employed in the linguistics literature were adequate as a basis for the grammar so far, I would seek to query whether they are sufficient for us to make progress in grammar research. If they are not, then I would see these judgments as sub-optimal, even if they have been a sufficient basis to get us to the point where we now are. If you are driving the Paris–Dakar rally, you need a vehicle that will first get you down through France, and then across the Sahara. The fact that you can do the first 1,000 km on tarmac in a family car does not change this. Schütze (1996) relates this ambition to Chomsky’s (1965) claim that linguists’ own judgments are adequate for the “clear cases.” It was true then, but theory development is quite rightly trying to move beyond the clear cases and requires more finely grained distinctions in order to continue. I think a forward-looking perspective is necessary because our knowledge and understanding of syntax are still very sketchy. One of the clearest signs of this situation is that a range of different grammar models exist, with clearly divergent underlying architectures. 
If linguists cannot agree whether the form of the grammar is overgenerate and filter, declarative, generate and economize, or winner takes all, I do not see that we can yet be satisfied with our knowledge. Worse still, data-light linguistics offers us little in the way of infrastructure to support progress. In armchair linguistics, analyses come and sometimes go, but others remain in spite of having little empirical support. The issue is that claims that are only weakly linked to the database make few testable predictions and are thus impervious to the standard scientific tests of corroboration or falsification. Experimental data types are more exact and thus go hand in hand with a more rigorous data-driven academic praxis. It is tough on linguists because you can be proved wrong, but that is science. In the following I first present the sort of evidence that persuades me that we should beware of armchair judgments. In this section I show some experimental work done by myself and my colleagues but also refer to some of the results of SSA13 which seem to me to confirm what I am suggesting. I then move on to discuss why some people seem to think that the papers SA12 and SSA13 validate armchair judgments. Here I highlight three factors that lead me to interpret these two articles more cautiously. I will finish up by arguing in favour of quantified judgments, because only these support more complex grammar models.

¹ Phillips (2009) talks about armchair linguists, and I will use the term too, contrasting “armchair” judgments and “experimental” judgments, because I find these terms expressive.

10.2 The quality of armchair judgments

There is an extensive literature on judgments which debates whether they are a suitable basis for linguistics (see extensive citations in Schütze 1996; Featherston 2007; SA12; SSA13). We can perhaps identify four main problematic features of introspective judgments as they have been traditionally used in syntactic literature:

(1) single judgments are coarse-grained;
(2) single judgments have a high noise component;
(3) there is conflict of interest for the linguist as data source and interpreter;
(4) judgment data patterns do not match theoretical predictions.

I shall chiefly address the first two here. My basic point is that these reproaches are justified in the case of armchair judgments, but less so for experimental judgments.

10.2.1 Armchair judgments are less sensitive

In order to support this claim, I shall revisit the findings of Sprouse and Almeida (2012c), a paper often held to support the value of informal judgments. This paper is an early version of SSA13, an exceptionally detailed and well-performed investigation into the quality of judgments reported in Linguistic Inquiry. The authors have tested whether a sample of the intuitions reported can be replicated using experimental techniques, applying a range of different statistical tests. The results are scrupulously reported, which means that the reader can build their own picture of the findings, or indeed reuse their findings, as I do here. Figure 10.1 here (Figure 9 in Sprouse and Almeida 2012c) shows the success rates in replicating judgments from the Linguistic Inquiry sample in graphic and tabular form. The authors understandably highlight the very high success rates of the experimental methods in capturing the judgments from Linguistic Inquiry. With sufficient participants, the forced choice method reaches a 95% replication rate, while the other methods (yes/no question, magnitude estimation, seven-point rating scale) all manage 80% or more.

Figure 10.1. The percentages of judgment replications in Sprouse and Almeida (2012c), on the left in graphic form and on the right in tabular form
[Graphic not reproduced: line plot of the % of phenomena in LI (y-axis, 0 to 100) against sample size (x-axis, 0 to 100) for FC, LS, ME, and YN at 80% power. FC = forced choice, YN = yes/no, ME = magnitude estimation, LS = seven-point rating scale.]

Sample   FC    YN    ME    LS
6        35%   15%   29%   28%
10       69%   39%   58%   57%
15       78%   58%   70%   69%
20       83%   64%   74%   74%
25       89%   72%   77%   76%
30       91%   75%   79%   79%
35       94%   77%   81%   81%
40       95%   80%   82%   82%

But here’s the rub: it requires 40 informants to reach that sort of level. Below 25 participants, the success rate drops off steeply. When we look at a sample size of 6, the lowest participant number given, the success rate is between 15% and 35%. The figures for single judgments are not given, but we can see where the lines in the graphic are tending. So these carefully processed data give us a clear message. We need 20 or, ideally, more independent judgments to achieve satisfactory power. Fewer than 10 judgments must give a real cause for concern. Judgments by a single person are an inadequate basis for any firm conclusions. Now we need not take these figures too exactly. All linguists who actually use judgments as a data type (unlike Labov 1996) would agree that even a single person’s intuitions can under certain circumstances make quite fine distinctions. But one aspect of this data set is beyond discussion: the sharp downward curve in the success rate with fewer participants. Fewer informants undoubtedly produce less powerful results. So this is one clear way in which experimental judgments are a much better basis for grammar development than armchair judgments. Another way in which experimental judgments with multiple informants can produce better data concerns the revelation of finer differences. We present an example of this in the next section.
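The general link between sample size and power can be illustrated with a small calculation. The sketch below is not the resampling procedure used by Sprouse and Almeida; it computes the power of a one-sided exact sign test on forced-choice responses, under the assumption (made up here for illustration) that each informant independently prefers the "correct" sentence with probability 0.75. The function names and the value of `p_true` are hypothetical.

```python
from math import comb


def binom_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def sign_test_power(n: int, p_true: float, alpha: float = 0.05) -> float:
    """Power of a one-sided exact sign test with n forced-choice responses.

    p_true is the assumed probability that one informant picks the
    sentence reported as better (an illustrative assumption).
    """
    # Find the smallest count k that is significant under chance (p = 0.5).
    # For k = n + 1 the tail is empty (probability 0), so the loop always returns.
    for k in range(n + 2):
        if binom_tail(n, k, 0.5) <= alpha:
            return binom_tail(n, k, p_true)
    return 0.0


# Power for a few sample sizes, including a single judgment (n = 1).
for n in (1, 6, 10, 20, 40):
    print(n, round(sign_test_power(n, p_true=0.75), 2))
```

Under these assumptions a single judgment can never reach significance, and the power at six informants is far below the power at forty, which mirrors the shape of the curve in Figure 10.1 without reproducing its exact values.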

10.2.2 Armchair judgments are noisy

One reason why armchair judgments are less sensitive is that introspective judgments contain a relatively large noise component. I have discussed this in the past (e.g. Featherston 2007), but any linguist who has gathered judgments from groups and looked at them comparatively knows this. I will illustrate this using the patterns of data that we obtain from our standard experimental items, which instantiate the so-called cardinal well-formedness values (Featherston 2009; Gerbrich et al. 2019). We have developed a set of fifteen sentences that are reliably judged at five different levels of well-formedness, labeled from A to E, with three examples at each level. These provide a useful fixed external comparison set that allows us first to obtain something approaching absolute well-formedness values from studies that collect relative judgments and second, to make meaningful comparisons across experiments. The standard items provide a stable reference set here because they have been tested many times, in lots of different experiments. Their levels of well-formedness are established. In Figure 10.2 I show the results of the complete set, judged by 32 informants. These ratings were gathered in an experiment testing aspects of NP movement using the thermometer judgments method with a two-stage practice phase (Featherston 2009). Native speaker informants were recruited via the Prolific experiment participant portal and paid £2 for their participation. There were eight versions of the experiment, with four participants per version. Each person saw a total of 55 sentences in a pseudorandomized order, fifteen of which were these standard items. The scores were normalized to z-scores to remove variation in the use of the scale between participants. This procedure aligns each person’s scores so as to give them a mean value of zero and a standard deviation of 1. The y-axis on the chart thus shows these converted scores, and the zero value is the mean of all judgments. Higher scores represent greater perceived acceptability. The results are displayed as error bars showing the means and 95% confidence intervals of these normalized ratings. We choose this chart type to show that the groups clearly distinguish the cardinal well-formedness levels; the error bars of the confidence intervals do not overlap.

Figure 10.2. Mean results of the standard items instantiating the five cardinal well-formedness values A–E from Gerbrich et al. (2019). Each error bar represents 96 data points
[Chart not reproduced: error bars (95% confidence intervals of z-scores, y-axis from –2.0 to 2.0) for the cardinal well-formedness values A–E on the x-axis.]

But now look at Figure 10.3, which shows the same conditions as Figure 10.2, but split up so that each participant’s ratings are shown separately. Note that the judgments of the individual informants are shown as bars, each bar representing the mean of three judgments. The participants’ results are arranged by the experiment version that they saw, but the standard items were identical across the versions, so it is no surprise that these do not differ. Most people most of the time get the five values in the right order, but sometimes they do not. In fact 19 out of 32 get them all in the correct order (counting ties as success), so 13 do not. Person 1 of version 1 in the top left-hand corner is a case in point: the bars A, B, D, and E (see labels in baseline) are in the right order, but the three judgments making up value C have a higher mean value than those of A and B. The noise in an individual’s judgments is plain to see even in these means of three judgments; but single judgments are naturally even noisier. Figure 10.4 shows each single judgment as a bar from the participants who did versions 1–3; this chart is thus an expansion of the first three columns of Figure 10.3. The finding is unambiguous: while the group judgments produce clear and robust distinctions, the single judgments are noisy. This is consistent with the finding of very low statistical power reported for the sample size of 6 judgments in SSA13. While group results show clear patterns, even the means of three are still quite noisy, and the individual judgments show the trend in the data but not much more.
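The per-participant normalization and the ordering check described above can be sketched in a few lines. This is only an illustration: the ratings are invented, the function names are hypothetical, and for brevity each participant is normalized over the fifteen standard items alone, whereas the experiment normalized over all 55 sentences each person saw.

```python
from statistics import mean, pstdev


def normalize(ratings_by_level: dict[str, list[float]]) -> dict[str, list[float]]:
    """Convert one participant's raw ratings to z-scores (mean 0, SD 1)."""
    all_ratings = [r for rs in ratings_by_level.values() for r in rs]
    m = mean(all_ratings)
    sd = pstdev(all_ratings)  # population SD; using the sample SD is also common
    return {lvl: [(r - m) / sd for r in rs] for lvl, rs in ratings_by_level.items()}


def in_expected_order(z_by_level: dict[str, list[float]], levels: str = "ABCDE") -> bool:
    """True if the level means never rise from A to E (ties count as success)."""
    means = [mean(z_by_level[lvl]) for lvl in levels]
    return all(a >= b for a, b in zip(means, means[1:]))


# Invented raw ratings: three judgments per cardinal level.
person_a = {"A": [70, 65, 60], "B": [60, 55, 50], "C": [80, 75, 70],
            "D": [30, 25, 20], "E": [10, 5, 0]}   # C rated above A and B
person_b = {"A": [90, 85, 80], "B": [70, 65, 60], "C": [50, 45, 40],
            "D": [30, 25, 20], "E": [10, 5, 0]}   # monotone from A to E

for name, person in (("person_a", person_a), ("person_b", person_b)):
    print(name, in_expected_order(normalize(person)))
```

The first invented participant resembles the case of Person 1 of version 1 discussed above: the means for A, B, D, and E are ordered as expected, but C is rated higher than A and B, so the check fails.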

Figure 10.3. Mean judgments of each participant of the three standard items instantiating each of the five cardinal well-formedness values A–E. The labels A–E on the baseline apply to all charts vertically above them
[Chart not reproduced: one panel per participant, arranged by Version and Person; x-axis: cardinal well-formedness values A–E; y-axis: mean z-scores, –3.0 to 3.0.]

Figure 10.4. Single judgments of the standard items by the participants from versions 1–3. This chart distinguishes the three individual judgments of each well-formedness level
[Chart not reproduced: one panel per participant, arranged by Version and Person; x-axis: the fifteen standard items A1–E3; y-axis: z-scores, –3.0 to 3.0.]

 

To summarize, we thus have converging evidence that individual judgments offer only poor-quality evidence, but judgments from groups are much more reliable. Since one of the key factors differentiating armchair judgments from experimental judgments is the use of informant groups, this would seem to suggest that experimental judgments are preferable as a source of evidence. This seems obvious to me, but I meet people who are persuaded that the papers SA12 and SSA13 showed that most armchair judgments are correct. In the next section we will examine how these contrasting views can come about.

10.3 Reassessing SA12 and SSA13

I mentioned in the chapter introduction that I would contrast my findings with those of SA12 and SSA13, which are thought to present a much more positive picture of the success of armchair judgments. In order to see why people might interpret the results this way, we will first need to look at their studies in more detail. Let me repeat that I do not in any way wish to criticize or devalue this work. I only wish to clarify what it shows, so that it is not misinterpreted. SA12 wished to establish whether the quality of armchair judgments had led to erroneous assumptions in the field of linguistics. This is a useful approach: rather than addressing the ability of individuals to make accurate judgments, they investigate whether in practice this has caused erroneous judgments to be entertained in the field. In order to do this they took 365 judgments from the introductory textbook Core Syntax (Adger 2003). They examined the judgments in the book and broke them down into different types, focusing on those that could most easily be tested. They looked at judgments of paired examples, where the claim is that one example is good and the other bad, and at existence judgments, where the claim is either that a particular sentence type is grammatical, and thus a part of the language, or ungrammatical, and thus not part of the language. They tested 219 of the first type using magnitude estimation (Stevens 1975) and 250 of the second type by giving a yes/no binary choice. A total of 240 informants carried out the magnitude estimation experiment, while 200 did the forced choice experiment. The core finding for the magnitude estimation experiment was that 3 out of 115 contrasts were not replicated (max. 2.6% non-replication rate); for the yes/no task it was that 5 out of 250 did not replicate (max. 2% non-replication rate).
The authors therefore claim that they see “no reason to favor formal methods over traditional methods solely out of a concern about false positives” (SA12: 634). It is this conclusion that has led some people in the field to believe that armchair linguistics has been validated. There are however a number of reasons why one might consider this belief to be hasty. I will address them in turn here.

         ?

173

10.3.1 What counts as success?

The first reason why we might relativize this interpretation of the results is the relative criterion that the authors choose as the test for replication. For the magnitude estimation experiment, they state: “we will define replication as the simple detection of a significant difference in the correct direction between the conditions in a phenomenon” (SA12: 615). And for the existence example with a yes/no test, they state: “we will define replication as the observation of significantly more yes-responses than no-responses for the sentences that were reported as grammatical by Adger, and as the reverse (more no-responses than yes-responses) for the sentences that were reported as ungrammatical by Adger” (SA12: 615). They discuss this choice at some length and explicitly admit that other criteria might have been applied. But the reasoning is clear: “First and foremost, we believe that the simple detection of a difference in the predicted direction is closest to the intent of Adger (2003)” (SA12: 616). I would suggest that an alternative interpretation is possible. Adger (2003) is an introduction to the central aspects of generative grammar. The model of syntax and well-formedness that I would expect such a textbook to apply is one where the successful grammar generates all and only the sentences which are part of the language, and no others. In this model, therefore, well-formedness (“grammaticality”) is absolute, by which we mean that every sentence has a value on a two-pointed scale, with no possible further gradations; but well-formedness is also inherent, which means that this value is independent of any structural relationship or comparison with other sentences. Looking back at Adger (2003) we can see support for exactly this position, since the text uses almost exclusively the contrast of starred and unstarred examples and not intermediate symbols (as the authors of SA12 themselves note).
In fact the text is admirably explicit on this point. It distinguishes three reasons for unacceptability (Adger 2003: 1.1.2); an example can be unacceptable because it is

• hard to parse: I looked the number that you picked out with a needle [ . . . ] up.
• implausible: The amoeba coughed.
• or else simply not licensed by the grammar: By is eaten monkey banana that the being.

Adger distinguishes between unacceptable for performance reasons and ungrammatical in the technical sense: “Remember that we assign a star to a sentence if we think that the explanation for its unacceptability is that it does not conform to the requirements of the grammar of the language under discussion” (Adger 2003: 5). This leads me to conclude that he is dealing with the classic model of grammaticality. Note also that (un)grammaticality is structurally driven: “we assume that speakers can’t assign a structure to the particular string in question at all” (Adger 2003: 5). Ungrammaticality is absolute and categorical: “We cannot make the [ungrammatical] sentence any better” (Adger 2003: 3). In the light of these very clear statements in Adger (2003), it seems to me that the most obvious interpretation of Adger’s assumptions is that grammatical sentences should be judged to be acceptable and ungrammatical sentences should be judged to be unacceptable, and it is this criterion that we shall pursue here. This contrasts with the interpretation in SA12, which is that Adger assumes only a relative difference between grammatical and ungrammatical. We can illustrate the contrasting predictions in Figure 10.5.

Figure 10.5. This chart illustrates the differences in interpretations of Adger’s (2003) model of well-formedness. The pairs of error bars represent paired examples in SA12’s data set. The unstarred bars represent examples that Adger judges grammatical. The starred bars represent their ungrammatical comparison examples. The question is whether pairs B and C would count as replications
[Chart not reproduced: three pairs of error bars labeled A, B, and C on a y-axis from –1.0 to 1.0; one bar in each pair is starred.]

In Figure 10.5 we see three example pairs and their hypothetical experimental results shown as error bars; the y-axis represents normalized judgments, with higher values representing greater perceived acceptability. On SA12’s criterion for replication, all three pairs count as successful replications, because only a relative difference is required between the pairs; there is no requirement for an absolute level of acceptability. If we adopt my own interpretation of Adger’s model of grammaticality as the criterion, pair A is a replication, but pairs B and C are not. This model of grammaticality would demand that the grammatical example be in the acceptable range and the ungrammatical one in the unacceptable range for successful replication. Since the authors of SA12 gather continuous data in their magnitude estimation studies, there is no simple division between the acceptable and unacceptable ranges. But SA12’s magnitude estimation study tested exactly equal numbers of examples claimed to be grammatical and ungrammatical (109 of each, plus one with a question mark which we ignore). We can therefore, exceptionally, treat the mean value of the normalized judgments as a meaningful threshold. This is of course only an approximation, but as long as we keep the approximation in mind it seems legitimate. This adjustment to the success criterion results in 15 of the 115 contrasts in the magnitude estimation experiment failing to be replicated, which is a 13% failure rate, instead of the “maximum” 2.6% failure rate that the authors of SA12 report. To their credit, the authors of SSA13 report a calculation similar to this on the equivalent data sets from their Linguistic Inquiry sample. They look at the normalized data from the magnitude estimation task and the seven-point scale task, which produce results on a continuous scale, and examine how many sentences fall into the “wrong” half, that is, how many examples judged to be well formed are given ratings similar to those in the ill-formed group and how many examples judged to be ill formed fall among the well-formed group. They do not fix a threshold on the basis of an external criterion, but look for the threshold that minimizes the number of examples on the wrong side. They find that the fewest possible items on the wrong side is 28 in the magnitude estimation data and 27 in the seven-point scale data. These figures represent 9.8% and 9.5% of the total data included in the analysis. Applying this minimization approach to the magnitude estimation data in SA12, we find the smallest number of errant data points with the threshold set at 0.105, which yields a minimum number of 13 failures, which is an 8.8% failure rate. Now in fact I do not think that the exact numbers are very important. But I think it is important that it is clear that even subtly different assumptions may radically affect the result. To summarize, if we assume a classic generative model of well-formedness with a grammatical–ungrammatical dichotomy, these tests of armchair judgments are producing minimum failure rates of 8.8% (SA12) and 9.5% and 9.8% (SSA13). This is one of the reasons why I think the results in these papers are misinterpreted when they are taken as giving armchair linguistics a blanket clean bill of health.
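The difference between the two replication criteria, and the threshold-minimizing calculation, can be made concrete with a short sketch. All scores below are invented to resemble the three hypothetical pairs of Figure 10.5; the function names are hypothetical, and the relative criterion is simplified to a comparison of means, whereas SA12 require a statistically significant difference.

```python
def replicates_relative(good: float, bad: float) -> bool:
    """Simplified relative criterion: the reported-good item scores higher."""
    return good > bad


def replicates_absolute(good: float, bad: float, threshold: float) -> bool:
    """Absolute criterion: good item above the threshold, bad item below it."""
    return good > threshold and bad < threshold


def best_threshold(items: list[tuple[float, bool]]) -> tuple[float, int]:
    """Find a threshold minimizing the number of items on the wrong side.

    items: (normalized score, reported_grammatical) for each sentence.
    Returns (threshold, number of wrong-side items).
    """
    values = sorted(score for score, _ in items)
    # Candidates: one below all scores, midpoints between neighbours, one above all.
    candidates = ([values[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(values, values[1:])]
                  + [values[-1] + 1.0])

    def wrong_side(t: float) -> int:
        return sum((score > t) != grammatical for score, grammatical in items)

    return min(((t, wrong_side(t)) for t in candidates), key=lambda pair: pair[1])


# Invented (good, bad) scores for pairs A, B, and C.
pairs = {"A": (0.8, -0.8), "B": (0.9, 0.3), "C": (-0.3, -0.9)}
for name, (good, bad) in pairs.items():
    print(name, replicates_relative(good, bad), replicates_absolute(good, bad, 0.0))

items = [(s, g) for good, bad in pairs.values() for s, g in ((good, True), (bad, False))]
print(best_threshold(items))
```

With these invented values all three pairs pass the relative criterion, only pair A passes the absolute one with the threshold at the mean of zero, and no threshold places every item on the expected side.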

10.3.2 Do binary oppositions make a grammar?

In this section I would like to highlight another reason why I think that the implications of SA12 and SSA13 need to be given a nuanced interpretation. The authors of SA12 and SSA13 are addressing the very specific question whether judgments in published material are erroneous, and for this they need an easily applicable test for successful replication. But the issue in this wider debate is whether different types of judgments can form an appropriate basis for a grammar. Whether a data type is reliable is only one criterion; the amount of relevant information contained in the data type is another. Whether a judgment type is adequate thus also depends on what model of grammar and well-formedness we assume or strive for.


Relative judgments yield evidence for the existence of a difference between minimal pairs, and thus for the existence of the factor that distinguishes them; but relative judgments offer us little more. They do not directly tell us anything about the size of the effect, or about where the individual items might be located on a scale of well-formedness; the authors themselves note this (SSA13: 225). Relative judgments are therefore fairly restricted in their information content. We can see this in the results of SA12 in the multiexample contrasts. While most of the examples from Adger (2003) that they tested were pairwise comparisons or existence examples (“this is/is not possible in the language”), there were also 11 example sets consisting of multiple examples. The authors note that the failure rate in these examples was 3 out of 11 cases, which is 27%, a far higher rate than in the pairwise tests (they discuss the finding at some length). I present here an example of these non-replications to show how this occurred. This set of examples was designed to illustrate the well-known superiority and discourse linking effects (Chomsky 1973; Pesetsky 1987). Table 10.1 below gives the sentence codes from SA12, which include the example judgments (g = grammatical, * = ungrammatical), the sentence itself, the magnitude estimation score, and an abbreviation for the syntactic condition (my addition). I illustrate these scores in Figure 10.6, because the graphic presentation allows us to take in the situation at a glance.

Table 10.1. Table showing results of magnitude estimation experiment on superiority and discourse linking from SA12; wh indicates a bare wh-item, wx indicates a discourse-linked wh-phrase

Code      Sentence                            Mean     Structure type
9.120.g   Who poisoned who?                   0.11     wh-subj wh-obj
9.120.*   Who did who poison?                 –0.54    wh-obj wh-subj
9.124.g   Which poet wrote which poem?        0.40     wx-subj wx-obj
9.125.g   Which poem did which poet write?    0.10     wx-obj wx-subj

Figure 10.6. This chart shows the magnitude estimation scores of the four sentences that make up the illustration of superiority and discourse linking from SA12
[Chart not reproduced: bars for the four structure types (wh-subj wh-obj; wh-obj wh-subj, starred; wx-subj wx-obj; wx-obj wx-subj) on a y-axis from –0.6 to 0.6.]

         ?

177

Traditionally (e.g. Pesetsky 1987), this pattern has been thought of as consisting of three good examples and one bad example, and Adger (2003) follows this; so the authors of SA12, with their relative replication criterion, consequently test it as three “better” and one “worse” examples. But the experimental scores show one good (0.40), two medium (0.11, 0.10), and one bad condition (–0.54), and these three levels are roughly equidistant. The conception of well-formedness as a mere relative distinction that the authors of SA12 apply has difficulties dealing with such a pattern. When it finds a difference, it attributes the status “well formed” to the better one and “ill formed” to the worse one. That is all it can do. So it naturally encounters problems when applied to a data set like this one, with more than two levels, as superiority and discourse linking standardly show (e.g. Featherston 2005a, 2005b). The results will probably show us that the two medium conditions are different from both the bad condition and the good condition. At this point the model crashes, because these intermediate bars are assigned both the status “well formed” and “ill formed.” A purely relative well-formedness model, as applied in SA12 and SSA13, simply does not make the distinctions required to deal with this sort of data pattern. Using a more sophisticated well-formedness model would capture this data set; but it would make higher demands on what counts as descriptive success, which would make it more difficult to achieve with only armchair judgments. In fact we can derive a generalization from this example. We find this multilevel data set because there are two things going on: there is both the superiority constraint and the separate discourse-linking effect, which the data reveal to be two independent factors. SA12’s relative model can capture the existence of individual constraints; but it cannot reliably capture the relations between constraints. 
Since a grammar is surely more than just a collection of isolated noninteracting constraints, this is a real drawback. Constraint interaction is a key component of a grammar. We see this in Adger (2003), chapter 2, where the basic sentence The pig grunts is varied in various ways to show how underlying features determine grammaticality. Linguistic data sets do not consist only of binary pairs; they must contain multiple contrasts if they are to show the full picture (Table 10.2).
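The labelling conflict described above for the superiority and discourse-linking set can be reproduced mechanically. The sketch below applies a purely relative labelling rule to the four mean scores from Table 10.1, taking the starred condition as negative, as it is plotted in Figure 10.6. The cut-off `min_diff` is a hypothetical stand-in for a significance test, and the function name is invented for illustration.

```python
from itertools import combinations

# Mean magnitude estimation scores from Table 10.1 (SA12 sentence codes).
scores = {"9.120.g": 0.11, "9.120.*": -0.54, "9.124.g": 0.40, "9.125.g": 0.10}


def relative_labels(scores: dict[str, float], min_diff: float = 0.2) -> dict[str, set[str]]:
    """Label items pairwise: the better item of each distinct pair is
    'well formed', the worse one 'ill formed'.

    min_diff is an assumed cut-off below which two items count as
    indistinguishable (a crude substitute for a significance test).
    """
    labels: dict[str, set[str]] = {code: set() for code in scores}
    for a, b in combinations(scores, 2):
        if abs(scores[a] - scores[b]) < min_diff:
            continue  # no detectable difference, so no labels assigned
        better, worse = (a, b) if scores[a] > scores[b] else (b, a)
        labels[better].add("well formed")
        labels[worse].add("ill formed")
    return labels


for code, label_set in relative_labels(scores).items():
    print(code, sorted(label_set))
```

Under this rule the two medium conditions (9.120.g and 9.125.g) end up carrying both labels at once, while the best and worst conditions each receive a single consistent label.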

Table 10.2. Table showing variations of the form of the basic sentence The pig grunts, illustrating how multiple factors interact to produce complex well-formedness patterns (based on Adger 2003: chapter 2)

The pig grunts      The sheep bleats      *The pig grunteds
The pigs grunt      *The sheeps bleat     The pigs grunted
*The pig grunt      The sheep bleat       The pig grunted
*The pigs grunts    *The sheeps bleats    *The pigs grunteds


No set of four, much less any pair, is enough to show the full picture. Only comparisons in multiple parameters can capture how factors interact to produce the complex grammatical patterns of even such apparently simple phenomena as subject–verb agreement. Chomsky underlines this too:

it is clear that we can characterize unacceptable sentences only in terms of some “global” property of derivations and the structures they define, a property that is attributable, not to a particular rule, but rather to the way in which rules interrelate in a derivation. (Chomsky 1965: 12)
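A toy model makes the point about interacting factors tangible. The sketch below is not Adger's feature system; it is a deliberately minimal, hypothetical lexicon in which each noun and verb form lists the number values it is compatible with, and a sentence is licensed only if both forms exist and share a number value. It is meant only to show that the twelve judgments in Table 10.2 follow from the interaction of several factors (noun number, irregular plurals, verb agreement, tense) rather than from any single pairwise contrast.

```python
# Hypothetical toy lexicon: each form maps to the number values it allows.
NOUNS = {"pig": {"sg"}, "pigs": {"pl"}, "sheep": {"sg", "pl"}}
VERBS = {
    "grunts": {"sg"}, "grunt": {"pl"}, "grunted": {"sg", "pl"},
    "bleats": {"sg"}, "bleat": {"pl"},
}


def is_licensed(sentence: str) -> bool:
    """True if the toy lexicon licenses a 'The NOUN VERB' sentence."""
    _, noun, verb = sentence.lower().split()
    if noun not in NOUNS or verb not in VERBS:
        return False  # no such form, e.g. "sheeps" or "grunteds"
    return bool(NOUNS[noun] & VERBS[verb])  # number values must overlap


# The twelve examples from Table 10.2 with their reported status.
TABLE_10_2 = {
    "The pig grunts": True, "The pigs grunt": True,
    "The pig grunt": False, "The pigs grunts": False,
    "The sheep bleats": True, "The sheeps bleat": False,
    "The sheep bleat": True, "The sheeps bleats": False,
    "The pig grunteds": False, "The pigs grunted": True,
    "The pig grunted": True, "The pigs grunteds": False,
}

for sentence, reported in TABLE_10_2.items():
    print(f"{sentence:20} reported={reported} model={is_licensed(sentence)}")
```

Removing any one ingredient from the toy lexicon (for example, the number-neutral entry for sheep or for the past tense form) makes the model diverge from the table, which is one way of seeing why a single set of four, or a single pair, underdetermines the pattern.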

The question that this chapter asks is whether we can use judgments to build a grammar. The purely relative judgments instantiated in the replication criterion in SA12 and SSA13 yield pairwise distinctions, but they do not contain enough information to serve as a basis for a grammar, since this consists not only of a list of individual effects, but also of a specification of how these interact with each other. It follows that, even if it were the case that armchair judgments reliably made relative distinctions, this would still not demonstrate that these judgments can function as a sufficient basis on which to build a grammar. This test cannot therefore answer this question.

10.3.3 Do armchair judgments make too many distinctions?

One more general reason why I do not think that the studies in SA12 and SSA13 can be seen as legitimizing armchair linguistics relates to the sort of test that these papers apply. Whenever they report replication success, the authors are careful to note that this concerns only the false positives, and they discuss the implications of this in some detail. In spite of this, I repeatedly meet colleagues who think that these papers demonstrate the validity of armchair judgments conclusively. It is thus apparent that not all linguists grasp what the restriction to false positives means. So let me restate it. Armchair judgments and experimental judgments are more or less the same data type, except that for experimental judgments we take more trouble with design and procedure: we ask more people, we use more different lexical forms of the structures, we may use a more precise scale. These factors give us more exact information. It is therefore quite implausible that armchair judgments should produce more detail than experimental judgments. But the test for false positives is asking whether the armchair judgments in the literature contain too many distinctions, that is, distinctions which cannot be replicated.²

Now this reproach of being too powerful, producing too much differentiation, is one which is fairly commonly made to experimental syntacticians and experimental data by practitioners of armchair linguistics. The idea is often that some differences revealed by experiments are not grammatically relevant, which is perhaps not without some justification. It is the other way round with armchair judgments; their major disadvantage, in my opinion, is that they produce insufficient detail. But the false positives test does not address this problem; it should therefore not be interpreted as validating armchair linguistics. In a court of law, we must swear to tell “the truth, the whole truth and nothing but the truth.” The false positives test addresses whether judgments are telling nothing but the truth. But it does not establish whether they are telling the whole truth. A validation of armchair judgments would require that too. To continue this metaphor: armchair judgments are like a short-sighted witness who had forgotten his glasses on the day in question. They don’t even see the whole truth. Whether they are prone to exaggeration (false positives) is not our major concern.

² It is interesting to consider what the source of these additional distinctions might be, since they cannot derive from the data type itself. One possibility is that they are instances of linguists generalizing their own noisy judgments. We saw in 10.4 that people giving judgments sometimes make idiosyncratic distinctions that are not replicated in the group results. Another possibility is that distinctions derive from a particular lexical instantiation of a minimal pair. Armchair judgments usually contain lexical variants, but experimental judgments control for this by systematically testing structural contrasts with multiple lexical items.

10.4 Towards a better grammar

We have seen in 10.2 that there is robust evidence that armchair judgments are able to make fewer distinctions than experimental judgments and that they are noisy. In 10.3 we looked at SA12 and SSA13, which address some aspects of the accuracy of judgments in the literature. They are interesting and valuable work, but the implications should not be overstated. They certainly do not provide a blanket validation of armchair judgments, nor do their authors claim this. In this final section I should like to sketch out how I think that our field should move on from armchair judgments. Let us note here that the authors of SA12 and SSA13 are quite aware of this alternative approach and address it in these papers, but their focus is on testing the database of work done in the past rather than on designing tools for the future. I think syntactic theory needs to look forward; but it also needs to be true to its roots. In order to make progress the field needs to continue its move toward becoming evidence-based. Claims need to be based upon a more solid database than just a single person’s judgments, since these are so noisy and subjective as to be unfalsifiable. This step will enable us to develop a clearer research cycle of hypothesis testing and rejection. This in turn will help clear the field of its legacy of theoretical fossils and allow their replacement. Armchair judgments are insufficiently finely grained and determinate to support this process. But this does not demand that we abandon judgments as our basic data type. One factor in the success of the generative enterprise has been that it seeks to account for our intuitions of well-formedness, which we can access at any time. This both makes it into a field of study with psychological relevance and grounds it in our personal experience. We can build on this and stay true to our roots by building grammars as models of well-formedness judgments, in line with Chomsky’s suggestion that “there is no way to avoid the traditional assumption that the speaker-hearer’s linguistic intuition is the ultimate standard that determines the accuracy of any proposed grammar” (Chomsky 1965: 21). But, if we are going to take judgments as our ultimate standard, we need to faithfully reflect what we see in the judgment data, not abstract from these data. We cannot pick and choose what aspects of judgment data we will adopt in our analyses, unless we can show that a particular factor is unrelated to the linguistic stimulus; and, if we argue this, then we should systematically control for this factor. But our aim should be to account for the whole of the remainder.

10.4.1 Crime and punishment

So what does this mean in practice? In terms of the metaphor of telling “the whole truth,” what is the whole truth? Let us look again at the magnitude estimation results for superiority and discourse linking from SA12 in Figure 10.6 (repeated here as Figure 10.7, for convenience) to remind ourselves what the primary data look like. When we look at such results, we observe again and again that the basic drivers of the patterns we see are quantifiable effects, not just rules which are broken or not broken.

[Bar chart: vertical axis from –0.6 to 0.6; conditions: wh-subj wh-obj, wh-obj wh-subj, wx-subj wx-obj, wx-obj wx-subj]

Figure 10.7. This chart shows the magnitude estimation scores of the four sentences that make up the illustration of superiority and discourse linking from SA12

The traditional (e.g. Pesetsky 1987) conception of this phenomenon is that there is a rule violation for the inversion of subject and object wh-items, which is however nullified if the wh-items are wh-phrases of the form which poet and link into the discourse. But that is not what the data show. Instead we observe that there is a cost to inverting the order of subject and object wh-items which is largely independent of the wh-item type: this superiority effect just makes the structure seem less acceptable by a roughly constant amount. But there is a second effect in these data too, which is discourse linking. The observation is that informants judge these multiple wh-questions better if the wh-items involved are full wh-phrases of the form which poet or which poem rather than simple wh-pronouns such as who or what. This effect too causes a difference in perceived well-formedness. But here is the big point: the two effects are independent of each other, but they add up. The best condition is that with no superiority violation and discourse-linked wh-items, wx-subj wx-obj, which takes the form Which poet wrote which poem? The worst condition is wh-obj wh-subj, in which both factors are negative. The two conditions in the middle each have one bad and one good factor. We cannot describe the status of these examples without reference to both these factors, but we cannot capture their interaction without quantifying their effects.

Table 10.3. Table showing results of magnitude estimation experiment on sentential subjects and extraction from SA12. ClsSubj indicates clausal subject, itSubj indicates an expletive subject, ex indicates extraction

Code      Sentence                                  Mean     Structure type
10.90.g   That Peter loved Amber was obvious.        0.06    ClsSubj
10.91.g   It was obvious that Peter loved Amber.     1.02    itSubj
10.93.*   Who was that Peter loved obvious?         –1.04    ClsSubj ex
10.92.g   Who was it obvious that Peter loved?       0.39    itSubj ex

[Bar chart: vertical axis from –1.5 to 1.5; conditions: ClsSubj, itSubj, ClsSubj ex, itSubj ex]

Figure 10.8. This chart shows the magnitude estimation scores from SA12 of the four sentences from Adger (2003) that show the interaction of sentential subjects and extraction

This is all the more necessary as these two factors do not have the same “strength”: the superiority effect is larger than the wh-item type effect. Our grammar has to specify what the restrictions are, but also what happens if they are violated. We find this sort of pattern not only in this data set but systematically. Table 10.3 shows another set of four items that SA12 tested from Adger (2003). I illustrate these results in a bar chart too, as this makes the pattern more immediately discernible (Figure 10.8). Adger (2003) uses these examples to present a sentential subject island constraint, so the set should consist of three good examples and one starred one. But here too we can see that there are two effects in operation which independently affect the perceived well-formedness of the example sentences and produce multiple levels of acceptability. Having a clausal subject entails a cost in acceptability; the extraction also comes at a cost. As in the superiority examples above, the effects add up. Again, we cannot model this without quantifying the effects, because we cannot add categorical differences.³ We therefore see two strong reasons to use a model of well-formedness which quantifies effects. First, the observed patterns of data require this information. The patterns of judgments show (among others) additive effects, and addition entails quantification. Since we have committed ourselves to native speaker intuitions as the ultimate criterion, we should therefore include this information. Second, approaches which do not use this information are making things unnecessarily difficult for themselves. In both the discourse-linking case and the subject island case, the traditional assumption is that the grammar must capture a rule infringement that applies only in specific circumstances: superiority only with a certain wh-item type, extraction only from clausal subjects. 
The data show these assumptions to be questionable: on the face of it, both apparent island constraints look like cumulative effects which apply across the board (see the articles in Sprouse and Hornstein 2013, especially Phillips 2013). This matters, because it means that syntacticians have been looking for something that does not exist (cf. Hofmeister and Sag 2010). There are thus both theoretical and practical reasons to gather quantifiable data, and this will contribute to the production of more detailed and more empirically grounded grammar models. The use of experimental methods will assist us in this, since these enable us to obtain more detailed results. This does not necessarily mean that only fully fledged experimental data can be useful, however, as I shall suggest below.
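The additive pattern described in this section can be made concrete with a small sketch. It assumes a purely additive model and uses invented numbers: the baseline and the two cost values below are hypothetical, chosen only so that the superiority cost is larger than the wh-item type cost, and the names `COSTS`, `predicted_score`, and `ranked_conditions` are illustrative rather than taken from SA12.

```python
# Illustrative only: baseline and cost values are hypothetical, not SA12 data.
BASELINE = 0.5  # score of a structure that incurs neither cost

COSTS = {
    "superiority": 0.6,   # object wh-item precedes subject wh-item
    "no_d_linking": 0.3,  # bare wh-pronouns (who, what) rather than which-phrases
}

# Each condition is described by the set of costs it incurs.
CONDITIONS = {
    "wx-subj wx-obj": [],
    "wx-obj wx-subj": ["superiority"],
    "wh-subj wh-obj": ["no_d_linking"],
    "wh-obj wh-subj": ["superiority", "no_d_linking"],
}


def predicted_score(violations):
    """Additive model: subtract the cost of every factor that applies."""
    return BASELINE - sum(COSTS[v] for v in violations)


def ranked_conditions():
    """Return condition names ordered from best to worst predicted score."""
    scores = {name: predicted_score(v) for name, v in CONDITIONS.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    for name in ranked_conditions():
        print(f"{name:16s} {predicted_score(CONDITIONS[name]):+.2f}")
```

A two-way categorical annotation (starred or unstarred) could not yield four distinct levels here; the four-way ordering falls out only because each factor carries a quantity. Superadditive interactions, ceilings, and floors would need a richer model than this one.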

³ Let me note here that there are other examples where effects seem to have more complex interactions, superadditive for example (see Sprouse and Hornstein 2013). There are also ceiling and floor effects. But for my purposes it suffices if the constraints interact in some way.

10.4.2 Gathering quantified judgments

Setting ourselves higher goals in terms of data quality and descriptive detail brings challenges. While experiments on syntactic phenomena are becoming commonplace, the numerical results are still of restricted comparability. We can look at the size of an effect in numerical form and compare it to the size of another effect within the same experiment, but we cannot compare it to any other value, because there is no common scale. This limits the value of the quantification: for example, Savage (1970) argues that “measurement” requires a scale of units. The way forward I will suggest here is to use the cardinal well-formedness values I briefly introduced in 10.2.2. We have developed 15 sentences which are reliably judged at five roughly equidistant points along the scale of perceived syntactic well-formedness. These five points are labeled A, B, C, D, and E and are designed to provide the same sort of comparison points that the cardinal vowels do for vowel quality. The highest value, A, is very natural and familiar. The lowest value, E, represents a level of ill-formedness which is as unnatural as can be achieved while still maintaining analysability. I provide one example of each here in the text; the full set is in the appendix.

Cardinal well-formedness value A
The patient fooled the dentist by pretending to be in pain.

Cardinal well-formedness value B
Before every lesson the teacher must prepare their materials.

Cardinal well-formedness value C
Most people like very much a cup of tea in the morning.

Cardinal well-formedness value D
Who did he whisper that had unfairly condemned the prisoner?

Cardinal well-formedness value E
Historians wondering what cause is disappear civilization.

Although this was no part of the original design, Gerbrich et al. (2019) suggest that the five values can be associated with the fairly standard annotations of degrees of acceptability used in the literature.
Values A and B are fully acceptable and thus unannotated, but value C can be thought of as anchoring the value question mark (?), level D two question marks (??), and level E the asterisk (*). Not every linguist might use these annotations for exactly these values, but the association of the cardinal values with these conventional degrees of well-formedness seems useful.

We include the fifteen items in every experimental study, where they offer the advantage that they provide an external set of comparison points. This is useful because experimental judgments are often gathered on a continuous scale. This provides much more detailed information, but it does not in itself provide any statement about how good in absolute terms the sentences are. Syntacticians often wish to know not only what relative differences there are between items, but also where on the scale of well-formedness they lie. The standard items provide this information by supplying anchor points along the scale of perceived well-formedness. We therefore include the scores of the standard items in the results graph of experiments.

A further advantage is that the items permit us to compare results directly across studies. This can be done informally, by just looking at the results of two experiments in relation to the five cardinal values. But they can also be used to create a quantified comparison across studies. When we analyze the results of a judgment experiment, it is useful to normalize the data in order to remove the variation in the participants’ use of the scale: some people give better scores, others worse; some use a wider spread of scores, others only a narrow range. In order to compensate for this, researchers often transform experimental judgments into z-scores. This manipulation involves subtracting the participant’s mean score from each score and dividing the result by the participant’s standard deviation of scores. The normalized scores of each participant then have the mean value zero and the standard deviation 1, which removes a degree of inter-participant variation. Now, instead of using the participant’s mean and standard deviation of all scores as the basis for normalization, we can use just their mean and standard deviation of the standard items as the basis. Each individual’s scores will thus be expressed relative to that person’s ratings of the standard items. This provides a directly comparable quantification of judgments even across experiments, as long as the standard items were included in both and the procedure and context of the two experiments are reasonably similar.
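The two normalizations just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in any particular study: the function name `normalize`, the item labels, and all ratings are hypothetical; only one standard item per cardinal value is used for brevity (the text uses three); and the sample standard deviation is assumed.

```python
from statistics import mean, stdev


def normalize(scores, reference_items=None):
    """Return normalized scores for one participant.

    scores: dict mapping item id -> raw rating.
    reference_items: ids of the items whose ratings supply the mean and
    standard deviation. None means all of the participant's ratings,
    which gives ordinary z-scores.
    """
    ref = list(scores) if reference_items is None else list(reference_items)
    ref_values = [scores[item] for item in ref]
    centre = mean(ref_values)
    spread = stdev(ref_values)  # sample standard deviation (an assumption)
    return {item: (x - centre) / spread for item, x in scores.items()}


# Hypothetical data: one standard item per cardinal value, plus a test item X.
STANDARD_ITEMS = ["A", "B", "C", "D", "E"]
wide = {"A": 7, "B": 6, "C": 4, "D": 2, "E": 1, "X": 5}                  # uses the whole scale
narrow = {"A": 5.5, "B": 5.0, "C": 4.0, "D": 3.0, "E": 2.5, "X": 4.5}   # compressed use of the scale

wide_anchored = normalize(wide, STANDARD_ITEMS)
narrow_anchored = normalize(narrow, STANDARD_ITEMS)

if __name__ == "__main__":
    print(round(wide_anchored["X"], 3), round(narrow_anchored["X"], 3))
```

In this constructed case the second participant uses a narrower range of the scale than the first, but both place X in the same position relative to their own ratings of the standard items, so the anchored scores for X coincide.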
Some caution in such comparisons is of course required, because many factors can distort judgment studies (Poulton 1989).

We chiefly use this technique to compare experimental results; but it can have advantages for informal judgments too. The availability of local anchor values can for example permit more exact judgments. The authors of SSA13 discuss how the forced choice methodology produces finer differentiation than the other methods they tested, because it involves the discrimination of difference between two stimuli rather than the placing of one stimulus on a scale. It is known that local comparison points, that is, those with very similar well-formedness values, provide even better support for fine judgments (e.g. Laming 1997); this is also the reason for the intermediate cardinal vowels, for example half-open and half-closed. The standard items provide local comparison points along the full range of syntactic judgments, so that there is always a close comparison point. A linguist wishing to evaluate an example X will (effectively but perhaps unconsciously) perform a series of forced choice judgments of the example relative to the standard examples. If she determines that the well-formedness of X falls between the B value and the C value, she can additionally consider whether X is nearer to B or to C, and perhaps decide that it is closer to B and therefore assign it the rating B-. The standard items thus allow us to import the advantages in exactness of comparative judgments into an anchored scale. This is an example of the way in which improved data quality does not necessarily have to come from full-scale experimental studies. If the field collectively adopted cardinal well-formedness values, there would be a marked improvement in exactness.

The scale also makes judgments more communicable. Our linguist can thus test whether her judgments correspond to those of another person, and they can debate whether a B or a C+ is more appropriate. They can then report their conclusions to a third party, and the judgments will still be meaningful because the anchor points are accessible to any speaker of the language. When they give an example in a paper, they should also give their rating using the five cardinal values A–E, additionally distinguished by plus and minus signs. So I would give the examples in Table 10.1 above the ratings here in (1):

(1) a. (B−) Who poisoned who?
    b. (D+) Who did who poison?
    c. (B+) Which poet wrote which poem?
    d. (C+) Which poem did which poet write?

This gives both information about the absolute ratings of the examples and about the amplitude of the factors that differentiate them. The addition or other interaction of different constraints becomes visible, which, as I have argued above, is an essential component of a grammar. Even if these values do not have the exactness of experimental data, they still go some way toward allowing the field of syntax to make progress. On the one hand, they are explicit and falsifiable, because they are related to the standard items. If another linguist wishes to contest these judgments, they can. If it turns out that example (1c) is generally judged to be worse than the B-level items, then my rating here is in trouble. This is already one big improvement from the traditional situation, where a paper could assert or assume that a particular example was grammatical, but that claim would be so vague as to be incontestable. I think this will have a knock-on effect: if I know that someone can easily demonstrate that my claim is wrong, I will take much more care when choosing the judgment that I give and will ask some other people beforehand.
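The labelling step described in this section (find the nearest cardinal value, then mark whether the example lies above or below it) can be sketched in a few lines. The sketch assumes that a numeric score on an anchored scale is already available; a linguist judging informally would reach the same label by a series of comparisons with the standard items instead. The anchor positions, the tolerance, and the name `cardinal_rating` are hypothetical.

```python
# Hypothetical numeric positions for the five cardinal values, assumed to be
# equidistant; in practice they would come from ratings of the standard items.
ANCHORS = {"A": 2.0, "B": 1.0, "C": 0.0, "D": -1.0, "E": -2.0}


def cardinal_rating(score, anchors=ANCHORS, tolerance=0.1):
    """Label a score with the nearest cardinal value, plus '+' or '-'.

    A score within `tolerance` of an anchor gets the bare letter; otherwise
    '+' marks a score above the nearest anchor and '-' a score below it.
    A score exactly halfway between two anchors goes to whichever anchor
    comes first in `anchors`.
    """
    label, value = min(anchors.items(), key=lambda kv: abs(score - kv[1]))
    if abs(score - value) <= tolerance:
        return label
    return label + ("+" if score > value else "-")


if __name__ == "__main__":
    for s in (1.0, 0.7, 0.3, -1.8):
        print(s, cardinal_rating(s))
```

With these assumed anchors, an example falling between B and C but nearer to B comes out as B-, and one nearer to C as C+, matching the reasoning in the text.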

10.5 Summing up

This chapter had three parts. In the first instance, I wished to highlight some facts that lead me to treat an individual’s judgments with caution. The contrast we saw in 10.2 between group results and individual results shows the apparent random variation present in single judgments and even in the means of three judgments. This finding was supported by the work in SSA13 on the power of different judgment methods which shows little power for small informant sample sizes. But the answer does not always have to be experiments: there are such things as careful judgments. I too use judgments and have some confidence in them, and I suspect that I could manage more consistently to approach the group data of the cardinal well-formedness levels in Figure 10.2 than some of the individuals in Figure 10.4 do. But even I, who have been working on and with judgments for 25 years, do not believe that I can achieve anything like the quality of results available from groups judging carefully constructed sets of materials. It helps too that I know what the data pattern of judgments looks like: I know that I am looking for values on a continuum. Some linguists in the past have tied themselves in knots trying to fit multiple levels into a two-level categorical schema. The failure to recognize the interactions in the superiority and subject island data sets in Figure 10.6 and Figure 10.8 is a testament to this.

The second part addressed the papers SA12 and SSA13, which have been pointed out to me as providing definitive validation for armchair judgments. These are high-quality papers, but I nevertheless do not see them as providing a blanket validation for armchair judgments; nor do the texts themselves make this claim. I advance three arguments to support my position. First, if we change the criterion for replication to reflect the traditional model of categorical well-formedness, which is at least arguably what Adger (2003) assumes, then the replication failure rates rise considerably, reaching a minimum value of a little under 10%.
This strikes me as being more a cause for concern than a cause for satisfaction, but this is naturally a matter of individual interpretation. Much more significantly, the errors tested for are only the false positives, the claims of distinctions in the literature where none have been shown to exist in finer grained data. But making too many distinctions is not the chief problem that armchair judgments have: their main weakness is that they capture too few of the distinctions which finer grained judgments reveal to be present. This poor definition and indeterminacy of the database has been a real brake on progress in syntax for decades, a matter of vastly greater significance than whether some unsupported claims are made in the literature. My final point in this section also relates to the well-formedness criterion adopted in SA12 and SSA13, but addresses it from the perspective of the question that this chapter raises, namely whether judgment data can form the basis for a grammar. I therefore discuss how much and what information we need in order to construct a grammar, contrasting this with the only weakly informative pairwise distinctions. I suggest that a grammar is more than just a rule list; rather, it must also specify how rules interrelate to produce sometimes complex patterns of data. Taking two of the multi-item example sets in SA12 as examples, I argue that the development of the grammar requires the quantification of rule violation costs, because only this will permit us to capture the way that rules interact.

In the final section I look forward. Even if it were the case that armchair judgments had proved to be the ideal data basis so far, it would not follow that they are ideal for further development. The genie is out of the bottle: we have seen all the extra detail that experimental judgments contain, the radically different analyses that it supports; we cannot pretend it is not there. We need to sort out how to deal with the new information that has become accessible. We must seek ways of capturing it, recording it, and using it to build grammars. As an example of this, I return to the five cardinal well-formedness values and their instantiation in standard items. I seek to show that this simple device permits even individuals to give more exact, communicable judgments, which are, furthermore, intersubjectively anchored. They thus permit judgments to be assigned something approaching absolute values and allow the inclusion of important information about the strength of factors.

I should finish by answering the question raised in the title. We can build grammars on the basis of judgments. But the quality of the database is a limiting factor on the quality of the grammar. We have in the past used relatively coarse judgments to build grammars; in consequence the quality necessarily suffered. We now have easy access to a range of ways to improve our database and consequently our models of grammar. I see no reason for nostalgia.

10.6 Appendix

These standard items anchor the five cardinal well-formedness values. The German examples from Featherston (2009) are well established; the English examples here (from Gerbrich et al. 2019) are still undergoing beta testing.

Naturalness value A
The patient fooled the dentist by pretending to be in pain.
There’s a statue in the middle of the square.
The winter is very harsh in the North.

Naturalness value B
Before every lesson the teacher must prepare their materials.
Jack doesn’t boast about his being elected chairman.
John cleaned his motorbike with which cleaning cloth?

Naturalness value C
Anna loves, but Linda hates, eating popcorn at the cinema.
Most people like very much a cup of tea in the morning.
The striker fouled deliberately the goalkeeper.

Naturalness value D
Who did he whisper that had unfairly condemned the prisoner?
The old fisherman took her pipe out of mouth and began story.
Which professor did you claim that the student really admires him?

Naturalness value E
Historians wondering what cause is disappear civilization.
Old man he work garden grow many flowers and vegetable.
Student must read much book for they become clever.

11 Acceptability ratings cannot be taken at face value

Carson T. Schütze

11.1 Introduction

11.1.1 Motivation

In the ongoing debates over the empirical base of linguistic theory, there have been increasing attempts in recent years to conduct experimental tests of acceptability on naïve speakers. One way in which this has often been done has been to take sentences directly from the linguistics literature and present them, via computer, to subjects who are then asked to rate these sentences, for example on a 1–7 Likert Scale (LS) or using magnitude estimation (ME), without the researcher further engaging with the subjects (Sprouse and Almeida 2012a; Sprouse et al. 2013, henceforth SSA; Munro et al. 2010; Song et al. 2014; Mahowald et al. 2016; Häussler and Juzek 2017; Langsford et al. 2018; Linzen and Oseki 2018). Amazon Mechanical Turk (AMT) has been a frequent tool in conducting such studies. While the overall result is that linguists’ judgments are mostly replicated,¹ there is debate over the importance of the number and nature of cases when they are not. For the purposes of the present chapter it does not matter what conclusions, if any, one wishes to draw from these previous studies: all that is relevant is that one believes that gathering judgment data from naïve speakers is sometimes useful. Given this view, my goal is to make the point indicated by the chapter title: finding that subjects’ ratings on a set of experimental stimuli do not align with the published judgments of linguists does not necessarily represent a genuine data discrepancy. I provide empirical evidence that, at least in many instances, subjects’ responses have resulted from factors other than (un)acceptability of the structure that the linguists were actually interested in. How and why this occurs will be expounded in detail. The lesson is that, for the field to make progress, we need to go beyond observing and counting mismatches in the naïve way we have been doing and to strive to understand their causes. Furthermore, such understanding is quite unlikely to be achieved simply by conducting more large-scale crowdsourced acceptability experiments.

What I advocate and demonstrate in this chapter is that, as the field continues to collect large amounts of quantitative data in this manner, as it surely will, it needs in parallel also to collect data in a very different way and for a very different purpose, namely to answer the “Why?” question. Why are naïve speakers rejecting sentences that linguists claim are acceptable, or accepting ones that linguists claim are ill formed? If it turns out, for example, that the reason is that subjects’ ratings are based on some irrelevant alternative parse of a sentence, this could lead to the construction of less ambiguous materials for a subsequent rating experiment. If it turns out that subjects are not interpreting a critical word in the way intended, this could lead to the use of a preceding sentence to provide context in a subsequent rating experiment. Of course, the first attempted “fix” might not be perfectly successful. If the results do not change, or change only for one subset of respondents, we need to keep asking why subjects are giving the ratings they do.

¹ Linzen and Oseki (2018), one of the two non-English studies cited, replicated 11 out of 18 (61 percent) Hebrew contrasts and 14 out of 18 (78 percent) Japanese contrasts, but these relatively low numbers reflect biased sampling: samples were deliberately chosen so as to focus on judgments that the authors believed were incorrect. By contrast, Sprouse et al. or Mahowald et al., for example, tested random samples. Linzen and Oseki’s conclusion is worth quoting: “We stress that our results do not suggest that there is a ‘replicability crisis’ in Hebrew or Japanese linguistics.” Song et al.’s (2014) Korean study replicated 102 out of 118 (86 percent) contrasts exhaustively sampled from two volumes of a journal. It seems premature to draw any conclusions about published judgments in English vs. other languages.

Carson T. Schütze, Acceptability ratings cannot be taken at face value. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Carson T. Schütze. DOI: 10.1093/oso/9780198840558.003.0011
What I describe in this chapter is a kind of experiment that seeks to answer this question, and thus should complement rating questionnaire experiments by helping us to interpret their results and to successively refine them to the point where (ideally) they tell us only about our subjects’ grammars. At a gross level, this kind of experiment is very familiar: psychologists know it as “interview,” linguists know it as “elicitation.” What is (perhaps) novel is how I propose that the field deploy this tool alongside its other, (relatively) new tool, the large-scale judgment survey. I suggest that they be used in tandem: a survey experiment yields an unexpected result, you explore it with an interview experiment, you develop hypotheses, you test these in another survey experiment, and, if the results do not resolve the issue, you repeat the process.

Thus the chapter seeks to make a general point by using a specific example. The general point is that collecting lots of numerical ratings from subjects (of anything, not just sentences) is useful only to the extent that you are confident you know what they are basing those ratings on. If you are not very confident, you should find out, and often the best way is to ask them (generally via a separate experiment).

Now to the specific case at hand. In syntactic argumentation, we are trying to pinpoint one very specific thing that is making a type of sentence go bad (or at least go worse than a very similar type of sentence), but any particular example sentence will have oodles of other properties that naïve subjects could react to, so generally we should have low confidence in our knowledge of the reasons behind their ratings. If those ratings do not match linguists’ claims, is the most likely explanation that the intuitions of linguists were wrong? This is certainly possible, but I will demonstrate that alternatives are abundant; and my hunch is that, as a whole, these alternatives are more likely. The reader may disagree, but the take-home message is that this is an answerable empirical question, and therefore one we should strive to answer each time it comes up, rather than assuming the worst-case answer, namely that linguists got it wrong.

11.1.2 The approach

This section is designed to help the reader understand better how subjects actually respond in crowdsourced experiments of the sort cited in section 11.1.1. Subjects in the lab first underwent an abbreviated version of one of those experiments, interacting only with a computer. The abbreviated experiments were partial replications of SSA, details of which appear below;² thanks to my involvement in that study, I already had ideas about what aspects of the stimuli might have been problematic. Then subjects were interviewed about their responses, (hopefully) to reveal why they reacted to the sentences as they did. This is not unlike how linguists have traditionally gathered data from native speaker consultants, or indeed from each other. When linguists elicit judgments, they may start by asking “Can you say X?,” but a “Yes”/“No” answer is rarely the end of the matter. An interactive discourse typically ensues in which the “subject” can ask a range of relevant questions. If the subject happens to be another linguist, these questions can be formulated in technical terms, for example “Do you want she to refer to Mary? Do you want the modal to scope over negation? Do you care if the elided pronoun gets a strict or sloppy interpretation? Can I put focus on word X? Are you asking the reason for the telling or the buying? What’s a scenario where you would want the sentence to be true? What’s a discourse in which you would want the sentence to be felicitous?” Naïve speakers may seek such information too, or the linguist may offer it up front. The point is that it is common for a sentence that is being judged not to be fully “self-contained,” in the sense that it lacks information relevant to rendering a judgment that bears on the issue the linguist is interested in. 
What is going wrong in crowdsourced judgment experiments, I contend, is that some sentences are being tested for which such extra information is relevant, but obviously there is no way for subjects to ask for or receive it.

² Choosing that study as a starting point was a matter of convenience and should certainly not be read as a claim of superiority: indeed, it should become clear that I am particularly well positioned to identify shortcomings in the materials.


Having subjects come to the lab was crucial: most often, linguistic consultations take place in person (or via an audiovisual computer link such as Skype)—a situation where the use of spoken language provides a much richer signal to work with than purely written materials. By contrast, to my knowledge no one has attempted systematic large-scale acceptability studies on naïve subjects using auditory presentation, presumably because the researcher would immediately face the conundrum of showing that the prosody used for (potentially) unacceptable sentences is appropriate, in other words that it makes the sentences sound neither better nor worse than they “deserve to sound.” This is a challenging methodological problem that I hope the field will begin to tackle. In the meantime, my results will reinforce the value of prosodic information in linguist–subject interactions; this could not have been demonstrated with written follow-up questionnaires.

11.1.3 Roadmap

Section 11.2 recaps the critical details of the study that I am following up on. Section 11.3 provides the methodological details of the new experiments. Section 11.4 reports major qualitative findings. Section 11.5 considers consequences and future directions.

11.2 Background

11.2.1 What we did

SSA tested English syntactic examples sampled randomly from ten years’ worth of Linguistic Inquiry articles, restricted to sentences whose acceptability (we hoped) could be assessed without our having to present any supplementary information concerning interpretation, such as that conveyed by referential indices, the (typically struck through) interpretation of elided material, and so on. The stimuli consisted of 148 “pairwise phenomena”: these presented purported contrasts in acceptability between two sentences that, ideally, were identical in all respects, except for the syntactic issue at hand, as for example (1):

(1) a. *Who do you wonder what bought?
    b. What do you wonder who bought?

In the original articles, one member of the sentence pair had an annotation indicating some degree of degradation (“?”; “*”; etc.), while the other was unannotated. For ease of reference, I call these the “bad” and “okay” members of an

       


item pair.³ In a few instances where the okay member of the pair was not explicitly provided but was clear from context, SSA’s authors supplied it themselves. SSA tested eight tokens of each pairwise phenomenon: generally the one(s) that appeared in the article (the “original token(s)”) verbatim, plus ones we made up that were intended to have the same structural properties but different open-class content (the “new tokens”). (As we shall see in section 11.4, the challenges of inventing new tokens that have just the same properties as the original one(s) turn out to be myriad.) One goal in creating these new tokens was to find evidence that the original linguists’ assertions were empirically true for the full range of structures that the associated theoretical claims would apply to, rather than being an accident of idiosyncratic properties of the particular example(s) presented in the article. (In retrospect, SSA might have been overly ambitious in this regard: see again section 11.4.) Marantz (2005) notes that linguists are seldom explicit about the fact that examples in articles are, typically, stand-ins for large (often infinite) sets of sentences that are claimed to behave identically, and that the crucial empirical claims are about such sets as a whole. Examples in articles are invitations for the reader to verify the claims they are meant to illustrate, often by pondering relevantly related examples beyond—or in place of—those actually provided. 
It would rarely be of any interest to linguistics to know that one particular sentence is (un)acceptable, if such (un)acceptability were not reflective of a large class of relevantly similar sentences.⁴ Consequently, it is actually not relevant to syntactic theory whether native speakers disagree with the token judgments explicitly reported in an article (this could be due, inter alia, to quirks of particular words, whose lexical entries might vary across speakers), so long as they agree that the relevant class of sentences generally patterns as claimed. We should guard against a potential sort of naïve falsificationism in this regard.

SSA conducted its experiments via AMT, testing the full stimulus set in three different tasks with 312 subjects each: single-sentence presentation with ME responses; single-sentence presentation with 1–7 LS responses; and paired-sentence presentation with forced choice (FC) responses (“Which is better, a or b?”). In the first two tasks, a given subject saw only one member of each token pair, while in the third a subject saw both members together.

³ When linguists present pairs of this kind in articles, they are not always explicit as to whether they are claiming only that the okay member of the pair is substantially better than the bad one, or also that there is “nothing wrong” with the okay member. SSA essentially ignored the potential “nothing wrong” claim, which means that the linguists’ judgments were considered confirmed if subjects’ judgments reliably differed between the bad and okay tokens (in the right direction), no matter the absolute rating of the latter. It is not obvious to me how to operationalize claims of “nothing wrong” with a sentence, actually, nor is it clear how important such claims might be for linguistic theory.
⁴ Interestingly, the same is not true in the sentence-processing literature, where it is of great interest that structurally identical sentences may be treated very differently by the parser as a function of lexical idiosyncrasies; see for example the contrast between the trivial (ia) and the intractable (ib). (i) a The mine buried in the sand exploded. b The horse raced past the barn fell.


11.2.2 What we found and what we can and cannot conclude

Results were assessed mainly in terms of whether each pairwise phenomenon showed a significant contrast in the direction claimed by the linguists or not; we ignored issues such as whether “?” sentences were better than “*” sentences. For the vast majority of the 148 pairwise phenomena, our naïve subjects did replicate the published judgments: specifically, this was true for 93 percent of the phenomena in the ME and LS tasks and 95 percent of the phenomena in the FC task. We referred to these as rates of “convergence” between the judgments of linguists and naïve speakers. (The non-convergent phenomena showed a variety of patterns detailed below.) The question that occupied us at the time (and has been taken up in subsequent literature) is whether these convergence rates are a cause for celebration or consternation: that is, do they indicate that the informal way in which linguists have traditionally gathered and presented their data is reasonable, or do they cry out for the field to adopt “more rigorous” standards? Opinions have differed on this point,⁵ but most responses to these and similar results have made an assumption that the present chapter is designed to question: people have generally taken for granted that the non-convergent phenomena constitute bad data, in other words that linguists’ claims about them must have been wrong.⁶ For the sake of discussion, let us suppose that, if naïve speakers’ judgments of these phenomena “genuinely” differed (in a sense that I make clear) from those reported by linguists, we ought indeed to reject the linguists’ judgments and to adjust our theories so as to account for the naïve judgments instead.⁷ I want to be emphatic that we would not be justified in taking that step yet, because the premise is most likely false: the non-convergent data probably do not represent genuine judgment disagreements.
The crucial point is that there is a logical gap between the experimental observations and the conclusion: knowing that subjects gave non-convergent responses to a set of strings does not entail that their grammars do not conform to the pattern claimed in the source articles—it could be that subjects’ responses

⁵ In reviewing an earlier draft of this chapter, Ted Gibson emphasized two points that he states he and his colleagues have made in their responses to SSA (Gibson et al. 2013; Mahowald et al. 2016). In fact, the former paper was responding partly to an early draft of SSA, written before the published experiment was even run, and mostly to Sprouse and Almeida (2012a); and both points are arguments against the suggestion that syntacticians do not need to run formal experiments, a suggestion that appears nowhere in SSA. Furthermore, since this chapter is about interpreting the results of such experiments, I obviously am not making that suggestion. ⁶ Thus Gibson et al. (2013) repeatedly use the phrase “error rate” to refer to the non-convergence rate. By contrast, Mahowald et al. (2016: 626) are more cautious: “It should not automatically be concluded that . . . these sentences [represent] failures on the part of the researchers.” ⁷ As Jon Sprouse and I have discussed (separately and jointly) in previous work, it is not obvious that this move would be appropriate. Among other things, we would want to know whether linguists agree among themselves about the judgments in question or whether the original authors were outliers, whether we might be dealing with genuine dialectal differences, etc.

       


reflected factors other than the acceptability of those sentences on the intended interpretations. This would obviously be true if, for example, subjects systematically misread a set of stimuli—and I show that this can happen. More generally, we cannot establish genuine judgment disagreements until we are certain that the subjects read and understood the stimuli in a manner relevant to linguists and that the low ratings given individually or the dispreferences shown in pairs for bad items were due to the property that the linguists identified as their flaw, rather than to some completely orthogonal issue. The experimental results to follow suggest that it is common—though they cannot indicate how common it is—for at least one of these preconditions not to be met: in many cases, the subjects’ responses will be demonstrated to be based on a structural parse that differs from the one of interest to the linguist, or on disliking some property of the sentence that has nothing to do with its purported grammatical violation. Consequently, the results of SSA and similar studies may at best⁸ indicate a lower bound on what the true rate of convergence is (for whatever range of phenomena its authors examined): there could well be some genuine judgment disagreements among the non-convergent items, but I suspect the majority are spurious: only further experimentation designed to avoid the confounds I exemplify below can answer this question. To be frank, the reason why I can illustrate many ways in which judgment tasks can go wrong is partly that there were abundant flaws in the materials. But those flaws, I would argue, are less the result of carelessness and more the result of ambition—the desire to test the claims actually being made in the source articles, rather than the more straightforward but less informative alternative of testing the properties of the particular sentence that happened to be chosen to illustrate a given claim or truly trivial variants thereof.⁹

11.2.3 Why we should not be surprised

Upon reflection, it is not surprising that the results of crowdsourcing experiments such as those of SSA are challenging to interpret. It is naïve to think that AMT could be the panacea for linguistic data gathering that some seem to have hoped it to be: in addition to the domain-independent problems of not knowing much about who the subjects are or where, when, under what conditions, and with what

⁸ I say “at best” because I cannot exclude the possibility that some convergent results were not genuine judgment agreements: subjects might have given the “right” answers for the wrong reasons. Indeed, we shall see some potential cases of this type.
⁹ SSA could have simply replaced John with Bill, Bob, Fred . . . and cat with dog, rabbit, hamster, and so on, but generally we aimed higher. We might well have achieved slightly greater convergence rates if we had taken the easier road, but we will learn more in the long run as a result of the fact that we did not. Indeed, to the extent that the materials in the present experiments diverge from those of SSA, the intent was to be still more ambitious, creating more opportunities to (fail and hence) learn.


intentions they may be engaging in our tasks, the nature of language in general and of grammaticality judgments in particular induces domain-specific problems. Subjects are reading isolated sentences. There is no context to help them zero in on the intended meaning, only scant hints as to how the sentence should sound, and they have no opportunity to seek clarification, or even indicate confusion: all they can do is pick a number between 1 and 7, or pick one of the sentences. The communication bandwidth between experimenter and subject is incredibly narrow by comparison with how linguists traditionally gather data, be it from a colleague down the hall, a class of undergrads, or a fieldwork consultant. While the general problems with AMT may simply contribute noise, the domain-specific ones can lead to systematic confounds. Furthermore, examples that appear in linguistics articles are generally not designed for presentation to naïve subjects, but rather addressed to the primary readership of such articles: other linguists, who understand what issue(s) an example is intended to bear on and who will therefore ignore aspects of the example that are irrelevant to those issues. (Indeed, examples are sometimes further shaped by a desire to make them entertaining to fellow linguists.) I suspect it is common for linguists, as they read articles, to consciously observe that applying their “raw judgments” to such examples may yield results that do not conform to the author’s claims, but that they can construct related examples that factor out orthogonal problems in order to verify the crucial claims. (By contrast, when psycholinguists design materials for experiments, great care is generally taken to avoid ambiguity, unnecessary processing complexity, pragmatic implausibility, garden pathing, uncommon words, etc.) Trying to test examples verbatim from articles was, in retrospect, asking for trouble. 
From this perspective, convergence rates of 93 percent to 95 percent across a broad sample of linguistic phenomena are almost miraculous, in my view— testimony to the robustness of the underlying cognitive structures. Still, we clearly must take the non-convergences seriously; and I do not advocate excluding platforms like AMT from the linguistic enterprise, though they surely can never obviate the need for face-to-face encounters.

11.3 Experimental methods

11.3.1 Properties common to all three experiments

Participants

All subjects were UCLA undergraduates from the psychology department’s subject pool who had not taken a linguistics course; they received course credit for participation. They self-identified as native speakers of North American English

       


(not necessarily monolingual, because that restriction would have excluded too large a proportion of the subject pool). Procedure Subjects came to our lab, where the experiment was conducted in a quiet room. As dictated by subject pool constraints, it lasted at most one hour. After informed consent was granted, the first phase of the experiment used a computer to collect judgments on sentences presented visually on the screen without interaction with the experimenter. The second phase consisted of an interview in which the experimenter presented some of the sentences seen in the first phase on paper and asked subjects about their reactions to them. This phase was audio-recorded and later transcribed by the same experimenter, who also took contemporaneous notes that could be used to add clarification to the transcripts. Between the two phases a computer program was run to generate the list of sentences for subsequent discussion; two copies of this list were printed, so that the experimenter and the subject could each read it easily. Items were numbered for ease of reference, and the subject’s response from the first phase was indicated. The program used heuristics to identify, on the basis of a given subject’s responses, sentences that were likely to be informative to discuss; then it ordered them, placing those likely to be of greatest interest first. (This was done because of the time constraint: interviews often ended before all sentences on the printout could be discussed. To further maximize the breadth of information gathered, the experimenter had the discretion to skip items exemplifying a phenomenon already discussed.) The details of the heuristics were complex and are not claimed to have any scientific validity, but they did contain some bias, which should be kept in mind when considering the results that follow. 
For items whose mean outcome from SSA was convergent, responses that were probably replicating that result¹⁰ were usually considered to be of little interest, while opposite responses were likely to be flagged for further investigation. The reverse was true for items that were divergent in SSA. Consequently, if there are items that naïve subjects generally give the same ratings to as linguists but do so for the wrong reasons (a phenomenon we might call “false convergence”), the current study is less likely to uncover them (but we will see a couple of possible cases). In the interview phase, discussion of each item began with the experimenter asking the subject to read the sentence(s) out loud. This was intended both to

¹⁰ The caveat “probably” is necessary because of the logic of the design of the LS experiments. In such experiments a given subject sees only one member of each minimal pair token (for example sentence (1a) or (1b)), but the aggregate results consider the direction of difference in mean ratings between members of minimal pairs. Thus the heuristics can only look at the absolute (raw or normalized) ratings of individual sentences (comparing them, say, for example to those of SSA) or compare condition means of tokens drawn from different minimal pairs of the same pairwise phenomenon.


reveal any potential misreadings (for example, skipped words) and to elicit prosody that could indicate parsing choices the subject might have made. Nontrivial misreadings were corrected; trivial misreadings, such as definitely for definitively or subject for suspect, were not. How the discussion proceeded thereafter was left to the experimenter’s discretion, but usually this included a request to paraphrase the sentence and, if the subject had given it a low rating (or if the sentence was the dispreferred member of a pair in the FC task), a request to identify the part or aspect of the sentence that sounded bad to the subject. There could then be follow-up questions that would further elucidate the subject’s responses.

Materials

The stimuli were created from the items used by SSA, in particular from thirty-four of the 148 unique pairwise phenomena. Many items were retained verbatim, while some underwent changes of various extents, all intended to address concerns that became apparent after running that study.¹¹ (Unfortunately those changes themselves sometimes introduced new problems, as I confess in section 11.4.) The thirty-four phenomena break down as follows:

4 pairwise phenomena that came out significantly in the unexpected direction in at least one of the three tasks in SSA;
2 pairwise phenomena that came out numerically in the unexpected direction in at least one of the three tasks;
6 pairwise phenomena that came out numerically but non-significantly in the expected direction;
10 pairwise phenomena that came out significantly in the expected direction in some but not all tests;
12 pairwise phenomena that consistently came out significantly in the expected direction.¹²

The number of token pairs per phenomenon ranged from one to six in experiments 1 and 3 (this variation reflected suspicions that differences among tokens

¹¹ Many of the flaws are discussed in the online supplement to SSA at http://sprouse.uconn.edu/papers/SSA.Materials.xlsx. In our defense I can only point out the enormity of our task: we had to create 148 × 7 × 2 = 2,072 items for those experiments.
¹² All these descriptions suppress some detail: SSA ran three different kinds of significance tests on each task’s results, which did not always agree, and for some phenomena the numerical results fell in opposite directions across tasks, particularly when comparing FC to the other two, so the outcomes resist brief summary. Suffice it to say that I included all the phenomena that raised doubts about the original linguists’ claims, along with phenomena where particular tokens seemed to behave suspiciously. Note that, of the subsets listed here, both the 4 + 2 and the 6 could be seen as failures to replicate linguists’ claims (non-convergence), though the 6 could also be consistent with convergence without sufficient statistical power. Viewing all of them as replication failures would add up to 12/148 = 8.1 percent non-convergence; SSA’s non-convergence percentages (their Table 6) range from 1 percent to 14 percent because they were broken down by task and statistical measures.

       


might have been important),¹³ but was fixed at four in experiment 2. The two systematic changes made vis-à-vis SSA were that items in the present study could have one or two words printed in caps, to indicate intended prosodic emphasis, and could have commas added, in cases where this could be done identically in both members of a minimal pair and was thought to be helpful in bringing out the intended reading. In addition to the target items just described, each experiment included catch items that were believed to be uncontroversially acceptable or uncontroversially unacceptable (also labeled respectively “okay” and “bad” in what follows) and were intended to assess how closely subjects were paying attention. Since the targets represent a wide variety of sentence types, no additional fillers were included. Instructions Apart from describing how the experiment would be carried out, the instructions included information on how to interpret “grammaticality.” This information read as follows, with slight variations for the FC task (this was different from the wording used by SSA): After you read each sentence, you will rate it based on how “grammatical” it sounds to you. For the purpose of this experiment, a grammatical sentence is one that would seem natural for a native speaker of English to say. In contrast, an ungrammatical sentence is one that would seem unnatural for a native speaker of English to say. The response scale goes from 1 (definitely ungrammatical) to 7 (definitely grammatical). To make it clear what we would like you to base your responses on, here are some things that we are NOT asking you to rate: (1) How understandable the sentence is: A sentence may be perfectly easy to understand even though native speakers of English would agree that it is ungrammatical. For example, if you heard someone say “What did you ate yesterday?,” you would have no difficulty answering the question, but you would still give the sentence a low rating. 
(2) How the sentence would be graded by an English teacher or writing tutor.

11.3.2 Properties unique to individual experiments

Participants

Experiment 1 involved twenty-three participants, three of whom had their responses excluded on the basis of their ratings of the catch items.¹⁴ Experiment 2

¹³ In a few cases, I made refinements to particular tokens from one experiment to the next. Since the results here are pooled across experiments, some lists of items contain tokens that did not appear in the same experiment.
¹⁴ Exclusion criteria were a mean rating for the two okay items lower than the mean rating for the two bad items, or neither of the bad items being rated lower than 3.


involved sixteen participants and had no exclusions on that basis. Experiment 3 involved twenty participants and no exclusions either.¹⁵ Procedure Experiments 1 and 2 elicited ratings of individual sentences on a 1–7 LS (they were LS tasks).¹⁶ Subjects pressed a number on the computer keyboard that corresponded to one of the seven possible ratings. On the screen, potential ratings were labeled only by number, but to the left of “1” was the string “definitely ungrammatical” and to the right of “7” was the string “definitely grammatical”; these reminded subjects of the direction of the scale. To familiarize subjects with the task and to attempt to anchor the response scale, four practice trials were presented before the experiment proper. Experiment 3 elicited FC responses to minimal pairs of sentences (it was an FC task). Subjects clicked on a radio button next to the sentence they considered more grammatical. Sentences were displayed one above the other, counterbalancing which position was occupied by the okay member of the pair. I was concerned that this presentation mode might induce a strategy whereby subjects could identify where the two sentences diverged by visual matching without actual reading, and then base a response upon only the mismatching substrings. To deter them from doing this, one member of each pair was indented (the choice was counterbalanced against both vertical position and bad/okay status)—as can be seen for example in (2) or (3). (2)

Who do you wonder whether saw John? Who do you wonder whether John saw?

❍ ❍

(3)

Who do you wonder whether saw John? Who do you wonder whether John saw?

❍ ❍

Four practice trials were presented before the experiment proper. They were meant to familiarize subjects with the task.

Materials

Experiment 1 contained 63 target sentences, which represented 33 of the 34 pairwise phenomena, plus 4 catch sentences. Experiment 2 contained 88 target sentences, which represented 22 of the 34 pairwise phenomena, plus 4 catch sentences. In experiments 1 and 2, each subject saw only one member of each token pair (half okay, half bad); item order was randomized for each subject.

¹⁵ The exclusion criterion was making the expected choice on fewer than 6 of the 8 catch item pairs. ¹⁶ LS was chosen over the ME task for two reasons: it is easier for subjects to understand, and the results are less noisy statistically (SSA; Weskott and Fanselow 2011).
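
The catch-item exclusion rules stated in footnotes 14 and 15 can be written down as a short check. The Python sketch below is only an illustration of those two rules: the function names and the data layout (a list of ratings per subject, a list of booleans per subject) are assumptions made here, not a description of the software actually used in the experiments.

```python
from statistics import mean


def exclude_ls_subject(okay_ratings, bad_ratings):
    """Return True if an LS subject fails the catch-item check (footnote 14).

    A subject is excluded if the mean rating of the okay catch items is
    lower than the mean rating of the bad catch items, or if none of the
    bad catch items was rated lower than 3.
    """
    okay_below_bad = mean(okay_ratings) < mean(bad_ratings)
    no_bad_rated_low = all(rating >= 3 for rating in bad_ratings)
    return okay_below_bad or no_bad_rated_low


def exclude_fc_subject(expected_choices, minimum=6):
    """Return True if an FC subject fails the catch-item check (footnote 15).

    `expected_choices` holds one boolean per catch pair (hypothetical
    layout): True when the subject chose the okay member. The subject is
    excluded when fewer than `minimum` of the pairs were answered as expected.
    """
    return sum(expected_choices) < minimum
```

For instance, under these assumptions a subject who rated the two okay catch items 7 and 7 but the two bad ones 3 and 4 would be excluded by the second LS criterion, even though the means are ordered as expected.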

       


Experiment 3 contained 63 target sentence pairs, which represented 33 of the 34 pairwise phenomena, plus 8 catch pairs. In this experiment four lists were generated to accommodate the four presentation variants of each sentence pair (okay vs. bad on top, crossed with indented vs. unindented on top); item order was randomized for each list.
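
One way to picture the four lists is as a 2 × 2 crossing of presentation variants rotated across sentence pairs. The sketch below assumes a simple Latin-square rotation, which the text does not specify; the names `VARIANTS` and `build_lists` are hypothetical, and the per-list randomization of item order is left out.

```python
from itertools import product

# The four presentation variants of a pair:
# (which member is on top, whether the top sentence is indented).
VARIANTS = list(product(("okay", "bad"), (True, False)))


def build_lists(n_pairs):
    """Build one variant assignment per list by rotation.

    Assumed scheme: pair i in list j receives variant (i + j) mod 4, so each
    pair appears in every variant exactly once across the four lists.
    """
    n_variants = len(VARIANTS)
    return [
        [VARIANTS[(pair + lst) % n_variants] for pair in range(n_pairs)]
        for lst in range(n_variants)
    ]


# 63 target pairs plus 8 catch pairs, as in experiment 3.
lists = build_lists(63 + 8)
```

With a rotation of this kind, the variants are also spread as evenly as possible within each list, which is the usual motivation for crossing position with indentation in the first place.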

11.4 Results

The purpose of the discussions that follow is not to try to establish what the “true” acceptability of the target sentences is, but rather to illustrate the ways in which numeric ratings by themselves can be misleading if we know nothing about the basis on which the ratings were offered. The forthcoming presentation of results is characterized by an absence of statistics, either descriptive or inferential. This is deliberate and necessary. As can be gleaned from the preceding discussion, neither the full set of items in the first phase of the experiment nor the subset discussed during the second interview phase in any sense constituted random or balanced samples relative to the original full SSA random sample. Hence quantitative summary statistics would not be meaningful. Results have been subjectively selected for reporting: they have been selected (1) on the basis of their potential to inform our understanding of how subjects approached the tasks in general and the assessment of specific sentence structures in particular, and (2) with the aim of representing the wide variety of challenges involved. I attempt to give some sense of how common a given type of reaction to an item was, keeping in mind that there was not much control over how likely a given subject was to be interviewed about a given item.

11.4.1 General observations

Before delving into details, a few general observations are in order, all of which will be exemplified in the case studies to follow in section 11.4.2. One concerns misreadings: the most common error in reading out loud was the omission of short words, and when this was pointed out to subjects they typically remarked that they had made the same error during the first phase of the experiment, which means that the computer-elicited responses were not judgments of the intended string. The potentially dire consequences of such errors were brought home by the fact that several subjects gave high ratings (e.g. “7”) to the following original catch item:

(4) Who do you suspect I would be capable of committing such a crime?

presumably because they missed the word I. The lesson is that, when possible, crucial contrasts should be expressed using more and/or longer words rather


than hinging on one short one,¹⁷ but as discussed below this was sometimes not possible. A second observation concerns the attempt to manipulate prosody by capitalizing words intended to receive emphasis. The subjects’ out-loud readings sometimes did not follow these “hints,” either because they forgot the relevant instruction or because their attempt to implement it yielded an unexpected result. In such cases the experimenter would normally eventually say the target string with the intended prosody (after trying to determine how the subject originally understood the sentence), and this often yielded a reaction indicating that this had revealed an interpretation the subject had not previously been considering. Again, this means that the computer-elicited ratings were not judgments of the intended structure. This result highlights the importance of implicit prosody in silent reading tasks and the need to find (more reliable) methods for conveying intended prosody when this is important. A third observation concerns word choice. In the LS experiments subjects sometimes identified the flaw that triggered a low rating as a particular word that they found unfamiliar or so uncommon that “No one would say that.” Given the close matching within minimal pairs of items, these words virtually always were identical in the bad and okay versions of a sentence, making these responses orthogonal to the point that the original linguists were illustrating. While subjects occasionally remarked on these words in the FC task too, they obviously could not use them as a basis for preferring one member of a pair over the other. Methodologically, it might thus seem advantageous to present sentences as minimal pairs rather than as singletons, in order to avoid such irrelevant responses. (Of course, such presentation need not be combined with an FC response task— one could still elicit numeric ratings of each sentence individually.) 
However, pairwise presentation also draws conscious attention to the manipulation of interest, which might have unintended consequences, as we shall see. A fourth observation is that, as is evident below, some minimal pairs express identical (intended) meanings while others, by virtue of the syntactic contrast they address, cannot possibly do this. In the latter situation, responses in both tasks could be influenced by differences in the plausibility of the scenarios described, and we shall see evidence that this happened. Such responses could in principle lead to false convergence or false divergence. The way to avoid this would be to pre-test a large number of candidate sentence pairs in order to identify a subset for which plausibility ratings are matched. A fifth and final observation is that, in contrast to the impression one might get from just the first case study presented here in (5)–(13), instances where subjects felt they did not understand a sentence they read were rare; much more commonly, they arrived at an understanding that was different from the one the linguists had intended.

¹⁷ And indeed I was replaced with she once this pattern was noticed.

       


11.4.2 Case studies

In this section, pairwise phenomena are always presented with the bad member labeled (a) and the okay member labeled (b), but no annotations of ill-formedness are included. Example pairs marked with “†” are original tokens (except for caps, which were never in the original). The citation at the end refers to the article whence the phenomenon was drawn by SSA’s random sampling procedure (in that source it may have represented an original claim by the authors, or reported a previous or “standard” judgment). Items have been grouped into subsections according to the properties of the sentences. For each pairwise phenomenon, the relevant set of tokens is listed, followed by discussion of subjects’ responses.

Antecedent-contained deletion

(5) a. John wants for everyone YOU do to have fun.†
    b. John wants for everyone to have fun that YOU do.¹⁸

(6) a. Sophia is anxious for everyone YOU are to arrive.
    b. Sophia is anxious for everyone to arrive that YOU are.

(7) a. Ben is hopeful for everyone YOU are to attend.
    b. Ben is hopeful for everyone to attend that YOU are.

(8) a. We want for everyone that you do to have fun.
    b. We want for everyone to have fun that you do.

(9) a. Sophia is excited for everyone that you are to arrive.
    b. Sophia is excited for everyone to arrive that you are.

(10) a. The coach is thrilled for every player that you are to play.
     b. The coach is thrilled for every player to play that you are.

(11) a. Valerie is excited for everyone YOU are to graduate.
     b. Valerie is excited for everyone to graduate that YOU are.

(Fox 2002)

These are instances of antecedent-contained deletion (ACD) modifying the subject of an embedded infinitival clause. Both items in a pair have the same intended meaning, for example for (5), John wants (for) [DP everyone OP_i that you want [t_i to have fun]] to have fun; thus do is intended to be the dummy auxiliary introducing an elided VP, and the emphasis on you is meant to contrast it with John¹⁹ and to encourage deaccenting do. These were undoubtedly the most difficult structures in the experiments for the subjects to figure out. To the extent that participants were able to paraphrase the sentences (often they said that they were unable to), their paraphrases were almost never the intended ones. For example, for (5a) subjects offered “John wants everyone that you do something for, or do something with, to have fun” and “John wants everyone to have the same amount of fun, so, all the fun that you had, John wants that for everybody.” These paraphrases make clear that the intended antecedent for VP ellipsis was not identified: in the first, do was apparently interpreted as a main verb and there was no ellipsis at all, while in the second the elided material seems to be (have) fun, that is, John wants everyone to have the same amount of fun that you have. Similarly, in (6a) a subject offered a “no ellipsis” interpretation where everyone you are contains a copular relative clause and means “every aspect of you.” Sentence (7a) got paraphrased as “Ben is hopeful on behalf of everyone that you will attend.” But the (b) versions fared little better: one subject paraphrased (8b) as “[We] want everyone to have the same amount of fun as you have.” Two participants paraphrased (9b) thus: “Sophia is as excited as you are for everybody to arrive”; a third rendered it as “Sophia is excited . . . uh . . . 
something about that you’re arriving.” Even a subject who paraphrased (5b) accurately gave it a “2” rating because “it just starts off funny . . . I wouldn’t say a sentence like that.” Sentence (9a) grammatically allows an unintended parse that involves no ellipsis, on which that you are to arrive is a CP meaning “that you are going/expected to arrive,” such that the whole sentence means roughly Sophia feels excitement on behalf of everyone about the fact that you will be arriving. Several subjects reported this interpretation (which often got a high rating).²⁰ Similarly, (10a) was wrongly interpreted by several subjects as The coach is thrilled for [every player [who_i you are going/expected to play (with/against) t_i]], which was then sometimes rated low because it seemed odd that a coach would be thrilled for opposing players rather than their own players. In the FC task, a subject who read both members of (11) (and supplied emphasis where intended) gave them different paraphrases, taking (11a) to mean “Out of all graduating people, Valerie is glad that you’re graduating too” and (11b) to mean “Valerie is more excited for you to graduate than other people.”

¹⁸ This is, syntactically, a non-minimal pair, in that the complementizer in the relative clause is overt in (b) but silent in (a). Fox makes no comment on this, and presumably he could just as easily have had overt that in (a), so I included it in some tokens. The same is true for the subsequent data set, which is based on Bhatt and Pancheva (2004).

¹⁹ It was felt that putting the matrix subject in all caps in such examples was not critical: out of the blue, a non-pronoun in this position seems likely to get a pitch accent anyway.

²⁰ I suspect that the overtness of that in (9a) and the absence of contrastive capitalization made this parse more readily available. The fact that not all tokens in this paradigm were given the capitalization treatment was intended to probe for such effects.

(12) a. I expect that everyone YOU do will visit Mary.†
     b. I expect that everyone will visit Mary that YOU do. (Bhatt and Pancheva 2004)

       


(13) a. I suspect that all the couples that HE does will kiss.
     b. I suspect that all the couples will kiss that HE does.

This pairwise phenomenon differed from the previous ACD structure by virtue of the relativized DP being the subject of a finite clause. Several subjects reported that they could not discern the intended meaning of the sentences. Some reported that the meaning became clear once the experimenter read the example with the intended prosody. One subject took (12a) to mean “I expect that everyone you visit will visit Mary,” which seems to involve the antecedent for VP ellipsis coming after the ellipsis site, and requires ignoring Mary as the object in that antecedent. One thought that (13b) might mean “Everyone he kisses, those couples will kiss,” which seems to involve taking the elided VP to be headed by kiss (and transitive, despite the fact that the antecedent would be intransitive). Another paraphrased (13b) as “I suspect that all the couples will kiss the way that he does.”

Extraction from PP within DP

(14) a. Who did they find a parent of guilty?
     b. Who did they send a parent of to an unpleasant meeting? (Bruening 2010)

(15) a. Who did the citizens elect an enemy of mayor?
     b. Who did the coach trade an enemy of to another team?²¹

(16) a. Who did officials proclaim an associate of the winner?
     b. Who did officials dispatch an associate of to the embassy?

Both members of each pair are intended to include a DP containing a PP with a stranded preposition, for example [a parent of t_who]; at issue is whether extraction from the subject of a small clause (e.g. complement of find) is harder than from the (direct) object of send. Unfortunately, (14a) is conducive to a garden path effect in a way that (14b) is not: the word following of could be the beginning of an overt complement to of, as in [a parent of [guilty children]]. The correct parse would typically differ from the incorrect one in that it contains a prosodic break after of, but it was not obvious how to convey that prosody in visual presentation. Tellingly, when some subjects read the sentences out loud for the first time, there was sometimes no such break, as if a parent of guilty were a constituent. It is apparently on that basis that some of them rated the sentence “1,” perhaps because that would leave no obvious position for the trace of who. These were therefore cases of getting the right result for the wrong reason (false convergence). (Some subjects were subsequently able to identify the correct interpretation after the experimenter read the sentence with the intended prosody.) Another sort of misparse seems to be reflected in a subject’s paraphrase of (15a) as “Who did the citizens elect that was an enemy of the mayor?” When SSA’s authors created (16), we apparently failed to notice that it, unlike the other tokens, is not just temporarily but in fact globally ambiguous: instead of the trace of who being the complement of of, it could be the subject of the small clause: Who_i did officials proclaim [SC t_i [an associate of the winner]]. At least two subjects seem to have gone for this parse, on the basis that their prosody grouped the last five words as a constituent.

²¹ The pair in (15) differs on the noun in the matrix subject because SSA’s authors felt that this maximized the plausibility of each sentence (at the cost of a lexical mismatch).

Control

(17) a. I told Mr. Smith that I am able to paint the fence together.†
     b. I told Mr. Smith that I wonder when to paint the fence together. (Landau 2010)

(18) a. Jennie discovered that her boyfriend dared to kiss in front of his parents.
     b. Jennie discovered that her boyfriend hoped to kiss in front of his parents.

The intended meaning of both sentences in (17) was that together should refer to Mr. Smith and I, that is, two people together would do the painting. The hypothesis was that the PRO subject under wonder when could take those two DPs as a joint antecedent because, as an interrogative predicate, it allows Partial Control, whereas able, a modal predicate, requires Exhaustive Control and would only allow its complement subject PRO to be interpreted as I, providing no grammatical antecedent for together. The same contrast was predicted for (18) on the grounds that dare is implicative (Exhaustive Control) while hope is desiderative (Partial Control). Subjects who gave (17a) a high rating did so because of a different interpretation on which paint the fence together means “paint the fence such that it would be/stick together,” where together is a resultative predicated of the object.
Subjects described this meaning as involving, for example, painting on glue to fix the fence. Other subjects apparently treated together as an object depictive, meaning that the fence is “all in one piece” or “already assembled.” For (18), several subjects in the FC task reported a preference for (18a) over (18b) on the grounds that the latter describes a much less plausible scenario: given that the boyfriend wants to kiss in front of his parents and assuming that the parents will be displeased by this, dare acknowledges the danger and might suggest a rationale (demonstrating his love is more important?), while hope does neither.

(19) a. The bed was slept in naked.
     b. The bed was slept in wearing no clothes. (Landau 2010)

(20) a. The club was entered shirtless.
     b. The club was entered wearing no shirt.

The intended meaning of (19) was that the implicit agent, the one who did the sleeping, was naked or wearing no clothes. Landau’s claim was that the participial in (19b) should be able to be interpreted with the “weak” implicit agent as its subject because it heads a Control structure [PRO wearing no clothes], whereas the bare adjective in (19a) should not because it combines with the main clause through predication. However, several subjects gave the (b) sentences very low ratings because they took them to mean that the bed was wearing no clothes, the club was wearing no shirt, and so on, a meaning they (reasonably) found anomalous. Other reasons for the low ratings of (19b) included “You never say ‘the bed was slept in,’ you just say ‘I go to bed.’” Examples like (20), where the same noun (shirt) appears suffixed by -less and following no, seem closely synonymous; this triggered responses in FC preferring (a) because it is less wordy or more concise.²² This is a case where drawing attention to the difference between the members of a minimal pair might have had an undesirable consequence: it is possible that, if subjects find such pairs equally well formed, they fall back on concision as a basis for establishing a preference. (Of course it is possible that subjects in the LS experiments were also responding on the basis of concision, despite only seeing one member of such pairs, by thinking up (a) as a synonym for (b). Indeed, thinking of other strings that would express the meaning of a target string appears to be a common strategy; see the section “A-movement” in this chapter.)

²² The present design does not allow us to work out whether this was actually the basis for subject responses in the first phase of the experiment or whether they came up with it as a way to rationalize their preference after the fact during the second phase. Either way, Landau’s claim is called into question: one would probably not expect such responses if the (a) version violated the grammar in a way in which the (b) version did not.

Floating quantifiers

(21) a. All the winners are unlikely to have all been notified already.
     b. The winners are unlikely to have all been notified already. (Costantini 2010)

The alleged problem with example (21a) is two instances of all “floated” off the same DP chain, but numerous subjects rated such sentences very high (“6” or “7”) in the first phase; when asked to reread them in the second phase, many recalled not having noticed this redundancy, which means that their initially high rating was based on a misreading of the string. (This is an example of the “one short word” problem for which there is no obvious work-around.) However, other subjects seemed perfectly happy with the two alls, reading the (a) sentence correctly and affirming their initially high ratings (e.g. “7”). This warrants exploration of a potentially genuine data disagreement.


Embedded tense and aspect

(22) a. George remembered that he would have made the phone call by the time he leaves work.
     b. George remembered that he will have made the phone call by the time he leaves work. (Martin 2001)

The adjunct by the time he leaves work was intended to be interpreted as part of the embedded clause in both examples; this should lead to a tense clash within that clause in (22a), but not in (22b). However, if the adjunct is instead construed as modifying the matrix clause, it causes a tense clash (with remembered) in both examples. Some subjects reported this interpretation and consequently rated (22b) low.

Complementizer omission

(23) a. It seems as of NOW David had left by December.
     b. It seems as of NOW that David had left by December. (Bošković and Lasnik 2003)

In (23) the original claim was that omitting the complementizer that when a temporal PP like as of now intervenes between seem and its complement is degraded. But many subjects gave (23b) a very low rating (“2” or “3”), apparently for independent reasons: they tended to comment on the seeming temporal inconsistency between seems and now versus had left and by December. (This problem was introduced by SSA: the article’s original example used simple past throughout. However, it contained another confound that SSA was attempting to avoid: It seemed at that time David had left† should be fine if the PP is parsed as part of the embedded clause; Bošković and Lasnik indicated with bracketing that this string was purportedly bad only if the PP is parsed as part of the matrix.)

(24) a. Brittany knew that Morty had lost his wallet, but Morty had lost his KEYS, Brittany DIDN’T believe.²³
     b. Brittany knew that Morty had lost his wallet, but that Morty had lost his KEYS, Brittany DIDN’T believe. (Bošković and Lasnik 2003)

²³ In SSA we followed the example from the article, which contained only the structure following but, for example (That) John likes Mary, Jane didn’t believe†. But I was very doubtful that subjects would parse such strings with the first clause as the fronted complement of the second, so I “modeled” the underlying structure in the new first half of the sentence, using contrast to attempt to motivate the fronting in the second half.

       


The problem in (24) was omission of the complementizer in the second conjunct of (a): a bare TP, unlike the CP in (b), allegedly cannot be fronted. However, the omission in (24a) allows a garden path whereby Morty had lost his keys is directly conjoined with the material preceding the first comma; if one adopts that parse, Brittany didn’t believe is missing a (not quite obligatory) complement for believe, and its relationship to the rest of the sentence is obscure. It was this unintended parse that was rated fairly low by some subjects, another example of getting the right result for the wrong reason.

(25) a. What the conductor believes is the train will crash.
     b. What the conductor believes is that the train will crash. (Bošković and Lasnik 2003)

(26) a. What the bidders hope is they will get the house.
     b. What the bidders hope is that they will get the house.

The alleged badness of complementizer omission in these examples is due to the embedded clause being pseudoclefted. One subject misread (25a), supplying the missing that, and rated it “7” on this basis. Another rated (26b) “2” because that participant did not like the use of the pseudocleft, which seemed unnecessarily wordy out of context.

(27) a. They AFFIRMED, but we DENIED, Mary would tell the truth.
     b. They AFFIRMED, but we DENIED, that Mary would tell the truth. (Bošković and Lasnik 2003)

Here complementizer omission is alleged to be blocked by right node raising (RNR). But (27b) was sometimes rated low because of an apparent tense mismatch: subjects were expecting Mary told the truth; other subjects rated it low because they thought affirmed was an odd verb to use in this context. (In SSA, following the original authors, clauses were conjoined with and rather than but and there were no commas: They suspected and we believed Peter would visit the hospital†. But such strings do not force a RNR parse, given that suspect is compatible with null complement anaphora: subjects could have parsed them as [They suspected Ø] and [we believed (that) Peter would visit the hospital].)

(28) a. MARY believed Tommy drank his milk, and JANE he ate his vegetables.
     b. MARY believed that Tommy drank his milk, and JANE that he ate his vegetables. (Bošković and Lasnik 2003)


In this instance, gapping of believed is claimed to preclude omission of the following complementizer.²⁴ One subject initially skipped he while reading (28a), which created a perfectly grammatical string yielding a high rating. (In SSA, as in the original article, there was no comma and there was a proper name in place of he: Mary believed Peter finished school and Bill Peter got a job†. Given that many first names in English can also be last names, such strings could be parsed without gapping (which requires highly marked prosody): Mary believed [[Peter finished school] and [[DP Bill Peter] got a job]], which is why I made the changes.)

²⁴ The absence of the complementizer from the first conjunct in (28a) introduces a potentially confounding difference from (b), but I suspect that Bošković and Lasnik (2003) felt that (28a) would degrade because of an asymmetry between its first and second conjunct if the first complementizer had been overt.

Head–complement intervention

(29) a. What did the children say at that time that the band played?
     b. At that time, what did the children say that the band played? (Bošković and Lasnik 2003)

Example (29a) was meant to be parsed as What_i did the children say [PP at that time][CP that the band played t_i]?, with an adjunct intervening between the verb and its complement. This might be bad if that adjunct forces the embedded CP to be extraposed, since subsequent extraction from it might be blocked (e.g. by Freezing). However, subjects found a different parse, namely What_i did the children say t_i [PP at that time [CP OP_j that the band played t_j]], taking the CP as a relative clause, which unsurprisingly got high ratings. (This problem was introduced by SSA: Bošković and Lasnik’s original example had an obligatorily transitive verb in place of played, but only half of SSA’s tokens preserved that property.)

(30) a. The video showed definitively the suspect to be in the kitchen.
     b. The video showed the suspect definitively to be in the kitchen. (López 2001)

(31) a. The lawyers proved decisively the defendant to be innocent.
     b. The lawyers proved the defendant decisively to be innocent.

(32) a. We proclaimed in the newspaper Ralph to be generous.
     b. We proclaimed Ralph in the newspaper to be generous.

These are exceptional case-marking (ECM) structures containing an expression that is meant to modify the matrix verb, which in the (a) versions intervenes between that verb and the embedded subject, violating the adjacency requirement on Case assignment. The (b) versions are supposed to be acceptable because the embedded subject has raised to a matrix object Case position. Two subjects gave (30b) very low ratings because they were unfamiliar with the word definitively. One subject rated (31b) “1” because decisively “wasn’t too relevant . . . it adds so little, and it’s interrupting so much, that it doesn’t have to be there.” A FC subject pronounced (32b) as if Ralph in the newspaper were a DP; since no such parse is available for (32a), this subject was comparing apples and oranges. (Unfortunately, all these problems arose from attempts to generalize beyond López’s original example, where the intervenor was a PP argument.)

Superiority

(33) a. I know that the teacher bought some gifts and I know that she plans to give them to her favorite students, but I don’t know to whom she will give what.
     b. I know that the teacher bought some gifts and I know that she plans to give them to her favorite students, but I don’t know what she will give to whom. (Richards 2004)

Such pairs were meant to show the effect of Superiority: wh-moving the theme while the goal-PP stays in situ should be better than wh-moving the goal while the theme stays in situ, if the theme is higher than the goal. But many subjects rated both sentence types low simply because they did not like the word whom, as opposed to who. (However, other subjects might well have disliked the choice of who; future studies need to include items with both variants.) On the other hand, several other subjects rated the (a) versions “7,” potentially raising a challenge to the empirical generalization. However, SSA’s and Richards’s original examples were matrix questions (To whom did you give what?†); I worried that out of the blue such multiple wh-questions would be so odd for naïve subjects that the results would be meaningless, so I decided to add contextualizing preambles.
Unfortunately, it is possible that, in so doing, I allowed for whom and what to receive (quasi-)D-linked interpretations (which students/gifts), which are known to evade Superiority.

A-movement

(34) a. There is unlikely a fleet of enemy ships to appear.
     b. There is unlikely to appear a fleet of enemy ships.

(35) a. There is likely a bookshelf to stand against the wall.
     b. There is likely to stand a bookshelf against the wall.

(Hazout 2004)


These examples challenge the suggestion that the infinitival complement to a raising predicate has an EPP feature that could drive overt movement to its Spec-TP, deriving the (a) order from the (b) order.²⁵ But it should be noted that the base (b) orders are themselves rather marked out of context (Hazout’s original was even more so: There is likely to appear a man†). Sentences such as (34b) were rated low by many subjects, who would have preferred them to be worded differently, for example “A fleet of enemy ships is unlikely to appear,” or “There is a fleet of enemy ships that is unlikely to appear,” or “It is unlikely for a fleet of enemy ships to appear.” In the FC task, items such as (35) yielded responses that highlighted the possibility of an unintended parse for the (a) version, involving a purposive infinitival adjunct and a non-raising use of likely, paraphraseable as There is probably a bookshelf (meant) for us to stand against the wall. No such parse is available for the (b) variant, so such responses were apples-to-oranges comparisons. (Use of unlikely for likely could have avoided this issue.)

²⁵ Even if it does, examples like (34a) would be expected to be well formed only if the DP surfacing there can get Case via its relationship with the upstairs expletive there.

Pro-forms

(36) a. The politicians said that we should use less gas, but the actual doing of so has proved very challenging.
     b. The politicians said that we should use less gas, but the actual doing of it has proved very challenging. (Haddican 2007)

(37) a. Alex said we should take Sunset Blvd, but the actual doing of so was slowed down by heavy traffic.
     b. Alex said we should take Sunset Blvd, but the actual doing of it was slowed down by heavy traffic.

The claim here was that the doing of it is fine while the doing of so is degraded. However, one subject rated (36b) rather low because proved should have been proven. Another subject rated (37b) low because they wanted to replace doing of it with driving. (Both of these “irrelevant” responses might have been avoided in the FC task, where proved and doing of would have appeared in both alternatives.)

11.5 Conclusions and future directions

The general conclusion from the results presented here should be obvious: judgments collected from naïve subjects in computer-based acceptability experiments

       


cannot be taken at face value, given the way in which virtually all such experiments are currently conducted. We need to know the reasons behind subjects’ responses in order to assess whether they are germane to the linguistic question that the judgments are meant to address, and most of the existing studies make no attempt to ascertain those reasons. Although this research stemmed from attempts to confirm linguists’ published judgments, I believe the conclusion is not restricted to such studies: we should never underestimate subjects’ creativity in finding ways of looking at sentences that would not have occurred to us, or in being bothered by aspects of sentences that we find mundane. This is not to deny that there have been judgment experiments focused on narrow theoretical questions where the construction of stimuli was more constrained and careful than in the experiments presented here, allowing concomitantly greater confidence in the results. But the only means of fully dispelling concerns of the sort I have raised is to run a version of the experiment in which subjects’ reactions can be probed in the ways I have suggested (and perhaps in additional ways I have not thought of), seeking to ascertain what structure + interpretation they were judging and, unless the sentence was found unobjectionable, what they disliked about it. We may think of this as “pilot” work, but I would encourage reporting the results of such work, at least for the final version of stimuli. (In the ideal case, this could be done in a single sentence, say this one: “In open-ended discussions, the pilot subjects never reported irrelevant parses/interpretations and never offered irrelevant reasons for judging the sentences unacceptable.”²⁶) Then we can be (more) confident that the results of a large-scale acceptability questionnaire administered by computer bear on the linguistic issues we intend them to.

²⁶ I do not mean to suggest that we generally expect subjects to be able to articulate what is wrong with unacceptable sentences. But, from conducting the interviews described above, I believe we can interpret their judgments as being relevant (“for the right reasons”), for example on the basis of asking which portion of the sentence they think sounds wrong/strange/etc., how they would change the sentence to make it sound better, and, in cases where just a single (purportedly) bad sentence has been presented, subsequently asking for a judgment on the okay counterpart. Of course, the experimenter conducting such an interview must have extensive linguistic background; this is not a task for a first-year research assistant.

A couple of more specific methodological conclusions suggest themselves. I was pleasantly surprised at how much information could be gleaned simply by having a subject read a stimulus sentence out loud: I highly recommend not skipping this seemingly trivial step. Also, I believe I have demonstrated the value of presenting stimuli as minimal pairs rather than singletons (which, as noted, does not preclude eliciting LS or ME ratings of each sentence): the problems that this solves seem to greatly outnumber the few it may occasionally create. (Of course, all results should eventually be replicated using multiple methods.) As for the use of crowdsourcing platforms such as AMT, the results I have presented should invite the field to think about how we can enrich our interactions with subjects via these platforms. If we are willing to allow more time (and money) to be spent gathering each data point, we can be more expansive both in terms of the information we send (we could provide greater guidance on how to read the sentence, for instance by embedding it in a context)²⁷ and in terms of the information we receive (we could elicit not just numerical ratings or rankings but also qualitative feedback). Finally, it should be clear that cutting and pasting examples from linguistics articles directly into experimental stimuli will generally yield uninterpretable results. Naïve subjects need stimuli that are exquisitely crafted, controlled, normed, and piloted: this is no less true for testing the empirical claims of linguistic theory than it is for testing hypotheses about language processing.
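The pilot check recommended in this section can be made concrete with a small sketch. The Python fragment below is only an illustration under stated assumptions: it supposes that, for each interviewed subject and item, the experimenter has coded whether the subject judged the intended structure and interpretation and, where the sentence was found objectionable, whether the stated reason bears on the linguistic question. The record type `PilotResponse`, its fields, the function `items_needing_revision`, and the example records are hypothetical and are not part of the studies reported here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PilotResponse:
    """One coded interview response (hypothetical coding scheme)."""
    item_id: str
    intended_parse: bool             # subject judged the intended structure + interpretation
    relevant_reason: Optional[bool]  # None if the subject found nothing objectionable


def items_needing_revision(responses: List[PilotResponse]) -> List[str]:
    """Return items for which any subject reported an unintended parse
    or gave a reason for disliking the sentence that is irrelevant."""
    flagged = set()
    for r in responses:
        if not r.intended_parse or r.relevant_reason is False:
            flagged.add(r.item_id)
    return sorted(flagged)


# Invented records, for illustration only.
responses = [
    PilotResponse("item_01", intended_parse=True, relevant_reason=None),
    PilotResponse("item_01", intended_parse=True, relevant_reason=True),
    PilotResponse("item_02", intended_parse=False, relevant_reason=None),
    PilotResponse("item_03", intended_parse=True, relevant_reason=False),
]
print(items_needing_revision(responses))
```

An empty list would correspond to the one-sentence report envisaged above; any flagged item would be a candidate for revision and another round of piloting. The coding itself still depends on the interviewer's judgment, which this sketch takes as given.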

Acknowledgments

These experiments would not have been possible without the help of undergraduate research assistant Ethan Chavez, who was heavily involved in all the aspects of the design, in addition to running most of the subjects. Thanks to Jon Sprouse, Jesse Harris, and Colin Wilson for discussions about these experiments, to an audience at UConn for feedback, and to Henry Tehrani for the implementation of Experiment 3. Thanks to Sam Schindler for his patience and encouragement, and to non-anonymous reviewer Ted Gibson for feedback on an earlier draft. This research was supported by a UCLA Academic Senate COR grant.

²⁷ Ted Gibson, one of the two reviewers of this chapter, suggests that “all of the problems” that I observe in this chapter “could be solved by providing an adequate context.” I am not so optimistic: I find it unlikely that contexts will reduce the propensity to skip over short words, to give low ratings on the basis of unfamiliar/rare words, etc. More importantly, how could we determine which contexts are “adequate” for removing confounds, if not through something like the interview method I have described? (If I reran the SSA study with contexts and found 100 percent convergence, which reaction would be more likely to come from skeptics who share the concerns of Gibson and colleagues: (a) “Great news, linguists have perfectly reliable judgments after all!” or (b) “I want to understand how those contexts are affecting the way subjects interpret the target sentences”?)

12 A user’s view of the validity of acceptability judgments as evidence for syntactic theories

Jon Sprouse

12.1 Introduction

Acceptability judgments are ubiquitous in syntactic research, but their use is not without controversy. For some researchers (typically generative syntacticians), acceptability judgments often serve as the primary evidence for the construction of theories. But for other researchers (often language researchers outside of the generative syntax tradition), acceptability judgments are used as pilot data at best, because acceptability judgments are not viewed as valid data for the construction of syntactic theories. At an abstract level, empirical variety is a virtue: researchers should be free to choose the data type that most appropriately addresses their question of interest. But at a practical level, the fact that there is an empirical split between two otherwise closely related groups of researchers (often working on identical or substantially related questions) is stifling for the field. It is clear that it will take work along multiple parallel lines to resolve this division. The present volume represents an important step in that direction. In my chapter I would like to contribute to this endeavor by attempting to lay out my view, as a user of acceptability judgments, of the validity of acceptability judgments as explicitly as I can. My hope is that, by making this view explicit and by discussing the areas where we do and do not have empirical results to support it, this chapter, like the others in this volume, can help facilitate future discussions about, and perhaps even future studies of, the validity of acceptability judgments.

For organizational purposes, I will divide the primary discussion in this chapter into three components. The first is a discussion of the theory underlying acceptability judgments: what their theoretical purpose is, what their (cognitive) source is, how they are used (logically) in theory construction, and how their success (or failure) is evaluated within generative syntax.

Jon Sprouse, A user’s view of the validity of acceptability judgments as evidence for syntactic theories. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Jon Sprouse. DOI: 10.1093/oso/9780198840558.003.0012

My hope is (i) to show that users of judgments often have a theory of judgments that would plausibly yield data that
are as valid as other data types in language research, and (ii) to provide a foil for discussions of alternative theories of judgments. The second component is a discussion of our current state of knowledge about the empirical properties of acceptability judgments: their reliability, their sensitivity, and their susceptibility to theoretical bias. My hope is to show why it is that users of judgments often believe that judgments show the empirical properties one would expect of a useful data type—namely reliability and sensitivity that rival (or surpass) those of other data types in language research. The third component is a discussion of the practical question facing generative syntacticians: should we continue to use acceptability judgments, knowing that some language researchers might not accept them as valid evidence (and therefore might not accept the resulting theories), or should we adopt other methods that other researchers appear to already accept as valid? My take on this is that we do not yet have the systematic evidence that we need to make that assessment definitively, as neither group of researchers has done the work that this would require. With the little evidence that we do have, my impression is that there is no scientific reason to prefer acceptability judgments over other sentence-processing methods like reading times, eye movements, or scalp voltages, but that there is some evidence that judgments might be preferred for practical reasons such as that they yield less variability than other measures for the phenomena of interest to syntacticians. Before beginning the discussion, two quick disclaimers may be appropriate. The first is that there are at least two questions in the literature about the validity of acceptability judgments: (i) the general question of whether judgments are a valid data type, and (ii) the more specific question of how best to collect acceptability judgments. 
I will focus on the general question of the validity of acceptability judgments in this chapter, because I believe that this is the more pressing challenge. There is no chance for collaboration between linguists and other cognitive scientists if those cognitive scientists do not believe that the data underlying the theory are valid. I will not have much to say about the nitty-gritty details of judgment collection (even though much of the empirical evidence that I will discuss is also relevant for questions about the validity of traditional judgment methods, and has appeared in journal articles focused on that question). My thoughts on this topic are relatively prosaic: I think that researchers should be free to use the method that is most appropriate for their specific research question, which entails making scientific judgments about the various factors that influence data quality. The second disclaimer is that, as the title suggests, this chapter skews more toward an opinion piece than a typical research article. I do not attempt to accurately represent anyone else’s position or anyone else’s interpretation of the empirical results, only my own. My hope is that this will help facilitate future discussions of the validity of acceptability judgments, perhaps by spurring others to make their opinions equally explicit, or by identifying areas where additional empirical studies might be valuable.

 ’       

217

12.2 A theory of acceptability judgments

In this section I will attempt to lay out an explicit theory of acceptability judgments from the perspective of a judgment user, drawing heavily on previous work (e.g. Schütze 1996; Cowart 1997) and my own impressions from the field. I will divide the theory into four components (which are by no means exhaustive): (i) a statement of the goal of syntactic theory, (ii) a proposal for the source of acceptability judgments, (iii) a discussion of the logic that is used to convert judgments into evidence for syntactic theories, and (iv) a discussion of the criteria that are used to evaluate the success or failure of judgments. To my mind, laying this out makes it clear that syntacticians have a theory of acceptability judgments that is at least as well worked out and as plausibly valid as the theories underlying other data types in language research.

12.2.1 What is the goal of syntactic theory? Generative syntacticians use acceptability judgments to build syntactic theories. The theory of judgments should connect to this goal directly. To my mind, there are two fundamental assumptions driving generative syntactic theory: (i) that there is an underlying combinatorial math to human syntax, and (ii) that this math is (at some level) a description of a cognitive ability. I take the goal of syntactic theory to be the specification of that underlying math in a way that can (one day) be integrated into broader theories of language as a cognitive ability (a theory of language acquisition, a theory of language processing, a theory of language use, etc.). Chomsky (1957) argued that, to study this combinatorial math, syntacticians must first divide word strings into those that are possible in the language and those that are impossible. Chomsky assumed that the underlying math yielded a binary classification, or two discrete sets of word strings (grammatical and ungrammatical). We now know that this is an open empirical question; it is possible that the underlying math could yield more than two sets, or fuzzy membership in two or more sets, or even a truly continuous spectrum of word strings. But the fundamental goal remains the same: to classify word strings in some way, so that syntacticians can investigate the properties of these word strings that are relevant to the underlying combinatorial math. Chomsky (1957) suggested that acceptability judgments might be a good method (potentially among many) for making this classification.

12.2.2 What is the cognitive source of acceptability judgments?

If pushed to give a one-sentence definition of acceptability judgments, I would probably say something like this: acceptability judgments are the conscious report
of the perception of an error signal that arises automatically during the processing of a sentence. There are a number of important claims in that definition. The first is that there is an error signal that arises during sentence processing. I think all syntacticians assume that there are multiple factors that impact that error signal— grammar (phonology, morphology, syntax, semantics, pragmatics, etc.), language processing (parsing strategies, working memory, predictive processes), real-world knowledge (e.g. plausibility), task effects, and so on. I think syntacticians sometimes assume that there is a single unitary error signal that is itself a composite of the multiple factors that influence it, but as Colin Phillips once pointed out, that claim has never been tested empirically (personal communication). It is possible that speakers can distinguish different sources of errors systematically. Nothing in the way in which judgments are currently used hinges on this assumption, as far as I can tell. The second claim is that the error signal is generated automatically—it cannot be consciously disengaged. I believe this is a critical assumption for most syntacticians. If the error signal were consciously driven, like a learned skill, I believe that syntacticians would be less inclined to use judgments as the foundation of syntactic theory. I know of no explicit research into the automaticity of the error signal underlying judgments. One paradigm that has been used in the event-related potential (ERP) literature to investigate automaticity involves repeating a condition over and over, to see whether the response is suppressed (suggesting that it is under some amount of control, though it is not clear whether this control is conscious or not) or whether it persists (suggesting that it is automatic). 
Hahne and Friederici (1999) showed that repeated exposure to one type of ungrammatical sentence in German leads to the suppression of the P600 response, suggesting that it is controlled, but to no suppression of the ELAN response, suggesting that it is automatic. The judgment satiation literature could potentially be viewed as analogous, but I do not think that the analogy carries through. In judgment satiation the judgments change after repeated exposure (typically, they increase in acceptability); but they are not suppressed. Judgment satiation seems more consistent with the idea that one of the components that contribute to the error signal has changed (perhaps even the syntactic component) than with the idea that the error signal itself has been suppressed. The third claim is that judgments are a conscious report of a perception (of the error signal). I believe that this is also a critical assumption for syntacticians. Syntacticians, like most cognitive scientists, reject introspection in the Wundtian sense—we do not believe that humans have conscious access to cognitive mechanisms, therefore the claim is not that judgments are a direct report of the syntactic representations or syntactic mechanisms underlying language. Instead syntacticians, like most other cognitive scientists, believe that humans have conscious access to percepts (such as the brightness of light). If participants have conscious access to percepts, then it seems reasonable to ask participants

 ’       

219

to report their perceptions, as long as appropriate methodological controls are in place (section 12.3). Schütze (1996) has a terrific discussion of this issue in his seminal book on the topic of acceptability judgments. In his discussion, he distinguishes an introspection-based definition of judgments from a perceptionbased definition by formulating the research questions that each would answer: Introspection: What must be in the minds of participants for the sentence to have the [syntactic] status that they claim it has? Perception:

What must be in the minds of participants in order for them to react this way to the sentence?

In my experience, the second formulation (perception) more accurately reflects the research questions that syntacticians attempt to answer. However, I do see why some language researchers get the impression that syntacticians are attempting to use the first formulation (introspection). I believe this is an illusion that arises because syntacticians tend to assume a relatively direct mapping between syntactic well-formedness and acceptability judgments. I believe that this is partly due to the assumption that syntactic well-formedness has relatively large effects on acceptability (while other factors have smaller effects; see sections 12.3 and 12.4), and partly due to the fact that syntacticians use experimental logic to control for effects from other factors and thereby isolate the effect of syntactic well-formedness, a topic I turn to presently.

12.2.3 What is the logic that is used to convert acceptability judgments into evidence?

Syntacticians use the same logic that all cognitive scientists use: experimental logic. This is because syntacticians are interested in establishing a causal relationship between one or more syntactic factors and the resulting acceptability judgments. The simplest case of experimental logic is the minimal pair; for syntax, that would be a pair of sentences that share all possible (judgment-affecting) properties except one: the syntactic property of interest. More complex cases, such as multifactorial designs, can be used when it is impossible to hold all the factors that might influence judgments constant, or when the syntactician is interested in quantifying the effects of multiple properties and their interactions simultaneously. In this way, the only difference between syntax and other domains of cognitive science is in the mechanisms of interest and in the data types used to investigate them. I am aware that the presentation of acceptability judgments in the syntax literature is not always transparent about the use of experimental logic. At times
there are sentences that appear in isolation, as if they are not part of any experimental logic. I think there are (at least) two ways in which this occurs. The first is through an implicit use of experimental logic. Though the target sentence appears in isolation, there are, typically, one or more other conditions implicated in the claim being made. These other sentences may occur explicitly in earlier passages of the text (due to the flow of the scientific narrative), or may be implicit, the author assuming that other syntacticians can generate the relevant conditions from the theory itself. I admit that the style of presentation in syntactic articles can be opaque in this way. This may be a consequence of the sheer scope of syntax articles: each article typically contains a large number of data points and explores a relatively large number of hypotheses. This in turn may be a consequence of the ease with which judgments can be collected, making it feasible to test a large number of hypotheses in one project. The second way in which sentences can appear in isolation is when they are not acting as evidence in support of a syntactic theory. For example, judgments can be used as a criterion to identify constructions that might warrant additional study, perhaps because they are below a certain threshold in acceptability. Though identifying potential phenomena is undoubtedly a critical part of the scientific enterprise, it is logically distinct from gathering evidence in support of a theory. Once a sentence is identified as warranting additional study, syntacticians will invariably use experimental logic to identify the cause of the low acceptability, either explicitly or implicitly. I have, in conversation but not in print, encountered syntacticians who wonder to what extent there is a logical connection between the type of syntactic theory that one assumes and the type of evidence that one is able to use. 
The specific question seems to be whether binary syntactic theories, which divide strings into two types (grammatical and ungrammatical), might be able to use judgments of single sentences, whereas gradient theories of syntax, which divide strings into more than two types (either some finite number, or an infinite number), might somehow be more amenable to the comparison of multiple sentences. I do not see how binary syntactic theories can make use of the acceptability of standalone sentences (as previously discussed); nor do I see how gradient syntactic theories could make better use of experimental logic than binary syntactic theories. The question of whether syntactic theory distinguishes two or more types of strings appears to me to be orthogonal to the question of how to build causal theories from acceptability judgments. It is of course true that the two types of theories have different mechanisms available to them to explain the fact that acceptability itself is a continuous measure: binary theories must rely on extrasyntactic factors to explain the gradience of acceptability, whereas gradient theories can explain the gradience directly with syntactic mechanisms (in addition to extrasyntactic factors). But, again, this empirical issue appears to be orthogonal to the question of the logic that syntacticians can use to identify those mechanisms. To my mind, that is always experimental logic.
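
The experimental logic described in this section can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the ratings are invented z-scores, and the condition labels and function names are hypothetical rather than taken from any study. The sketch computes the effect of one factor within a minimal pair and, for a 2 x 2 design, a differences-in-differences score that captures the interaction between two factors.

```python
from statistics import mean

# Hypothetical z-scored ratings for a 2 x 2 design (illustrative values only).
# Factor A: the syntactic property of interest (absent / present).
# Factor B: a second property that cannot be held constant (short / long).
ratings = {
    ("absent", "short"): [1.0, 0.5, 0.75],
    ("absent", "long"): [0.5, 0.25, 0.0],
    ("present", "short"): [0.75, 0.5, 0.25],
    ("present", "long"): [-1.0, -0.75, -1.25],
}


def condition_means(data):
    """Average the ratings within each condition."""
    return {cond: mean(values) for cond, values in data.items()}


def minimal_pair_effect(means, level_b):
    """Effect of factor A with factor B held constant at level_b."""
    return means[("absent", level_b)] - means[("present", level_b)]


def differences_in_differences(means):
    """Interaction: how much the effect of A changes across the levels of B."""
    return minimal_pair_effect(means, "long") - minimal_pair_effect(means, "short")


means = condition_means(ratings)
print(minimal_pair_effect(means, "short"))
print(minimal_pair_effect(means, "long"))
print(differences_in_differences(means))
```

With these invented numbers the effect of factor A is larger when factor B is at its "long" level, so the differences-in-differences score is positive. The same subtraction logic applies whether the underlying syntactic theory is binary or gradient.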

 ’       

221

12.2.4 What are the criteria for evaluating the success (or failure) of acceptability judgments? The question of how to evaluate the success (or failure) of acceptability judgments is intimately tied to (if not identical to) the question of validity from psychometrics. A test is valid if it measures what it is intended to measure. Though validity is rarely discussed explicitly in the syntax literature, it is my impression that syntacticians do critically evaluate the success of acceptability judgments as a data type. It is also my impression that syntacticians evaluate acceptability judgments using the same criteria that other language researchers use for other data types. These criteria are indirect. It is not currently possible to directly measure the error signal that gives rise to acceptability judgments, just as it is not possible to measure the processing mechanisms that underlie reading times or the neural computations that underlie scalp potentials. No data type in language science is held to that kind of direct validity requirement (and, in fact, it would defeat the purpose of having these data types, since at that point we could just measure the underlying cognitive mechanism directly). Instead, the criteria that syntacticians and other language researchers use build on the logic that a valid data type will have certain properties, and invalid data types will not. In this section I will mention three criteria that syntacticians appear to use to evaluate the validity of acceptability judgments. The first, and perhaps most important, criterion is that acceptability judgments can be used to create internally consistent syntactic theories. Syntactic theories make predictions about one phenomenon on the basis of the mechanisms proposed for another phenomenon. They succeed in explaining multiple phenomena with a relatively small number of theoretical constructs. They make predictions about the space of cross-linguistic variation. 
They interact with other domains of language such as phonology, morphology, semantics, acquisition, and sentence processing in exactly the way that one would expect of a syntactic theory. We would not expect this kind of internal consistency if syntactic theories were built on random or unrelated data. This is not to say that there are not debates about the details of syntactic theories. And this is not to say that the theories constructed from acceptability judgments are necessarily theories of syntax; it is logically possible that all these properties hold in a world where acceptability judgments do not provide evidence about syntax at all, but rather provide evidence about some other cognitive system. Nonetheless, these properties increase the probability that acceptability judgments are providing meaningful information about syntax. It is my impression that this is precisely the same criterion used for other measures in language science. Reading times, eye movements, and scalp potentials are considered valid measures of comprehension processes because the resulting theory has the properties we would expect of a theory of comprehension processes.


A second criterion is the fact that acceptability judgments generally correlate with other measures, such as sentence-processing measures, corpus/production measures, and language acquisition measures. To be clear, there is a nuance to this correlation. It is not the case that syntacticians believe that acceptability judgments perfectly correlate with these other measures (otherwise, syntacticians could simply use these other measures directly; or psycholinguists and acquisitionists could use acceptability judgments directly). Nor do syntacticians claim that the relationship between acceptability judgments and these measures will be simple. Nonetheless, there is quite a bit of evidence spanning these literatures that there is a general correlation. As Phillips (2009) points out, we can even see it in the types of research that the field publishes. It is not a publishable research result to find that a sentence with low acceptability also yields a reading time effect or an ERP effect, or has low frequency in a corpus. The publishable result is when we find the opposite: that a sentence has low acceptability but no sentence-processing effects or high frequency. This suggests that the field has accepted the general correlation between acceptability judgments and other measures as the most probable configuration of the language faculty. Of course, given this general correlation, one could ask whether there would be value in syntacticians adopting other data types in addition to, or perhaps instead of, acceptability judgments. That is a question that I will turn to in section 12.4. There is a third criterion that is worth mentioning, if for no other reason than that it receives relatively little discussion in the language research literature, and that is face validity. Face validity is when a measure appears, on its face, to measure the property of interest. 
Face validity is probably one of the weaker criteria for validity, as one could imagine measures with high face validity that are not ultimately valid, and measures with low face validity that are ultimately valid. But face validity is also a core component of most measures in language research, because researchers typically invent tasks that seem as though they will give the information that the researcher wants. Sometimes the task works, sometimes it does not. The evaluation of the task usually involves other criteria, but the first criterion is almost always face validity. As such, it is perhaps unsurprising that most measures in language research, including acceptability judgments, have high face validity.

12.3 The empirical properties of acceptability judgments

Whereas the previous section focused on the fundamental assumptions that form the theory of acceptability judgments, which are typically only testable indirectly, this section focuses on three empirical properties that we would expect from a valid data type that are directly testable: reliability, independence of bias, and sensitivity.

 ’       

223

12.3.1 The reliability of acceptability judgments Reliability is the propensity of a measure to yield the same results when repeated under the same conditions. One of the most frequent empirical questions about acceptability judgments raised in the literature is to what extent they are reliable. I take this to indicate that most language researchers expect relatively high reliability from measures. The specific concern about the reliability of acceptability judgments is that the relatively informal collection methods that are typical in syntax might lead to unreliability, because informal methods may be contaminated by confounds of various sorts: theoretical bias, if professional linguists are used as participants; an outsized contribution of specific lexical items, if relatively few tokens of each condition are tested; distortion of judgments, if there are no fillers to mask the theoretical goal of the experiment; or misinterpretation of the results, if inferential statistics are not used as part of the data analysis (e.g. Edelman and Christiansen 2003; Ferreira 2005; Wasow and Arnold 2005; Featherston 2007; Gibson and Fedorenko 2013; Branigan and Pickering 2017). These concerns are also often accompanied by the claim that the more formal collection methods typically used in psycholinguistics would not suffer from such confounds, and would therefore lead to higher reliability. As such, this concern is not about the reliability of judgments in general, but rather about the reliability of the set of informally collected judgments that have been published in the literature (and, by extension, about the theories that have been constructed from those judgments). There has been quite a bit of recent work exploring the reliability of acceptability judgments, at least in English. Two of my own studies, Sprouse and Almeida (2012a) and Sprouse et al. 
(2013), investigate the impact of informal methods of collection by re-testing two large sets of informally collected judgments through formal methods. Sprouse and Almeida (2012a) re-tested all the data points from Adger’s (2003) syntax textbook Core Syntax and found a replication rate of 98–100 percent, depending on the definition of replication. Sprouse et al. (2013) re-tested a random sample of 300 data points that form 150 two-condition phenomena taken from the journal Linguistic Inquiry (2001–10), and found a replication rate of 88–99 percent, depending on the judgment task and the definition of replication, with a margin of error of 5 percentage points for the full population of data points in the journal over that time period (because the sample was random). The Linguistic Inquiry results were replicated directly in Sprouse et al. (2013), then replicated again in Mahowald et al. (2016), using a different sample of data points from the same time period of the journal. The interpretation of these replication rates is subjective. Speaking for myself, I find these replication rates to be exceedingly high (compare Open Science Collaboration 2015 for estimated replication rates in other areas of experimental psychology: these rates are in the range of 36–53 percent, using similar definitions of replication).


Therefore, to my mind, these results suggest that the differences between informal and formal judgment collection methods have relatively little impact on the resulting acceptability judgments; and this in turn suggests that acceptability judgments are remarkably reliable, at least for English. Sprouse et al. (2013) also yields information about between-task reliability, since we tested three distinct tasks: a 7-point scale task, the magnitude estimation task, and a two-alternative forced-choice task where participants selected the more acceptable item in a pair of conditions. Figure 12.1 shows the (between-subjects) correlation between the judgments of the 300 individual conditions using the 7-point scale and magnitude estimation tasks. The correlation is nearly perfect. Langsford et al. (2018) tested both between-task reliability and within-participant (test–re-test) reliability using a subset of the materials from Sprouse et al. (2013), adding in a yes–no task and a forced-choice task based on the Thurstone method, which tests random pairs of sentences rather than the theoretically constrained pairs in the Sprouse et al. (2013) test. Langsford et al. report impressively high rates of reliability for both between-task and within-participant reliability. Taken together, all these studies suggest that judgments are likely fundamentally reliable; they appear to be reliable across both informal and formal collection

Figure 12.1. Correlation between acceptability judgments using the 7-point scale (x-axis) and magnitude estimation (y-axis) tasks for the 300 sentence types randomly sampled from Linguistic Inquiry (2001–10), by Sprouse et al. (2013). The ratings are z-score-transformed (r = .97, slope = .95).

 ’       

225

methods, across samples of participants and items, across judgment tasks, and across time within the same participants. The primary limitation of these findings is that they have focused almost exclusively on English, leaving open the possibility that reliability may vary by language. There are a number of large-scale studies in progress carried out by a number of research teams in other languages that may address this question in the near future.
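
The two computations behind Figure 12.1, a z-score transformation of the ratings and a correlation between tasks, can be sketched as follows. This is a simplified illustration rather than the analysis reported in Sprouse et al. (2013): the ratings are invented, each task is represented by a single hypothetical rater, and the function names are assumptions. In a full analysis the transformation would be applied within each participant before averaging by condition.

```python
from math import sqrt
from statistics import mean, stdev


def z_scores(raw_ratings):
    """Standardize one participant's ratings: subtract their mean, divide by their SD."""
    m, sd = mean(raw_ratings), stdev(raw_ratings)
    return [(r - m) / sd for r in raw_ratings]


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of condition ratings."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical ratings of the same five conditions in two tasks (one rater each).
seven_point = [1, 2, 4, 6, 7]
magnitude_estimation = [20, 35, 90, 150, 210]

z_scale = z_scores(seven_point)
z_me = z_scores(magnitude_estimation)
r = pearson_r(z_scale, z_me)
print(round(r, 3))
```

The z-score transformation puts the two tasks on a common scale, which is what allows them to share the axes of a single plot; it does not change the correlation itself, since Pearson's r is unaffected by linear rescaling.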

12.3.2 Theoretical bias in acceptability judgments One of the potential confounds frequently mentioned in discussions of informal collection methods is the fact that syntacticians often report their own judgments and those of their students and colleagues. Given that linguists have the potential to recognize the experimental manipulation, and perhaps even the hypotheses under consideration, this raises the possibility of theoretical bias contaminating the reported judgments. To be clear, the concern is typically not that linguists will purposefully report judgments that confirm (or disconfirm) their theoretical beliefs, but rather that linguists’ theoretical knowledge will subconsciously influence their judgments. The reliability results reported in the previous subsection put a potential upper bound on the effect of this bias: 0–2 percent for the Adger textbook data set and 1–12 percent for the Linguistic Inquiry data set. We can also look more closely at those results. We can ask what sorts of replication failures we would expect to find if theoretical bias were present in the judgments in the syntax literature. One possible prediction is that we would expect to find sign reversals: a change in direction of the effect, between the informally collected judgments and the formally collected judgments. To be clear, sign reversals can arise for reasons other than theoretical bias (e.g. low statistical power in either the informal or the formal experiments; see Sprouse and Almeida 2017b for a mathematical discussion of this feature). But the prediction here is that theoretical bias could be one generator of sign reversals between experiments that involve professional linguists and experiments involving naïve participants; therefore the presence of sign reversals would be a potential indicator of theoretical bias. 
In Sprouse et al.’s 7-point scale results, there were 2 statistically significant sign reversals, 9 null results, and 137 statistically significant replications (2 of the 150 phenomena were not analyzed because of errors in the experimental materials). In the forced-choice results, there were 3 statistically significant sign reversals, 6 null results, and 139 statistically significant replications. Thus it seems that not only are there relatively few replication failures in the Linguistic Inquiry data set (6–7 percent of the sample), but within those replication failures there are very few sign reversals (1–2 percent of the sample). In short, there is very little evidence for theoretical bias in this data set.
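
The percentages in this paragraph follow directly from the reported counts, as the short calculation below shows. The counts are the ones given in the text for the 148 analyzed phenomena; the dictionary layout and the function name are illustrative assumptions.

```python
# Counts reported in the text for the 148 analyzed phenomena, by task.
outcomes = {
    "7-point scale": {"sign_reversal": 2, "null": 9, "replication": 137},
    "forced-choice": {"sign_reversal": 3, "null": 6, "replication": 139},
}


def percentages(counts):
    """Convert outcome counts to percentages of the analyzed sample."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


for task, counts in outcomes.items():
    pct = percentages(counts)
    failures = pct["sign_reversal"] + pct["null"]
    print(f"{task}: failures {failures:.1f}%, sign reversals {pct['sign_reversal']:.1f}%")
```

Run as written, this gives replication-failure rates of about 7.4 and 6.1 percent and sign-reversal rates of about 1.4 and 2.0 percent, in line with the rounded ranges of 6–7 percent and 1–2 percent given above.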


12.3.3 The sensitivity of acceptability judgments

[Figure 12.2: grid of power curves, one column per task (forced-choice, Likert scale, magnitude estimation, yes–no) and one row per effect size (small, medium, large, extra large); y-axis: mean power (%), x-axis: sample size.]

Another expectation for valid data types is that they will be sensitive to the phenomena that they are intended to measure. Sprouse and Almeida (2017b) provide some information about the sensitivity of acceptability judgments to syntactic phenomena: they estimated the statistical power (the ability to detect an effect when an effect is truly present) of four judgment tasks (7-point scale, magnitude estimation, forced-choice, and yes–no; see the columns of Figure 12.2), for 50 of the phenomena from the Linguistic Inquiry data set, by running 1,000 resampling simulations for each sample size from 5 to 100 participants (the x-axis in Figure 12.2), and by calculating the percentage of results that reached statistical significance in those 1,000 simulations (the y-axis in Figure 12.2). The results are summarized in Figure 12.2, organized by effect size as measured using Cohen’s d, a standardized effect size measure, and classified by Cohen’s (1988) criteria for small (d = .2), medium (d = .5), and large (d = .8) effect sizes (the rows in Figure 12.2). The general trend that emerges in Figure 12.2 is that judgment tasks are remarkably sensitive when it comes to syntactic phenomena. The forced-choice task is the most sensitive, reaching 80 percent power for small effect sizes (using null hypothesis tests) at 30 participants, medium effect sizes at 16 participants, and large effect sizes at 11 participants.

Figure 12.2. Empirically estimated power relationships from Sprouse and Almeida (2017b), arranged by task (columns: forced-choice, Likert scale, magnitude estimation, yes–no) and effect size (rows: small, medium, large, extra large), using null hypothesis tests. The x-axis is sample size (0 to 100) and the y-axis is estimated mean power in percent (based on 1,000 re-sampling simulations for each sample size). The vertical line and number indicate the sample size at which 80% power is first reached.
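The resampling logic behind such power estimates can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of Sprouse and Almeida (2017b): the data are simulated, the function names are hypothetical, and an exact sign test stands in for whichever null hypothesis tests the original study used.

```python
import math
import random
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

def sign_test_p(diffs):
    """Two-sided exact sign test on paired differences (ties are dropped)."""
    nonzero = [x for x in diffs if x != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    k = sum(x > 0 for x in nonzero)
    tail = min(k, n - k)
    p = 2 * sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)

def estimate_power(diffs, sample_size, n_sims=1000, alpha=0.05, rng=None):
    """Percentage of resampled experiments that reach significance."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(n_sims):
        sample = rng.choices(diffs, k=sample_size)  # resample with replacement
        if sign_test_p(sample) < alpha:
            hits += 1
    return 100 * hits / n_sims

# Hypothetical pool of z-scored ratings for two conditions (assumed values).
rng = random.Random(1)
cond_a = [rng.gauss(0.4, 1.0) for _ in range(200)]
cond_b = [rng.gauss(0.0, 1.0) for _ in range(200)]
diffs = [a - b for a, b in zip(cond_a, cond_b)]

print(f"Cohen's d = {cohens_d(cond_a, cond_b):.2f}")
for n in (5, 10, 25, 50, 100):
    print(f"sample size {n}: estimated power {estimate_power(diffs, n, rng=rng):.1f}%")
```

Reading off the smallest sample size at which the estimate first reaches 80 percent gives the kind of number marked by the vertical lines in Figure 12.2.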

As with the results in the previous two sections, the interpretation of these results is subjective. To put these results into context, one possibility would be to compare the sensitivity of acceptability judgments to that of other data types in the broader field of experimental psychology. Unfortunately, the two fields have thus far focused on different measures of statistical power. For acceptability judgments, we have the power curves for specific tasks across a range of effect sizes and sample sizes, but no measure of the power of the published studies in the literature (because sample sizes and tasks are rarely reported for informally collected judgments, making such calculations impossible). In the broader field of psychology, we have measures of the power of the published studies (see Szucs and Ioannidis 2017 for a recent study and a review of previous studies), but we do not have power curves for specific tasks (because there are so many distinct tasks in the field, it would likely be impractical to measure a substantial number of them). For now, we can say two things. The first is that acceptability judgment tasks reach Cohen’s (1988) suggested target level of power of 80 percent for syntactic phenomena with relatively reasonable sample sizes, particularly for effect sizes that are medium or larger. The second is that the syntactic phenomena that have been published in Linguistic Inquiry tend to be medium or larger. As Figure 12.3 shows, 87 percent of the phenomena randomly sampled from Linguistic Inquiry have a Cohen’s d greater than .5. This suggests that, for the vast majority of syntactic phenomena in the current literature, acceptability judgment tasks can reach good power with a reasonable sample size. We do not currently have any systematic information about the sensitivity of judgment tasks to non-syntactic phenomena (processing, frequency, plausibility, task effects, etc.). 
Figure 12.3. The count of phenomena (y-axis) based on standardized effect size (Cohen’s d; x-axis, 0 to 4.5) for the statistically significant replications from the 7-point scale task from Sprouse et al. (2013). The vertical lines represent Cohen’s (1988) suggestions for the interpretation of effect sizes: small (.2), medium (.5), and large (.8).

The critical issue is that we do not know either the size of these effects or the amount of variability in judgments to them. One might wonder why syntacticians, who presumably are not interested in studying non-syntactic phenomena, should care about the sensitivity of judgments to non-syntactic phenomena. My impression is that syntacticians often construct acceptability judgment experiments as if they believed that syntactic properties have a larger effect on judgments, while non-syntactic properties (processing complexity, frequency, etc.) have a smaller effect on judgments. One way this arises is that, when designing their informal judgment studies, syntacticians typically explicitly control for known syntactic factors, and also for factors from other areas of grammatical theory (phonology, morphology, semantics, pragmatics), but do not always control for factors that are more traditionally part of psycholinguistics, such as processing complexity, frequency of words or constructions, and task effects. I do not want to give the impression that syntacticians ignore these issues completely, just that there is a general trend for the two literatures, syntax and psycholinguistics, to focus on potential confounding factors that are more central to their respective theories. My impression is that debates in the literature about how well controlled acceptability judgment experiments are typically hinge on the factors that are being controlled, not on whether control in general is being applied. A systematic study of the size of non-syntactic effects would help to resolve this issue. A second way in which this belief appears to arise in the field is that syntacticians sometimes mention the possibility of using effect size as a heuristic for identifying potential syntactic effects, with larger effect sizes indicating a potential syntactic effect (or perhaps a grammatical effect) and smaller effect sizes indicating a potential language processing effect. I do not believe that syntacticians would use this as evidence toward a theory; but, as a heuristic for identifying phenomena to study in more detail, it is appealing.
We are not in a position to evaluate the viability of this heuristic—we do not have systematic information on the effect sizes of non-syntactic effects. (But we do have evidence that some of the effects that syntacticians appear to care about are small, so the heuristic cannot be used as an absolute criterion.)

12.4 The choice to continue to use acceptability judgments

In this final section I would like to ask a difficult question. At what point should syntacticians decide to abandon acceptability judgments in favor of other data types? As the previous sections have made clear, I do not personally believe that this is necessary. But I can imagine that some language researchers may remain unconvinced, perhaps for empirical reasons, or perhaps for reasons related to their own assumptions about the source of acceptability judgments. Even if syntacticians believe that acceptability judgments are valid, if this disagreement prevents the dissemination of results, or collaboration among researchers from otherwise allied fields, one might ask whether there could be practical value in adopting other methods, either substantially or completely. In this section I would like to take this question seriously. There are two conditions in which it would make sense to switch methods: if there were a scientific reason, or if there were a practical reason.

12.4.1 The scientific question

Can other data types provide the type of information that syntacticians need to construct and evaluate syntactic theories? I think the answer is unequivocally yes. Syntacticians already appear to believe that the automatic error signal that gives rise to acceptability judgments also impacts other comprehension measures such as reading times, eye movements, scalp potentials, and hemodynamic responses. This is part and parcel of the argument for predictive validity: judgments tend to correlate with effects in these other measures. Therefore, in principle, syntacticians could simply use these other measures instead of acceptability judgments. Given that syntacticians likely believe that other measures could provide evidence for syntactic theories, one might wonder whether there are scientific reasons why they haven’t adopted the other measures to a greater degree. The issue, in my opinion, is that these methods will require specifying a relatively detailed processing theory that can be combined with a syntactic theory to make predictions about these other measures. Since syntacticians are not primarily interested in sentence processing theories, it would be more desirable to have a measure that can provide information about syntactic theories without requiring the specification of a detailed processing theory. While I am sympathetic to this issue (see especially Stabler 1991 for a discussion of this, and some caveats about how difficult it may be to derive syntactic predictions from processing theories), I do worry that this concern may be overstating the difference between judgments and other data types in this regard. Judgments do require a processing theory, because judgments are a type of processing data. It only appears that judgments are different because syntacticians do not typically discuss the processing theory explicitly in work using acceptability judgments.
This is possible because judgments provide only one measure, at one time point (typically, at the end of the presentation of the sentence), rather than word-by-word (or millisecond-by-millisecond) measures. But full-sentence processing must still be accounted for in the experimental logic. Syntacticians can infer that the end-of-sentence judgment reflects a syntactic effect, and not some other sentence-processing effect, because they designed the experiment to manipulate syntax and hold other properties constant. Similarly, syntacticians avoid having to specify a precise temporal prediction for the error signal by assuming that error signals that arise at points during sentence processing will impact judgments made at the end of the sentence. The fact that syntactic theories are as successful as they are (section 12.2) and that judgments are as reliable and sensitive as they are (section 12.3) suggests that these assumptions are substantially correct. There appears to be no scientific reason for syntacticians to abandon acceptability judgments completely. Though my personal belief is that there is no scientific reason for syntacticians to abandon judgments, there are certainly scientific reasons for some subset of syntacticians to explore other measures in service of syntactic theory. To my mind, this has never been in doubt: psycholinguists and experimental linguists have been exploring the link between syntax and sentence processing since the earliest days of the field. To be clear, it is not the case that all syntacticians must do this; it seems healthy for the field for different researchers to specialize in different aspects of the syntactic enterprise on the basis of their own personal interests.

12.4.2 The practical question

Do the potential benefits of switching to other data types (in terms of encouraging collaboration across the broader field of language science) outweigh the potential practical costs? I think the answer is that we do not currently have enough information to make a definitive assessment. The little bit of information that we do have suggests that there would be quite a number of practical costs involved in such a switch. I think this helps to explain why syntacticians have generally not switched away from acceptability judgments as a response to concerns about the dissemination of results and the possibility of collaboration. On one side of the equation, we know that judgments are much cheaper and much easier to collect than most (if not all) of the other data types in sentence processing. There is, typically, no need for special equipment or consumables, and it is, typically, possible to investigate a relatively large number of conditions at once. On the other side of the equation, we have some hints that sentence-processing measures such as EEG may ultimately be noisier than judgments when it comes to detecting the effects that syntacticians are interested in. For example, Figure 12.4 compares whether-island effects (*What do you wonder whether Mary read __?) and subject island effects (*What did the advertisement for __ interrupt the game?) using both acceptability judgments and ERPs. For each method, I have plotted the effect for each participant: for judgments, it is the difference in acceptability between the island effect sentence and a control sentence; and, for ERPs, it is the difference in scalp voltage at the critical word (for whether-islands it is the embedded subject, e.g. Mary; and for subject islands it is the verb, e.g. interrupt). The gray points and lines each represent one participant; the black points and lines represent the mean of the participants.
For acceptability judgments, all but a few participants show a clear positive effect. It is quite easy to see the effect, even without calculating a grand mean and without using statistical tests. For ERPs, the situation is quite different. It is not as easy to see a clear effect without calculating a grand mean or without using statistical tests. Crucially, all four samples have roughly the same number of participants. The issue here is an inherent property of these measures: acceptability judgments have relatively less variability, whereas ERPs have relatively more variability (at both the trial and participant level; see Luck 2014 for an introduction to ERPs that describes some of the physiological reasons for this). It may be possible to comb through the sentence-processing literature and extract information about effect sizes and variability for a number of syntactic effects (agreement violations, case violations, phrase structure violations, etc.). However, combing the existing literature will only get us so far, because only a subset of the effects of interest to syntacticians have been tested using sentence-processing methods to date. The obvious next step would be to test a larger number of syntactic effects using sentence-processing measures. Combined with the projects mentioned in previous sections, the end result would be a systematic 2 × 2 investigation of effects and methods: a test of both syntactic and non-syntactic effects using both acceptability judgments and other sentence-processing methods. With that information we may be in a position to definitively assess the cost–benefit ratio of each method for both sets of phenomena. Without that information, we must rely either on the impressions of individual researchers (impressions that are based on the small amount of data we do have) or on the broader evaluation metrics that were discussed in section 12.2.

Figure 12.4. A comparison of by-participant effects for two island effects (whether and subject islands), for both acceptability judgments (left panels) and event-related potentials (right panels). By-participant effects are in gray; grand means are in black. The acceptability judgment effects are jittered roughly according to the probability density of the effects (a sina plot). The boxes in the ERP plots indicate the time window of the significant effect: a negativity for whether-islands (Kluender and Kutas 1993) and a positivity for subject islands (Neville et al. 1991). Panel titles: whether island (N = 28 in both panels) and subject island (N = 24 and N = 31). Axes: island effect (z-scores) for judgments; microvolts at the critical word and the next word for ERPs (electrode labels C3 and Pz).

12.5 Conclusion

My primary goal in this chapter was to explicitly discuss the validity of acceptability judgments from the perspective of a user of acceptability judgments, in the hope that such a discussion might help both syntacticians and other language researchers chart a path forward for investigating the validity of acceptability judgments. Along the way I have tried to make my current personal opinion explicit as well: I believe that acceptability judgments have most, if not all, of the hallmarks of a valid data type. Syntacticians have a plausible theory of the source of acceptability judgments, a theory of how to leverage judgments for the construction of syntactic theories using experimental logic, and a set of evaluation criteria that are similar to those used for other data types in the broader field of psychology. At an empirical level, acceptability judgments have been shown to be relatively reliable across both tasks and participants, to be relatively sensitive (at least to syntactic phenomena), and to be relatively free from theoretical bias. As the facts currently stand, I would argue that acceptability judgments are at least as valid as other data types that are used in the broader field of language science. That said, I have also noted that most of our evidence comes either from subjective evaluations of syntactic theories (section 12.2) or from experimental studies that have focused primarily on English (section 12.3). Therefore it is possible that future evaluations or future experimental studies will challenge these conclusions. I have also argued that there is no scientific reason to prefer acceptability judgments over other data types; therefore the general choice in the field to use judgments over other sentence-processing measures appears to be a purely practical one. Judgments are unquestionably cheaper and easier to deploy, and there is some (admittedly limited) evidence that acceptability judgments involve less noise, and therefore yield larger effect sizes, for syntactic phenomena than other sentence-processing measures do. But we do not yet have the full set of data that we would need to determine the optimal practical choice(s): a systematic (2 × 2) study of both syntactic and non-syntactic phenomena using both acceptability judgments and other sentence-processing methods.

13 Linguistic intuitions and the puzzle of gradience

Jana Häussler and Tom S. Juzek

13.1 Introduction

Syntacticians are interested in the grammatical status of linguistic sequences. To make inferences about a sequence’s grammaticality, syntacticians can choose from a selection of methods. The main non-quantitative method is what we call the process of community agreement, whereby linguists provide their own syntactic judgments and negotiate the status of questionable items until agreement is reached (see Phillips 2009; Linzen and Oseki 2018). One of the main quantitative methods is the experimental acceptability judgment task. In such a task, laypeople provide their judgments in an experimental setting. The judgments of laypeople are judgments of acceptability; and, arguably, the first judgments in the process of community agreement are also judgments of acceptability. However, acceptability judgments are confounded by extragrammatical factors (for an overview of factors, see Schütze 1996 and Sprouse 2013). It then takes careful examination from experts to extract the grammatical status of an item from such acceptability judgments. Here we look at how laypeople and experts alike form their linguistic ratings and how linguists use their expert knowledge in their grammatical reasoning.

The need to distinguish between acceptability and grammaticality has been noted at least since Chomsky (1965: 11) and is discussed in greater detail in Schütze (1996). One of the main differences concerns gradience. Grammaticality is often assumed to be categorical; in other words, an item is either ungrammatical or grammatical (but see Featherston 2005a; Keller 2000; Bresnan 2007; and, more recently, Lau et al. 2017 for proposals of incorporating gradience into the grammar itself).¹ Acceptability, however, often comes in degrees (for a review of corresponding findings, see Sorace and Keller 2005). Most syntacticians assume that this gradience derives exclusively from extra-grammatical factors. More recent experimental results cast doubt on this assumption, though. First, gradience in acceptability is considerably more pervasive than was previously assumed, as Featherston (2005a) has illustrated. Second, as we argue in what follows, because of the way acceptability data is structured, the factors that are commonly used to explain the observed gradience are insufficient to do so. This leads to what we call the puzzle of gradience: if one assumes that grammaticality is categorical, one must explain how gradience in acceptability can arise in the form in which it does (see sections 13.3 and 13.4 for more data and a detailed discussion of the puzzle).

¹ In early work, Chomsky, too, argued that “grammaticalness is, no doubt, a matter of degree” (Chomsky 1965). Even in later work, gradience recurs occasionally. For instance, Chomsky (1986a: 38) discusses the effect size discrepancy between empty category principle violations (strong effect) and subjacency violations (weaker effect) in terms of the number of barriers that are crossed.

Jana Häussler and Tom S. Juzek, Linguistic intuitions and the puzzle of gradience. In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Jana Häussler and Tom S. Juzek. DOI: 10.1093/oso/9780198840558.003.0013

13.1.1 The process of community agreement and acceptability judgment tasks

A common approach to determining the syntactic status of a linguistic sequence is that the investigating linguist queries her or his intuition about that sequence. Arguably, in a first step, the linguist has an intuition about how acceptable an item is. Then, through careful consideration and contrasting, the linguist determines the grammatical status of the linguistic sequence. This method, in particular its first step, is often referred to as introspection in the linguistic literature. But the label is unfortunate, because any judgment, whether by an expert or by a layperson, is an introspective judgment. Even the term researcher introspection is not quite accurate, because the process involves more than just a single researcher querying her or his own intuition. The investigating linguist will check the literature for existing judgments, will ask students and colleagues for their judgments, and will get feedback from reviewers and readers. Typically, this process will continue until the community has reached agreement on the status of a linguistic sequence. Therefore we think that this method is best characterized as a process of community agreement.² For languages that are underrepresented in the community, that is, in situations where only a few colleagues and reviewers are native speakers of the language under investigation, this process is less effective (see Linzen and Oseki 2018 on Hebrew and Japanese).

The process of community agreement is a non-quantitative method, in the sense that one cannot easily apply quantitative statistical tests to it. It is sometimes referred to as an informal method, for several reasons: the number of participants who have contributed to a judgment is often not clear; it is not known with precision how the judgments came about; items are not distributed across lists (that is, a participant might see an item in multiple conditions); no fillers are used; items are not randomized; and, most importantly, the purpose of the study is not concealed. For discussion of these issues, one can consult, among others, Gibson and Fedorenko (2010) and Schütze (1996).

² We wish to thank Gisbert Fanselow and Ash Asudeh for underlining the importance of this point.

Another common method is the acceptability judgment task, in which the linguist collects intuitions from (usually naïve) participants in an experimental setting. The researcher systematically varies one or several factors and registers the effect of this variation, in our case the effect on the acceptability rating. Typically, participants are not informed about the purpose of the study, items are distributed over lists, fillers distract from the purpose of the study, items are randomized, and the number of participants is high enough to permit applying quantitative statistical analyses. Accordingly, acceptability judgment tasks are referred to as a quantitative method or sometimes as a formal method. One can consult for example Cowart (1997), Gibson and Fedorenko (2013), or Schütze and Sprouse (2013) for an in-depth discussion of experimental standards used in acceptability judgment experiments.
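The list construction mentioned here (items distributed over lists, fillers added, order randomized) is commonly implemented as a Latin-square rotation. The sketch below is one hypothetical way to do this; the function name `build_lists`, the item labels, and the condition names are assumptions made for illustration and do not come from any of the cited studies.

```python
import random

def build_lists(n_items, conditions, fillers, seed=0):
    """Latin-square rotation: each list shows every item once, in one condition.

    Across the lists, every item appears in every condition, so no
    participant sees the same item in more than one condition.
    """
    rng = random.Random(seed)
    n_cond = len(conditions)
    lists = []
    for list_idx in range(n_cond):
        trials = [
            (f"item{item:02d}", conditions[(item + list_idx) % n_cond])
            for item in range(n_items)
        ]
        trials += [(name, "filler") for name in fillers]  # fillers mask the design
        rng.shuffle(trials)                               # randomize presentation order
        lists.append(trials)
    return lists

# Hypothetical two-condition design with eight items and four fillers.
lists = build_lists(8, ["island", "control"], ["f1", "f2", "f3", "f4"])
for idx, trials in enumerate(lists, start=1):
    print(f"List {idx}: {trials[:3]} ...")
```

Each participant would then be assigned to exactly one of the resulting lists.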

13.1.2 Acceptability and grammaticality

In order to reduce the influence of potentially confounding factors, it is important that the expert designs any experiment carefully. “Clean” acceptability judgments are essential for making inferences about grammaticality. This brings us to a crucial distinction. The collected judgments are judgments of acceptability, in contrast to judgments of grammaticality. Acceptability intuitions are a joint product of the competence grammar and performance factors. That is, speakers make use of their competence grammar during actual language use. Grammaticality, on the other hand, would be a product of the competence grammar only. However, it cannot be observed directly, because grammar is a mental construct (Schütze 1996; for a recent review of the competence–performance distinction, see Chesi and Moro 2015). The example in (1), from Chomsky and Miller (1963: 286), illustrates the difference between acceptability and grammaticality. A note on diacritics: we use a caret for “unacceptable” and reserve the asterisk for “ungrammatical.” We will come back to the issue of diacritics in section 13.2.4.

(1) ^The rat the cat the dog chased killed ate the malt.

The sequence in (1) does not violate any grammatical rule but appears unacceptable because it is too complex to parse. The ideal speaker-listener, “who knows [the] language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic) in applying his knowledge of the language in actual performance” (Chomsky 1965: 3), would identify the sequence in (1) as a grammatical sentence of English. Real-life speakers, however, when asked about the syntactic status or simply about the naturalness of a sequence such as (1), tend to give low ratings. Then an expert has to make sense out of such acceptability ratings. This is done through grammatical reasoning, that is, by carefully considering the item in question and then contrasting it with related sequences. Acceptability judgment tasks involve a rather clear division of labor: the participants give acceptability judgments and the expert makes inferences about grammaticality from them. This divide is less clear in the process of community agreement. When considering an item, linguists collectively query their introspective judgments and then apply grammatical reasoning to it. This often results in a reflection on both an item’s acceptability status and its grammaticality status.

13.2 Linguistic intuitions

The term intuition is used in several ways (for overviews, see Plakias 2015 and Weinberg 2016). In this chapter, we equate intuition with an inclination to make a judgment. Intuitive judgments arise spontaneously, without any process of conscious reasoning. Taken in this sense, intuitions are immediate and unreflective (see Gopnik and Schwitzgebel 1998; Devitt 2006b). Crucially, we distinguish two kinds of linguistic intuitions: acceptability intuitions and source intuitions.³ The latter concern the source of reduced acceptability (or ameliorated acceptability in the case of linguistic illusions). They are intuitions about why a given linguistic expression is of reduced acceptability, not necessarily in terms of some concrete constraint but in terms of estimating the contribution of grammar versus extragrammatical factors. Both linguists and non-linguists can make intuitive judgments about the goodness of a given linguistic sequence. They both have acceptability intuitions. It is another issue, however, to judge why a given expression is degraded. This requires grammatical reasoning. Linguists may internalize some of this reasoning, forming beliefs about the source of the degradedness of specific constructions. Such beliefs may give rise to spontaneous judgments regarding the source of reduced acceptability. Unlike source intuitions, grammatical reasoning involves conscious reflection on an item.

³ The distinction of acceptability intuitions versus source intuitions is closely related to Wasow and Arnold’s (2005) distinction between primary and secondary intuitions.

13.2.1 Acceptability intuitions

Acceptability judgments do not reflect grammaticality directly. They reflect acceptability intuitions.⁴ A comprehensive model of how acceptability intuitions emerge is still missing (but see Carroll et al. 1981; Gerken and Bever 1986; Schütze 1996; Featherston 2005a; and Luka 2005 for important contributions). Given that informants asked to give an acceptability judgment can do so promptly after hearing a linguistic sequence, it seems reasonable to assume that acceptability intuitions are a by-product of parsing. This idea is captured in the Decathlon model (Featherston 2005a), at least for gradient judgments.⁵ The model relates gradient judgments directly to the grammar; binary judgments, in contrast, are linked to processes of language production, which involve competition.⁶ The Decathlon model assumes that grammar is a set of violable constraints associated with weights. Parsing incrementally outputs structures, which are evaluated by the constraints. Constraint violation results in a penalty the size of which is determined by the weight of the constraint. The constraint application module outputs a form–meaning pair and a numerical score computed as the sum of the violation costs. This score constitutes the acceptability intuition and can be mapped onto a judgment scale when a gradient judgment is required. For production and binary judgments, the output of the constraint application module is sent to a second module, the output selection module. Within this module, alternative syntactic realizations of the message to be conveyed compete for output selection. This competition is probabilistic, and hence occasionally a candidate with a lower acceptability score (higher violation costs) will win. Most of the time, however, the best candidate will win even when its lead over the second best is very small. Binary judgments are basically estimates of the probability of producing or encountering a linguistic sequence.
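As a rough illustration of the two modules just described, the following sketch computes a score as a weighted sum of violation costs and then lets candidates compete probabilistically. It is not Featherston's implementation: the constraint names, the weights, the softmax selection rule, and the `temperature` parameter are all assumptions introduced here, chosen so that the best candidate usually wins even with a small lead.

```python
import math
import random

# Hypothetical constraints and weights, for illustration only.
WEIGHTS = {"agreement": 2.0, "word_order": 1.0, "case": 1.5}

def acceptability_score(violations, weights=WEIGHTS):
    """Constraint application: negative sum of weighted violation costs.

    A score of 0 means no violations; more negative means less acceptable.
    """
    return -sum(weights[name] * count for name, count in violations.items())

def selection_probabilities(scores, temperature=0.1):
    """Output selection as a softmax over scores (an assumed stand-in).

    A low temperature makes the best candidate win most of the time.
    """
    best = max(scores.values())
    exps = {cand: math.exp((s - best) / temperature) for cand, s in scores.items()}
    total = sum(exps.values())
    return {cand: e / total for cand, e in exps.items()}

def select_output(scores, rng, temperature=0.1):
    """Draw one winning candidate according to the selection probabilities."""
    probs = selection_probabilities(scores, temperature)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0]

# Two hypothetical realizations of the same message.
candidates = {
    "variant_a": acceptability_score({"word_order": 1}),
    "variant_b": acceptability_score({"case": 1}),
}
rng = random.Random(0)
wins = sum(select_output(candidates, rng) == "variant_a" for _ in range(1000))
print(candidates, f"variant_a selected in {wins} of 1000 draws")
```

On this reading, a gradient judgment would map the score itself onto a rating scale, whereas a binary judgment would track the selection probability.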
A crucial question to be answered by any model of acceptability intuitions concerns the source of gradience in acceptability judgments. The Decathlon model provides a direct answer: gradient acceptability scores result from violating more or fewer constraints and weaker or stronger constraints. The acceptability score reflects the number and the strength of violations. Processing costs might add to the overall score. The Decathlon model is not very explicit on this point, but processing costs have been shown to affect acceptability in both directions. Grammatical but hard-to-process sentences receive reduced acceptability ratings (for experimental evidence, see Alexopoulou and Keller 2007; Fanselow and Frisch 2006; Gerken and Bever 1986; Hofmeister et al. 2014). As for ungrammatical sentences, processing difficulty can have the opposite effect, as evidenced by the so-called missing-VP effect (Gibson and Thomas 1999). Christiansen and MacDonald (2009) report an acceptance rate of 68 percent for sentences like (2). Sentences like (2) are ungrammatical; nevertheless they are occasionally judged to be acceptable despite the missing predicate for the first relative clause (that the maid _).

(2) *The apartment that the maid who the service had sent over was well decorated.

The Decathlon model assumes a second source of gradience, but explicitly only for binary judgments: frequency of occurrence in relation to alternative realizations. Binary judgments are thought to undergo the same competition process as production. Hence they do not reflect acceptability itself, but rather estimates of production probability. Probabilistic accounts of grammar attribute the gradience in acceptability to gradience in frequency (e.g. Bresnan 2007; Lau et al. 2017). The connection between frequency and acceptability is, however, only weak and asymmetric. While reduced acceptability implies low frequency and high frequency implies high acceptability, the reverse does not hold in either case: low frequency does not imply reduced acceptability and high acceptability does not imply high frequency (Featherston 2005a; Arppe and Järvikivi 2007; Bader and Häussler 2010). The likelihood that a given sentence occurs depends on several factors in addition to its acceptability, for example length and lexical frequency (Lau et al. 2017).

⁴ Unfortunately, the term grammaticality judgment is often used as a synonym for acceptability judgment.

⁵ Following common practice in the field, we use the term gradient as an antonym for categorical/binary. A gradient judgment is hence a judgment on a scale that provides more than two judgment categories.

⁶ A link between acceptability judgments and processes of language production has also been considered by other researchers (e.g. Kempen and Harbusch 2008) but is problematic because of production–judgment discrepancies (cf. Luka 2005). A striking example is that of aphasic speakers who struggle to produce sentences but still perform well on judging their acceptability (Linebarger et al. 1983).

13.2.2 Source intuitions

Since acceptability is a joint product of the competence grammar and factors related to performance, reduced acceptability poses a source ambiguity problem (Hofmeister et al. 2013). As linguists, we are interested in the grammatical status of a given construction. We aim to draw an inference from acceptability to grammaticality. Hence we need to disentangle the contributions of several factors and to isolate the contribution of the competence grammar. On the basis of prior experience and of expectations derived from this experience, linguists may have intuitions about the source(s) of reduced acceptability. Like experienced chemists who can guess the composition of a given substance, linguists can guess the ingredients of an acceptability intuition (see Newmeyer 1983). And, like chemists who subject their guesses to further testing, linguists examine their intuitions with the help of grammatical analysis and, occasionally, of experiments.⁷ We call this kind of spontaneous educated guess a source intuition.

Before discussing source intuitions in more detail, we want to illustrate the source ambiguity problem with two examples. As already pointed out, assessing the acceptability of a linguistic expression is only the first step. Linguistic analysis requires a decision regarding the contribution of competence grammar to the acceptability status. In some cases, the inference seems straightforward. The sentence in (1), repeated here as (3), is just such a case.

(3) ^The rat the cat the dog chased killed ate the malt. (Chomsky and Miller 1963)

For most speakers, doubly center-embedded object relative clauses as in (3) are unacceptable. Removing one level of embedding yields an acceptable sentence: The rat the cat killed ate the malt.⁸ Whether we describe the limitation in depth of embedding as a property of grammar or as a limitation imposed by general cognitive capacity is a matter of grammatical reasoning. As pointed out already in Chomsky and Miller (1963), it is hard, if not impossible, to come up with a plausible grammatical explanation. Of course, the difficulty is to define what counts as a plausible grammatical constraint. Placing a limit on the number of embeddings, as proposed in Reich (1969) and, more recently, in Karlsson (2007a, 2007b), seems rather “unsyntactic,” though it refers to structural properties.
Moreover, such a grammatical constraint is hard to reconcile with the observation that doubly embedded subject-extracted relative clauses are acceptable (The rat who evaded the cat who chased the dog ate the malt).⁹ A limit on the number of embeddings is further challenged by semantic effects (Hudson 1996), by differences between oral and written language (for corpus data, see Karlsson 2007a and 2007b), and by individual differences (e.g. Hachmann et al. 2009; Wells et al. 2009). Processing-related explanations, on the other hand, are more plausible, although the details are controversial. As pointed out in Miller and Chomsky (1963), multiple center embedding of object relative clauses results in high memory load, because the dependency between a subject and the corresponding predicate needs to be kept active in memory while further clauses are processed. Gibson’s dependency locality theory provides an explicit metric of memory and

⁷ Unfortunately, experiments testing hypothesized sources of degradedness are not yet common practice.
⁸ Punctuation (The rat, the cat, the dog chased, killed, ate the malt) might be an alternative for reducing the parsing difficulty but poses new difficulties (e.g. non-canonical use of comma and uncommon intonational bracketing).
⁹ Note that there are cross-linguistic differences in the structure and processing of multiply embedded relative clauses (see Gibson 1991; Frank et al. 2016).

integration costs (Gibson 1998). Alternative accounts focus on interference (e.g. Lewis 1996). The processing of constructions involving multiple center embedding has been intensively studied in psycholinguistics. Another rather uncontroversial case of a grammatical string with reduced acceptability owing to processing difficulty is given in (4):

(4) ^The horse raced past the barn fell. (Bever 1970)

The garden path example in (4) contains a reduced passive relative clause that is locally compatible with a main verb analysis of raced (The horse raced past the barn). Removing the local ambiguity reduces the processing difficulty and increases acceptability substantially. Disambiguation can be achieved in various ways:

(5) a. The horse that was raced past the barn fell.
    b. The horse, raced past the barn, fell.
    c. The horse ridden past the barn fell.

Versions (5b) and (5c) preserve the structural properties of (4) and therefore provide a strong argument for non-structural factors, namely processing difficulty due to ambiguity, as a main source of the reduced acceptability of (4).

Repeated examination of problematic sequences enables linguists to identify ambiguity and complexity without further cogitation. It may also facilitate concentrating on structural properties and being less distracted by plausibility and the like. Furthermore, linguists might accommodate an appropriate context more easily than non-linguists. Some of these factors may affect acceptability intuitions directly, making those of linguists less vulnerable to extra-grammatical factors, while other factors may come on top of acceptability intuitions.

Source intuitions probably also include subconscious biases based on theoretical beliefs (see Greenbaum 1976). Biases can concern general assumptions about the data structure, for example the commonly assumed dichotomy grammatical–ungrammatical, and specific sentence types or constructions. If “intuition is nothing more and nothing less than recognition” (Simon 1992: 155), then linguistic intuition is the recognition of a linguistic pattern. This recognition may trigger judgments learned through repeated exposure to judgments in the literature as well as to corresponding analyses.
In principle, theoretical biases may affect all three types of judgments discussed in this section (acceptability intuitions, source intuitions, and grammatical reasoning); but they are most critical for source intuitions, because these concern grammaticality and are unreflective. We surmise that the informal gathering of acceptability judgments taps into both acceptability intuitions and source intuitions. In informal surveys, the factor under examination is, typically, not hidden but rather highlighted by presenting to

      


participants all versions of a lexicalization. This might trigger recognition of the structure and of associated beliefs about the grammaticality. Crucially, acceptability intuitions and source intuitions serve different purposes in linguistic theorization. Acceptability intuitions constitute the data that linguistic theories aim to account for, whereas source intuitions serve as a starting point in generating hypotheses about the source of (un)acceptability. The use of the two kinds of intuition should not be confused. Only acceptability intuitions should be considered as evidence of grammaticality (see Wasow and Arnold 2005). A linguistic theory that systematically deviates from well-established acceptability intuitions must justify the discrepancy (this is typically done with reference to performance factors); the lack of a plausible explanation should result in revision or even rejection of the theory in question (for a similar argument at a more general level, see Gopnik and Schwitzgebel 1998).

13.2.3 Grammatical reasoning Source intuitions should not be equated with grammatical reasoning. Intuitions are immediate and unreflective (see Devitt 2006b), while reasoning takes time and is reflective. Source intuitions can be a point of departure for grammatical reasoning, though they are not necessary. Grammatical reasoning is based on explicit theoretical assumptions about the grammar. Hence grammatical reasoning is necessarily theory-laden. Note that source intuitions are not theory-free either. As argued earlier in this chapter, source intuitions result from internalizing judgments and analyses owing to repeated exposure. Source intuitions are basically recognized patterns. The crucial difference between source intuitions and grammatical reasoning is their unreflective versus reflective nature. However, this does not mean that linguists are fully aware of every theoretical assumption involved in grammatical reasoning. The theory-laden nature of grammatical reasoning implies the risk of theoretical biases (for examples, see Schütze 1996; also Machery and Stich 2012 and references therein). Nevertheless, we would not be as pessimistic as Lehmann (2004: 198), who states that “few linguists have escaped the temptation to dress the data they produce according to the theory they cherish.” The way Lehmann states his criticism suggests that linguists consciously adapt the data to their theories rather than adapting their theories to the data. We cannot rule out that this happens, but we optimistically assume that this is only rarely the case. Instead, intuitions—both acceptability intuitions and source intuitions—will be shaped by theoretical biases. For instance, linguists might be more inclined toward a dichotomy between acceptable and unacceptable that is based on the theoretical assumption of a binary division between grammatical and ungrammatical. Data from Dąbrowska


(2010) point in this direction. While non-linguists in her study show a gradual decline in acceptability of wh-questions, with highest acceptability ratings for prototypical instances of wh-questions that involve long extraction, lower acceptability ratings for non-prototypical instances of long wh-extraction, and lowest ratings for ungrammatical control items, the linguists’ judgments exhibit an almost binary distribution, resembling an S-curve. We think that three strategies can reduce the influence of theoretical prejudice. First, collecting acceptability judgments from laypeople, who will not have any theoretical bias, helps contrast acceptability intuitions with the linguist’s source intuitions¹⁰ and with the result of their reasoning. Second, we suggest making the process of grammatical reasoning visible in the presentation of data by using separate diacritics for acceptability and for grammaticality. Third, linguists should be more precise about the factors that they surmise are responsible for the discrepancy between acceptability and grammaticality, so that those factors become testable. We are under the impression that too often extra-grammatical effects are just stated without being made explicit—not to mention the testing of the hypothesized factor. In Syntactic Structures, Chomsky states that the aim of linguistic analysis is to distinguish the set of grammatical sequences from the set of ungrammatical sequences (Chomsky 1957: 13). To accomplish this aim, linguists first need to identify grammatical and ungrammatical sequences. This seems an easy task for clearly unacceptable sequences and for perfectly acceptable sequences.¹¹ For intermediate acceptability, Chomsky (1957: 14) suggests to “let the grammar itself decide.” We consider this approach to be in part problematic because it turns data that must be modeled by a grammar model into claims derived from that grammar model. 
On this approach, the grammatical status of a sequence varies depending on the proposed grammar. And in fact Newmeyer (1983) gives several examples that, according to him, reflect conflicting grammatical analyses despite shared acceptability intuitions.

(6) a. He left is a surprise. (Diacritic according to Bever 1970)
    b. *He left is a surprise. (Diacritic according to Chomsky and Lasnik 1977)

Bever (1970) and Chomsky and Lasnik (1977) agree in their intuition that (6) is unacceptable. They disagree, however, in their evaluation of the grammatical status of (6). Bever (1970) attributes the degradedness in acceptability to processing difficulties that result from a preference for interpreting the first sequence of (pro)noun and verb as the main clause unless its subordinate status is indicated, for example by a complementizer (That he left is a surprise). Chomsky and Lasnik (1977), on the other hand, attribute the reduced acceptability of (6) to ungrammaticality and propose a grammatical constraint to capture this claim.

¹⁰ Linguists might not always be fully aware of the difference, if their beliefs concerning grammaticality shape their acceptability intuition. The two intuitions might intermingle, producing a single intuition.
¹¹ But there are exceptions: grammatical sentences that appear unacceptable, as the examples in (1)–(3) and (4), and ungrammatical sentences that appear acceptable, as we see in (2), the missing-VP example.

13.2.4 A proposal concerning diacritics The problem we want to point out concerns the transparency of the reasoning in the presentation of the example itself. For readers who have no intuitions of their own about the acceptability of a given string because they are not native speakers of the language, it is almost impossible to determine the acceptability status and to reconstruct the reasoning (though the text discussing the example might provide the necessary clues). The reader cannot be sure whether the diacritic or its lack indicates an acceptability intuition or the result of grammatical reasoning. According to the common convention, the *-marking is supposed to indicate ungrammaticality. This is how the diacritic is used in (6). However, this usage results in loss of information; the acceptability status is not expressed in the diacritic. A sentence marked with an asterisk could be ungrammatical and unacceptable; or it could be ungrammatical but acceptable, or of moderate acceptability. As a result, the pair of linguistic expression and marking cannot easily be used for further theory building. Without acceptability information, any alternative proposal can challenge only the grammatical constraint proposed to capture the assumed ungrammaticality, but not the assignment of ungrammaticality as such. This hampers theory building. Note that the use of a question mark is no solution either, because the diacritic becomes ambiguous. It could mark intermediate acceptability or uncertainty about the grammatical status (perhaps despite a clear (un)acceptability status). The inconsistent use of diacritics has been noted virtually since their invention in the early days of generative linguistics (e.g. Newmeyer 1983; Schütze 1996; Culbertson and Gross 2009). We consider it desirable to make the distinction between intuited acceptability (= data) and hypothesized grammaticality (= hypotheses) visible in the data display. 
We therefore propose the use of two diacritics, one for indicating an acceptability intuition and another one for indicating the result of grammatical reasoning. For the latter, we keep the traditional *-marking; for acceptability intuitions, we propose the caret (“^”), which should indicate reduced acceptability.¹² Combining the two types of

¹² Since acceptability has degrees, one might consider using one or more carets to indicate degrees of (un)acceptability. We are reluctant to suggest such an approach, because it hampers comparison across

diacritics results in four possible combinations. We illustrate them with an example from Gibson and Thomas (1999). The sequence in (7a) is grammatical but typically considered unacceptable. We mark such a grammatical but unacceptable sentence with “^.” The sequence in (7b), by contrast, is (more) acceptable, but in fact ungrammatical because it lacks a verb. We mark ungrammatical but acceptable sequences with “*.” The remaining two versions of (7) are derived by dropping the most deeply embedded relative clause. The sequence in (7c) is the shortened counterpart of (7b). Thanks to the reduced complexity, the missing verb is no longer concealed. We mark the combination of ungrammatical plus unacceptable with a combination of the two diacritics “*” and “^.” Finally, a grammatical and acceptable sentence like (7d) is left unmarked.

(7) a. ^The patient the nurse the clinic had hired admitted met Jack.
    b. *The patient the nurse the clinic had hired met Jack.
    c. *^The patient the nurse met Jack.
    d. The patient the nurse admitted met Jack.
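The four combinations can be stated compactly as a mapping from two independent binary properties to a prefix. The following Python sketch is only an illustration of that mapping; the function names `diacritic` and `mark` are hypothetical and not part of the proposal itself.

```python
def diacritic(grammatical: bool, acceptable: bool) -> str:
    """Return the prefix for an example under the two-diacritic convention.

    "*" records hypothesized ungrammaticality (result of reasoning);
    "^" records intuited reduced acceptability (data).
    """
    star = "" if grammatical else "*"
    caret = "" if acceptable else "^"
    return star + caret


def mark(sentence: str, grammatical: bool, acceptable: bool) -> str:
    """Prefix a sentence with its diacritic(s)."""
    return diacritic(grammatical, acceptable) + sentence


# The four versions of (7), coded as (sentence, grammatical, acceptable).
examples = [
    ("The patient the nurse the clinic had hired admitted met Jack.", True, False),
    ("The patient the nurse the clinic had hired met Jack.", False, True),
    ("The patient the nurse met Jack.", False, False),
    ("The patient the nurse admitted met Jack.", True, True),
]

for sentence, grammatical, acceptable in examples:
    print(mark(sentence, grammatical, acceptable))
```

Because the two properties are encoded separately, a reader can recover the acceptability status of an example even when the author's grammatical analysis is disputed.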

The use of separate diacritics for acceptability and for grammaticality will increase the transparency of the grammatical reasoning. Applying our proposal to the examples in (6) would make both the conflict and the consensus between Bever (1970) and Chomsky and Lasnik (1977) immediately accessible.

(8) a. ^He left is a surprise. (Judgment according to Bever 1970)
    b. *^He left is a surprise. (Judgment according to Chomsky and Lasnik 1977)

¹² (continued) authors or papers just as much as the use of “??” and “?*” does. Instead, we would prefer quantitative data whenever the extent of (un)acceptability matters to the grammatical analysis.

13.3 Our experiments

Most syntacticians consider grammaticality to be categorical. A linguistic sequence is either ungrammatical or grammatical. This has been an explicit assumption made by most syntacticians, at least since Chomsky (1975), who notes that “[g]rammar sets up a sharp division between a class G of grammatical sentences and a class G´ of ungrammatical sentences.” This contrasts with acceptability. To our knowledge, there is not a single linguist who would claim that acceptability is binary (acceptable vs. unacceptable). That acceptability has degrees has been noted at least since Chomsky (1965) and has been discussed further in Keller (2000), Newmeyer (2003), Featherston (2005a), Sorace and Keller (2005), Sprouse (2007), and Wasow (2009), for example. Featherston (2005a) demonstrated how pervasive gradience in acceptability really is, by plotting about 1,000 acceptability ratings in ascending order. Featherston removed duplicates, which resulted in about 220 unique ratings; and he observed that the items, when ordered in ascending order, line up in a near linear fashion. Featherston’s (2005a) Figure 1 is here reproduced as Figure 13.1. Proponents of a categorical grammar typically explain such gradience with extra-grammatical factors, especially performance factors like memory load (e.g. Chomsky and Miller 1963), real-world implausibility (Sprouse 2008), and ambiguity (e.g. Bever 1970; Myers 2009: 409). However, more recent data suggest that grammaticality may also contribute to gradience. In what follows we present data for which extra-grammatical factors fail to capture the full extent of gradience.
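The ranking procedure attributed to Featherston (2005a) above (remove duplicate ratings, order the rest, inspect the shape of the resulting curve) can be sketched in Python. The data below are invented toy values, not Featherston's ratings, and using the squared correlation between value and rank as a measure of linearity is an assumption made purely for illustration.

```python
def ranked_unique(ratings, ndigits=2):
    """Round the ratings, drop duplicates, and sort in ascending order."""
    return sorted({round(r, ndigits) for r in ratings})


def r_squared_vs_rank(values):
    """Squared Pearson correlation between sorted values and their ranks.

    A value near 1 means the ranked ratings line up almost linearly;
    a step-like (categorical) pattern yields a clearly lower value.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    syy = sum((y - mean_y) ** 2 for y in values)
    return (sxy * sxy) / (sxx * syy)


# Toy data (hypothetical): evenly spread normalized ratings ...
gradient = ranked_unique([-2.0 + 0.2 * i for i in range(21)])
# ... versus two tight clusters, as a strictly binary pattern would give.
categorical = ranked_unique(
    [-2.0 + 0.01 * i for i in range(10)] + [2.0 + 0.01 * i for i in range(10)]
)

print("gradient:   ", r_squared_vs_rank(gradient))
print("categorical:", r_squared_vs_rank(categorical))
```

On such a measure, a near-linear ranking like the one in Figure 13.1 would pattern with the first toy set rather than the second.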

13.3.1 Experiment 1: A different quality of gradience To explore the question of gradience further, we take up an observation reported in Häussler et al. (2016). The corresponding study compared expert judgments provided in linguistic articles to judgments obtained experimentally from laypeople. The items for this comparison come from a corpus that consists of items and corresponding judgments (indicated by diacritics) published in Linguistic Inquiry (LI) in the years 2001 to 2010. This corpus is similar to the corpus in Sprouse et al. (2013), though some details in the extraction procedure differ (for details see Juzek 2016). We screened the items that we have extracted from LI for testability in an experiment. We marked the items whose acceptability could be confounded, for example by deictic references, strong language, unintended alternative readings

[Plot: normalized judgments (y-axis, –3 to 3) against judgments ranked (x-axis, 0 to 225).]

Figure 13.1. Judgments elicited under controlled conditions produce a linear pattern of well-formedness. Source: Featherston (2005a).


such as repair readings, garden paths, or colloquial language that might be stigmatized. This left us with 2,539 items we deemed testable. In our corpus, we distinguish between items that come from papers that use a binary scale and items that come from papers that use more than two judgment categories. Authors in the first group presumably make judgments of grammaticality only, using asterisks to mark ungrammatical items and leaving grammatical items unmarked. The second group also marks levels of (un)acceptability, using diacritics like “?,” “??,” “?*” and the like, or assumes a gradient grammar with several levels of grammaticality. Items that come from authors who use gradient scales make up the clear majority of items in our corpus, namely 81 percent. In what follows we therefore concentrate on findings for items from the second group—that is, from papers using a gradient scale. We sampled one hundred testable items from this group—fifty items marked with an asterisk in the corresponding LI-article (*-items) and fifty items unmarked in the LI-article (okay items). In a next step, we collected acceptability judgments in an online experiment using a 7-point-scale. These judgments were made by participants who are not linguists (laypeople). Eighty participants were recruited with the help of Amazon Mechanical Turk. We excluded non-native speakers of American English and participants who did not comply with the task from the results. Fifteen participants were excluded, so that we analyze results from sixty-five participants. Details of our methodology can be found in Juzek (2016) and Häussler and Juzek (2016). The results show some agreement between expert judgments and the ratings given by laypeople. Items *-marked by the corresponding authors tend to elicit low ratings from the laypeople, and items left unmarked by the experts tend to elicit high ratings from the laypeople. 
However, the results also show a substantial overlap in the rating space for *-items versus okay items. As a basic analysis, we divided the mean ratings offered by the laypeople into three bins, where the mid-bin consisted of items that received mean ratings within the range [3, 5]. Figure 13.2 illustrates the results of that comparison. Of the one hundred items, 43 percent fall into the mid-bin. Of the items in the mid-bin, 77 percent were marked with an asterisk in the paper from which they were sampled. Crucially, all items come from authors who use three or more judgment categories. That is, in principle, any of the tested items could have received an in-between judgment from the authors, marked with “?” and the like.
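The binning step can be made concrete with a short Python sketch. On the reported figures, 43 of the 100 items fall into the mid-bin, and 77 percent of those 43 corresponds to roughly 33 *-marked items. The function names and the toy item list below are hypothetical, and treating both boundaries of [3, 5] as inclusive is an assumption read off the bracket notation.

```python
def bin_label(mean_rating, low=3.0, high=5.0):
    """Assign a mean rating on the 7-point scale to one of three bins.

    The mid-bin is the closed interval [low, high].
    """
    if mean_rating < low:
        return "low"
    if mean_rating <= high:
        return "mid"
    return "high"


def summarize(items):
    """items: list of (mean_rating, starred_by_author) pairs.

    Returns the share of items in the mid-bin and, among those,
    the share that the original authors marked with an asterisk.
    """
    mid = [starred for rating, starred in items if bin_label(rating) == "mid"]
    share_mid = len(mid) / len(items)
    share_starred = sum(mid) / len(mid) if mid else 0.0
    return share_mid, share_starred


# Toy items (hypothetical values, not the experimental data).
toy_items = [
    (1.8, True), (2.9, True), (3.0, True), (3.6, True),
    (4.4, False), (5.0, True), (5.1, False), (6.5, False),
]

print(summarize(toy_items))

# Arithmetic on the reported percentages: 77 percent of 43 mid-bin items.
print(round(0.77 * 43))
```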

13.3.2 Preliminary discussion

A high prevalence of gradience in acceptability has been observed before, notably by Featherston (2005a), as already discussed. The crucial question is, where does the gradience come from? Previously the gradience was assumed to be caused by

      

[Plot: “Gradient author judgments and 7-ps experimental ratings”; mean rating (y-axis, 1 to 7) against item rank (x-axis).]

Figure 13.2. Mean ratings for the 100 LI items (y-axis), in ascending order (x-axis), as given by the non-linguists in our online experiment. Items in the bin [3, 5] are highlighted. Asterisks are items marked as ungrammatical by the original LI authors; circles are items unmarked in LI.

extra-grammatical factors. However, taking our results into account, such an explanation becomes less likely, as we will show here as we present those extragrammatical factors that are commonly assumed to cause the observed gradience. In our discussion we distinguish between performance factors and common methodological artifacts, namely aggregation effects, scale biases, and scale effects. Performance factors. Performance factors are factors such as complexity, memory burden, ambiguity, and real-world implausibility. Such factors can degrade the perceived acceptability of otherwise grammatical items, as the sequence in (1) illustrates. However, as mentioned before, the items were screened for potential confounds before the collection of experimental ratings. The experiment did not include garden path sentences and the like. Because of their origin and their purpose in the LI paper, the examples do not involve semantic problems either. Crucially, most items in the mid-bin are marked as ungrammatical by their corresponding LI authors. These items are not subject to degradation but rather benefit from amelioration. Participants rated them better than we would expect, given that the LI authors marked them as ungrammatical. This pattern resembles an effect described in the psycholinguistic literature as “ungrammatical acceptability” (Frazier 2008), “grammaticality illusion” (Vasishth et al. 2010), or “grammatical illusion” (Phillips et al. 2011). Prominent examples are the missing-VP effect illustrated in (7b), repeated here as (9), and attraction errors as in (10). (9) *The patient the nurse the clinic had hired met Jack. (Gibson and Thomas 1999)

(10) *The key to the cabinets were rusty from many years of disuse. (Pearlmutter et al. 1999)

Sentences like (9) and (10) are ungrammatical, yet they are occasionally accepted in rating experiments and show reduced reflexes of ungrammaticality in reading experiments (e.g. Gibson and Thomas 1999; Christiansen and MacDonald 2009; Vasishth et al. 2010). The illusion can be accounted for by the difficulty of navigating complex syntactic structures in working memory. For instance, integrating the second verb (met) in (9) with the highest subject (the patient) increases the chance of not noticing that the intermediate subject (the nurse) lacks a verb (cf. Häussler and Bader 2015). Likewise, erroneously retrieving the plural modifier NP (the cabinets) in (10) for agreement checking creates an apparent plural agreement with the verb (were). Other illusions also involve checking processes, such as the negative polarity item illusion in (11), or complex semantic structures, such as the comparative illusion in (12) (see Montalbetti 1984; Vasishth et al. 2008; Parker and Phillips 2016; Wellwood et al. 2018).

(11) *The bills that no senator voted for will ever become law.
(12) *More people have been to Russia than I have. (Montalbetti 1984)

The *-items in our data set are different. They do not include any of the phenomena known to create grammatical illusions and do not involve complex structures with multiple potential antecedents or integration sites for critical items. Examples are given in (13), together with the diacritics in the corresponding LI article and mean experimental ratings (in parentheses).

(13) a. *John was hoped to win the game. (4.20)
     b. *If you want good cheese, you only ought go to the North End. (4.40)
     c. *John beseeched for Harriet to leave. (4.51)
     d. *John said to take care of himself. (4.83)
     e. *October 1st, he came back. (5.00)
     f. *John pounded the yam yesterday to a very fine and juicy pulp. (5.26)
     g. *Sue estimated Bill. (5.31)
     h. *I read something yesterday John had recommended. (5.33)

Although we might be wrong about individual items, we consider it unlikely that our data set contains about thirty illusion phenomena that linguists were not aware of so far. We therefore conclude that the intermediate ratings in the experiment do not reflect only apparent intermediate acceptability.

      


Aggregation. Intermediate mean ratings could result from averaging individual ratings that are categorical in nature. If this is the case, we would expect the individual ratings for items with intermediate mean acceptability to show underlying bimodal distributions. For example, in our experiment with the 7-point scale, it is possible that, while the ratings averaged over participants (i.e. the aggregated ratings) came out as “in-between,” individual participants only gave endpoint ratings, that is, “1” and “7.” If we observe a mean value of “4,” that could be the average of an equal number of “1” ratings and “7” ratings. Thus, to check whether the observed gradience is due to aggregation effects, we also need to look at individual ratings.

For large sets of items, heat maps are an effective way of plotting the distributions of individual ratings. Heat maps represent data values in terms of colors.¹³ Clusters of cells with similar values in a data matrix come out as areas of similar color. Figure 13.3 gives the heat map for the individual ratings in our experiment. Shades of grey in this figure indicate the frequency of ratings in absolute counts.

[Plot: “Gradient author judgments and 7-ps experimental ratings”; rating (y-axis, 1 to 7) against item rank (x-axis, 10 to 100), with shading for the number of ratings (0 to 20).]

Figure 13.3. A heat map of the individual ratings for the 100 LI items (y-axis) in ascending order (x-axis), as given by the non-linguists in our online experiment. Ratings range from 1 to 7. The darker the shade of gray, the more participants have provided the rating in question, and vice versa.
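The aggregation check can be illustrated with a small Python sketch: two sets of individual ratings with the same mean but very different distributions. Each list returned by `rating_counts` corresponds to one column of a heat map like Figure 13.3. The rating sets are invented, and the helper names and the `endpoint_share` summary are assumptions made for illustration, not part of the reported analysis.

```python
from collections import Counter

SCALE = range(1, 8)  # the 7-point scale, 1 to 7


def mean(ratings):
    """Arithmetic mean of a list of ratings."""
    return sum(ratings) / len(ratings)


def rating_counts(ratings):
    """One heat-map column: how many participants chose each scale point."""
    counts = Counter(ratings)
    return [counts.get(point, 0) for point in SCALE]


def endpoint_share(ratings, endpoints=(1, 7)):
    """Share of ratings that sit on the endpoints of the scale."""
    return sum(r in endpoints for r in ratings) / len(ratings)


# Hypothetical item A: categorical judgments that only average out to 4.
polarized = [1] * 10 + [7] * 10
# Hypothetical item B: participants really give in-between ratings.
graded = [3] * 5 + [4] * 10 + [5] * 5

for name, ratings in [("polarized", polarized), ("graded", graded)]:
    print(name, mean(ratings), rating_counts(ratings), endpoint_share(ratings))
```

Both toy items have the same mean, so only the per-rating counts distinguish an aggregation artifact (two modes at the endpoints) from genuinely intermediate judgments (mass in the middle of the scale).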

¹³ Common examples in other domains are relief maps with color intensity representing altitude (the darker the blue, the deeper the water) or thermal maps with color indicating the temperature (of a scanned body, a room, etc.).


For each item (on the x-axis) and each rating category (on the y-axis), the map indicates the number of ratings, that is, the number of participants who gave this particular rating to this particular item. The darker the grey, the more participants gave the corresponding rating. A bimodal distribution would be recognizable thanks to two dark cells for that item: the two modes. Figure 13.3 shows no signs of bimodal distributions. For items with intermediate mean ratings, we see medium grey in the middle rather than two areas of dark grey at the top and the bottom of the figure. This means that many participants in fact gave intermediate ratings for these items. Thus the intermediate mean ratings are not the result of averaging over two sets of endpoint ratings. That is, for any item in the mid-bin in Figure 13.2, many participants gave an “in-between” rating, as illustrated in Figure 13.3. We conclude that the observed gradience is not caused by aggregation.

Scale biases. A problem that any rating study must solve concerns scale biases. Even though participants are confronted with the same scale, they may use it in different ways. The same holds for the authors of the extracted items. They simply may apply the scale in different ways; that is, they may apply different criteria for assigning a *-marking or some other diacritic or for leaving the item unmarked. For the data set we examine here, scale biases are less likely to have strong effects. The corresponding authors did use diacritics other than “*” in their paper. Hence they distinguish levels of acceptability. We found no hints that these authors assume more than two levels of grammaticality, that is, a gradient grammar. They seem to assume a binary division of grammatical versus ungrammatical. Diacritics like “?” are used to mark sentences that are grammatical or ungrammatical, but neither fully acceptable nor completely unacceptable.
Since the authors used diacritics like “?,” they could have done so for the items we extracted (only *-items and okay items); but they chose not to do so. We can imagine two reasons underlying the authors’ choice. The authors might have considered the sentences as clearly (un)acceptable. In this case, the experimental ratings should cluster at the endpoints. A plot ordering the items by their mean acceptability should show an S-curve as a noisy approximation of the step function that creates the binary division. Scale biases would affect the dilation of the clusters but should not create a seamless increase. The observed near linear increase cannot be explained as the result of scale biases. Alternatively, the authors may perceive the acceptability of the corresponding items as mediocre, but may be sure about the grammatical status. In this case there would be nothing to compare, since the authors judge grammaticality while the participants in the experiment rate acceptability. This scenario would support our argument for using separate diacritics to indicate grammaticality versus acceptability.

      


Scale effects. Scale effects occur when the data set is imbalanced such that it does not include items that would cover the entire scale. When presented with such imbalanced data, participants might tend to readjust the scale and give ratings that cover the whole scale again. For instance, in an experiment in which only unacceptable items are tested, participants might use parts of the scale that they would not have used for unacceptable items if acceptable items had been present. As shown by Cowart (1997), experiments with a balanced set of items and with fillers are less vulnerable to scale effects. (For comparable effects in binary judgment tasks, see Sprouse 2009.) However, scale effects do not affect all linguistic sequences equally. Featherston (2017) argues that certain items are remarkably stable in their ratings, regardless of the other items included in the experimental materials, and that using them as absolute calibration items effectively counters scale effects and other, related distortions. The impact of scale effects is difficult to predict, and the question arises whether scale effects could explain the gradience observed so far. In the present case, participants rated *-marked and unmarked items—that is, unacceptable and acceptable sentences. Participants might have used the middle part of the scale “to fill in the vacuum.” To investigate the impact of scale effects further, we have conducted a second experiment.

13.3.3 Experiment 2: Scale effects

In the second experiment we originally wished to re-test the items from Experiment 1, but only those that fell into the outer bins of [1, 3[ and ]5, 7]. This would have applied to fifty-seven items. If those items were re-tested, then the following should apply: if the gradience in Experiment 1 was due to scale effects, then one would expect the mid-bin to repopulate in the same quality and quantity as seen before. However, at the time of Experiment 2, Amazon Mechanical Turk was unavailable to non-US individuals (both workers and requesters). Thus we had to recruit our participants elsewhere, and we decided to use Prolific instead (https://prolific.ac/). Prolific is mainly used by speakers of British English. Thus we would be comparing the ratings made by speakers of American English with the ratings made by speakers of British English, and this might have distorted the results. To avoid this, we first reran Experiment 1 with participants recruited through Prolific. As before, we ran an online judgment task and asked the participants to rate the items on a 7-point scale. We recruited thirty participants, out of whom we excluded five for not complying with the task. For technical reasons, we collected, for each participant, twice as many judgments as in Experiment 1. In all other respects, this replication follows Experiment 1. The results of the replication study validate those of Experiment 1. Again, ratings


Figure 13.4. Mean ratings for the 55 LI items that we have retested (y-axis) in ascending order (x-axis), as given by the non-linguists in our second online experiment. Items in the bin [3, 5] are highlighted. Asterisks are items that are marked as ungrammatical by the original LI authors; circles are items that are unmarked in LI.

cover the entire scale, and especially the mid-bin. Almost half of all items (45 percent) fell into the mid-bin of [3, 5] and 55 percent of all items fell into the outer bins of [1, 3[ and ]5, 7]. In a second step, we re-tested only those items that fell into the outer bins in the replication of Experiment 1—the 55 percent—with the same methodology as in the replication study. In the follow-up we had thirty participants, out of whom we excluded four for not complying with the task. The results of this follow-up are illustrated in Figure 13.4. The mid-bin is repopulated to some extent. Now, 25 percent of the items fall into the range [3, 5]. However, the refill is different in quality. In Experiment 1, 22 percent of all items fall into the narrower bin of [3.5, 4.5] and, for the replication using Prolific, this figure stands at 24 percent. However, only 7 percent of all the re-tested items in the follow-up fall into this narrower range and we now observe a pattern that somewhat resembles the S-curve we would have expected in Experiment 1. Thus we conclude that scale effects can explain some of the observed gradience, but certainly not all of it.
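The bin shares reported above (outer bins [1, 3[ and ]5, 7], mid-bin [3, 5], and the narrower bin [3.5, 4.5]) are straightforward to compute from per-item mean ratings. The following Python sketch shows one way of doing so; the function name `bin_shares` and the list of item means are hypothetical and do not reproduce the experimental data.

```python
def bin_shares(item_means, low=3.0, high=5.0):
    """Return the share of items below, inside, and above the closed
    mid-bin [low, high]; the outer bins are thus [1, low[ and ]high, 7]."""
    n = len(item_means)
    below = sum(1 for m in item_means if m < low)
    inside = sum(1 for m in item_means if low <= m <= high)
    above = sum(1 for m in item_means if m > high)
    return below / n, inside / n, above / n


# Hypothetical per-item mean ratings on a 7-point scale
means = [1.4, 2.1, 2.9, 3.0, 3.8, 4.2, 5.0, 5.6, 6.3, 6.8]

print(bin_shares(means))               # shares for [1, 3[, [3, 5], ]5, 7]
print(bin_shares(means, 3.5, 4.5)[1])  # share in the narrower bin [3.5, 4.5]

# Items that would be re-tested in a follow-up: only those in the outer bins
outer_items = [m for m in means if m < 3.0 or m > 5.0]
print(outer_items)
```

Comparing the share in the narrower bin before and after re-testing only the outer-bin items is the kind of comparison on which the conclusion about scale effects rests.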

13.4 The puzzle of gradience

13.4.1 Extra-grammatical factors

After excluding methodological factors as the main source of the observed gradience, we are still left with several possibilities. In principle, gradience could stem from the grammar itself, or from extra-grammatical factors, or from both.

      


Given that acceptability ratings necessarily reflect both grammatical and extra-grammatical factors, the two possibilities are hard to disentangle. Adopting a term coined by Hofmeister et al. (2013), this could be called a source ambiguity problem. Hofmeister and colleagues used this term for the challenge of deciding whether an effect observed in acceptability ratings reflects a grammatical constraint or a processing effect. The problem of identifying the source of degradedness in acceptability extends, however, to other impact factors as well, for example to plausibility (see Juzek and Häussler 2019). Let us assume that grammaticality is a binary distinction. Although very common, this is not a necessary assumption. Gradience can be accommodated as part of the grammar by including a quantitative component, for example through constraint weights, as in the Decathlon model (Featherston 2005a) or in linear optimality theory (Keller 2000). However, for now, we maintain the assumption of a categorical grammar for two reasons: (i) this is the prevalent view in the field; and (ii) it allows us to present the puzzle in its strongest form. We checked the papers from which our items were extracted and found no indication that the authors assume a gradient grammar. They do not refer to any existing gradient grammar model and do not mention gradience. We conclude that the authors use diacritics such as “?” to indicate reduced acceptability or uncertainty rather than degrees of grammaticality. For the items under investigation, the authors chose “*”-marking or left the item unmarked. On the basis of what has been said so far, we arrive at the premises (P1)–(P6).

(P1) The observed gradience is not an artifact caused by aggregation.
(P2) The observed gradience is not an artifact caused by scale biases, especially considering that all the LI authors from whom we took our samples were using three or more categories (“*,” some form of “?,” and “OK”).
(P3) The observed gradience cannot be fully explained by scale effects.
(P4) The observed gradience cannot be fully explained by performance factors, especially considering that the majority of the items with intermediate acceptability ratings in the experiment were marked with an asterisk “*” in the paper they were sampled from.
(P5) The authors of the LI papers are not blatantly wrong.
(P6) Grammaticality is categorical.

Taken together, these propositions lead us to what we call the puzzle of gradience, given in (PG):

(PG) How can the high prevalence of gradience in acceptability be explained?


Unless there is another factor that we are not aware of and that is not listed here, the puzzle can be resolved only by giving up one of the premises. So far we have argued for the validity of (P1)–(P4); (P5) is corroborated by findings by Culbertson and Gross (2009), Dąbrowska (2010), and Sprouse et al. (2013), for example. Thus, in our view, (P6) is the premise with the least support. Beyond the present data set, the puzzle of gradience boils down to the question of whether we are forced to include a quantitative component in our grammar model(s) in order to allow for gradience as a property of grammar. Though Newmeyer (2007) is right that gradience in acceptability does not necessarily imply gradience in grammar, the latter is at least a logical possibility that should not be excluded a priori.

13.5 Concluding remarks

We do not have a definite solution to the puzzle, and it is questionable whether definitive evidence on the question of categorical versus gradient grammars can be provided at all. However, we think that our data are circumstantial evidence for gradient grammars. As argued, claiming that the observed gradience is simply due to performance factors is not sufficient for the present data. Proponents of categorical grammars have to offer another explanation. Previously, certainly before Featherston’s results (Featherston 2005a), performance factors were accepted as an explanation for the observed gradience. The default assumption was that the grammar is categorical, and evidence was required to make a case for a gradient grammar. In the present work, we showed that known performance factors and other methodological factors such as aggregation effects, scale biases, and scale effects are not sufficient to explain the observed gradience. Thus we think that the burden of proof has shifted. A gradient grammar seems more plausible, and more evidence is needed to support the assumption of a categorical grammar.

Acknowledgments

The authors would like to thank Tom Wasow for valuable advice as well as the audience of the workshop “Linguistic Intuitions, Evidence, and Expertise” and two anonymous reviewers. Experiment 1 was partly funded through the Jesus College, Oxford, Research Allowance. Both authors contributed equally to the chapter.

14 Experiments in syntax and philosophy: The method of choice?

Samuel Schindler and Karen Brøcker

14.1 Introduction

In the Chomskyan tradition of the study of syntax, it is common practice to “informally” consult one’s own or one’s colleagues’ linguistic intuitions when building and testing theories of grammar. More recently, however, critical questions have been raised about this approach by proponents of experimental syntax (XSyn). In their pioneering works, Schütze (1996) and Cowart (1997) have argued that linguists should adopt a more “scientific” approach and put theories of grammar on a broader foundation by collecting linguistic intuitions from a large number of ordinary speakers and by applying well-established statistical methods. Interestingly, this call for a more systematic approach in the practice of syntacticians is echoed in recent discussions in metaphilosophy, that is, in the study of philosophical methods. Like linguists, philosophers have traditionally used their intuitions in thought experiments, so as to assess their theories. But starting in the early 2000s, some philosophers have criticized this informal method and campaigned for a more systematic investigation of intuitions from non-philosophers. This critical approach is generally known as experimental philosophy, or simply XPhi. Two pioneers and proponents of XPhi have recently sought to motivate XPhi by appealing to XSyn and its alleged benefits in linguistics (Machery and Stich 2012). In this chapter we argue that claims about the superiority of experimental methods in syntax are not always justified. Experimental methods in linguistics cannot therefore serve unconditionally as a model for the promotion of experimental methods in philosophy. This is how we proceed. In section 14.2 we review claims about the methodological superiority of XSyn in comparison to traditional, informal methods of using linguistic intuitions as evidence in syntactic research. In section 14.3 we assess whether these claims are justified.
In section 14.4 we discuss Machery and Stich’s appeal to XSyn in their championing of XPhi. In section 14.5 we assess Machery and Stich’s claims in the light of our discussion in section 14.3. In section 14.6 we conclude in favor of methodological pluralism.

Samuel Schindler and Karen Brøcker, Experiments in syntax and philosophy: The method of choice? In: Linguistic Intuitions: Evidence and Method. First edition. Edited by: Samuel Schindler, Anna Drożdżowicz, and Karen Brøcker, Oxford University Press (2020). © Samuel Schindler and Karen Brøcker. DOI: 10.1093/oso/9780198840558.003.0014


14.2 Experimental syntax

The way linguistic intuitions have traditionally been collected in linguistics can, in short, be characterized like this. Linguists construct a sentence that contains some syntactic phenomenon that they are interested in. They then ask a native speaker, most commonly themselves, whether the sentence appears to be acceptable or not. Often they may ask for the opinion of one or more colleagues as well and refine their analysis on the basis of the colleagues’ responses. The sentence may be considered in the context of other, similar sentences, which differ from the focus sentence mainly in the phenomenon of interest (forming minimal or near minimal pairs with the focus sentence). Schütze and Sprouse (2013) mention five ways in which what they call traditional judgment experiments are different from standard practice in the neighboring field of experimental psychology:

I. “relatively few speakers (fewer than ten);
II. linguists themselves as the participants;
III. relatively impoverished response options (such as just ‘acceptable,’ ‘unacceptable,’ and perhaps ‘marginal’);
IV. relatively few tokens of the structure of interest, and
V. relatively unsystematic data analysis.” (Schütze and Sprouse 2013: 30)

Prima facie, this approach looks utterly unscientific. Accordingly, a number of commentators have criticized this informal practice and called for a more systematic approach. Such an approach is generally called “experimental syntax,” after the title of Cowart’s (1997) book. Cowart focuses on how you can design and run syntactic experiments so as to avoid the problems of the traditional method. According to Cowart, the basic experimental set-up for collecting linguistic intuitions should consist of a questionnaire and should use multiple informants and sentences that come in paradigm-like sets with multiple sentences of each type and with varying order of presentation between informants. Finally, the results should be subjected to relevant statistical tests (Cowart 1997: 12–13). Similar recommendations for experimental work in syntax are found in Schütze (1996), Wasow and Arnold (2005), Featherston (2007), and Gibson et al. (2013). Another recommendation that recurs in the XSyn literature is that subjects should not know what the hypothesis being tested is, so that linguists should not use their own intuitions as evidence (see e.g. Wasow and Arnold 2005). Proponents of XSyn argue that a change from the traditional method to an experimental approach as described here would bring about several improvements. First, they claim that adopting experimental methods will lead to more reliable data by weeding out error variance (random fluctuations across participants).

    


Reliability in this sense means consistency over time and across circumstances. When many participants are asked to judge many sentences, random fluctuations cancel each other out, and this produces a higher reliability (consistency) of the results. Second, proponents of XSyn argue that experimental methods will lead to a higher degree of validity (ensuring that we are actually investigating the phenomenon of our interest) by avoiding non-random, irrelevant effects such as experimenter bias, other unconscious biases, unwanted lexical effects and so on (see e.g. Wasow and Arnold 2005; Myers 2009; Gibson and Fedorenko 2010). One way in which experimental studies can avoid such effects lies in the design of experimental studies. For instance, by letting subjects judge multiple lexicalizations of a target structure, one can try to control for parsing issues. Thus, in addition to presenting subjects with the sentence “the lawyer visited on Tuesday was a mess,” which might be judged unacceptable by subjects, one can present (the same or other) subjects with the structurally similar sentence “the factory visited on Tuesday was a mess,” in which the semantics arguably blocks the wrong parse (factories cannot go on visits).¹ Third, proponents of XSyn argue that experimental studies allow for obtaining more nuanced data. Featherston (2007: 275) argues that traditional methods do not shed sufficient light on cases where multiple grammatical phenomena interact, or on cases where the effects are relatively small. Another benefit of using the more sensitive experimental methods, according to some proponents of XSyn, is that they allow for the detection of gradience in acceptability and grammaticality (see e.g. Featherston 2007). Lastly, proponents of XSyn argue that a change from the traditional method to experimental methods will yield more scientific methodology and data practices in general. 
Some commonly mentioned aspects of this more scientific practice are objectivity, rigor, and transparency (Ferreira 2005; Cowart 1997; Myers 2009; Gibson et al. 2013). We can thus summarize the motivations for XSyn, as compared to more traditional methods:

1. better reliability: less error variance and noise in the data;
2. better validity: less theoretical bias and fewer irrelevant factors;
3. higher sensitivity and richer data: better detection of more aspects of the phenomena;
4. overall better, more scientific methodology.

In the following section we shall discuss whether these motivations are well founded.
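The first motivation, that random fluctuations cancel each other out when many participants judge a sentence, can be illustrated with a small simulation. The Python sketch below is not taken from the XSyn literature: the true acceptability value, the noise level, and the function names are assumptions made for illustration, and the ratings are modelled as unbounded Gaussian values rather than points on a bounded scale.

```python
import random
import statistics


def mean_rating(true_value, n_participants, noise_sd, rng):
    """Average n noisy ratings of one item; the noise stands in for
    error variance (random fluctuations across participants)."""
    ratings = [rng.gauss(true_value, noise_sd) for _ in range(n_participants)]
    return statistics.fmean(ratings)


def spread_of_means(n_participants, n_runs=500, true_value=4.0,
                    noise_sd=1.5, seed=0):
    """Standard deviation of the item mean across repeated simulated
    experiments; a smaller spread means more consistent results."""
    rng = random.Random(seed)
    means = [mean_rating(true_value, n_participants, noise_sd, rng)
             for _ in range(n_runs)]
    return statistics.stdev(means)


small = spread_of_means(n_participants=5)     # informal-sized sample
large = spread_of_means(n_participants=100)   # experiment-sized sample
print(f"spread with 5 raters:   {small:.2f}")
print(f"spread with 100 raters: {large:.2f}")
```

Under these assumptions the spread shrinks roughly with the square root of the number of raters. The simulation only addresses random error; it says nothing about systematic effects, which fall under validity rather than reliability.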

¹ We wish to thank an anonymous referee for bringing up this issue and for providing the examples.


14.3 Is XSyn methodologically superior?

The formal methods of data collection preached by the XSyn movement have not fallen on deaf ears; even for those sympathetic to traditional methods, they have become important tools of investigation (Sprouse 2015). However, informal methods of collecting linguistic intuitions are still very much predominant in syntactic research.² This could of course be so for purely pragmatic reasons. That is, linguists might stick to informal methods because this is simply the most convenient, least time-consuming, and least expensive way to obtain linguistic intuitions. In a suboptimal world, linguists would stick to traditional methods because they are convenient, even if experimental methods are superior. Do we live in such a world? In order to assess this question, we will discuss each of the four aforementioned claims made by XSyn proponents.

14.3.1 Better reliability of the data gathered through XSyn?

In a seminal contribution to the debate about whether the reliability of formal methods of collecting linguistic intuitions is superior to that of informal ones, Sprouse and Almeida (2012a) tested all (i.e. 469) acceptability judgments about English sentences found in a popular syntax textbook with a large sample of ordinary speakers. They conservatively estimated that at least 98 percent of the judgments of linguists and ordinary speakers converge (interpreting all failures to replicate as true negatives). Likewise, Sprouse et al. (2013) tested 148 randomly sampled judgments from a leading linguistics journal and estimated a convergence rate of 95 percent, with a margin of error of 5.3–5.8 percent. In a discussion note, Sprouse and Almeida remark about both of these results:

These high (conservative) convergence rates suggest that the sample sizes used by linguists (whatever they are) historically have introduced little error to the empirical record for any combination of the following reasons: (1) the samples are larger than what critics claim; (2) the effect sizes are so large that small samples still yield good statistical power; or (3) [acceptability judgment] results are highly replicated before and after publication (e.g., Phillips 2009). (Sprouse and Almeida 2017b)

² A search in the comprehensive database Linguistics and Language Behavior Abstracts reveals that in the past ten years there have been only twenty-seven peer-reviewed journal articles that contain the keyword “experimental syntax” (in the English language). Note that this is likely to underestimate the use of experimental methods, as not all papers will necessarily use the term.

    


Thus it seems that the acceptability judgments obtained by informal methods, mostly from linguists themselves, reliably reflect the acceptability judgments obtained by more formal methods from large numbers of ordinary speakers.

14.3.2 Better validity through XSyn?

One advantage that is often emphasized by the proponents of XSyn is that the intuitions of ordinary subjects are not subject to theoretical biases as much as the intuitions of linguistic experts (Schütze 1996; Cowart 1997; Featherston 2007; Gibson and Fedorenko 2010, 2013). There is, however, another set of concerns, which pulls in the opposite direction with regard to the decision whether to use expert linguists or laypeople, namely pragmatic factors. These factors are sometimes underestimated by proponents of XSyn. For example, ordinary subjects who are asked to provide an acceptability judgment for the first time in their lives might be confused about the purpose of the task. They might, for example, misunderstand the task as asking them to judge whether the sentence in question accords with their language community’s etiquette. Similar concerns hold for other instructions, such as “does this sound natural to you,” which might be understood to ask about the relative frequency of structures similar to the ones presented. More sophisticated strategies have been developed for making sure that ordinary speakers understand the purpose of the task presented to them (see e.g. Gibson and Fedorenko 2013 and Sprouse 2015). However, for all the cautionary steps one might take, it still seems plausible that linguists, by virtue of their training, know best what acceptability judgment tasks require. Another set of concerns relates to performance factors such as parsing and memory constraints. Many ordinary subjects would for example judge the (by now famous) sentence “The horse raced past the barn fell” unacceptable (for easier reference, we will refer to this sentence simply as HORSE in what follows). Linguists, on the other hand, would recognize that this judgment arises from processing limitations in the parsing of the sentence (a so-called garden path effect) and that the sentence is nevertheless perfectly grammatical.
It is this judgment that the sentence is grammatical that would then be used as evidence for or against theories of grammar (rather than the plain acceptability judgment per se). Similarly, most subjects would find center-embedded phrases like “A man that a woman that a child that a bird that I heard saw knows loves” unacceptable, even though it is perfectly grammatical (it derives from “A man that a woman loves”). Again, a grammar consistent with this sentence would be seen as supported, not contradicted, by the linguistic evidence. As mentioned in section 14.2, proponents of XSyn have claimed that performance factors cancel each other out in appropriately designed experiments that use large numbers of ordinary subjects. In order to control for performance effects in sentences like HORSE or in


center-embedded sentences such as the one considered earlier, one will have to vary their lexicalization. For example, a sentence like “the paint daubed on the wall stank” (Collins 2008b) is presumably parsable without difficulty also for people unfamiliar with the garden path effect. This will not work for all targets, though; multiply center-embedded sentences are a case in point. Also, the controls can only be as good as what the experts manage to come up with: the best performance of the folk is thus constrained by the ingenuity of the designers of the experiment. But the design may not be good enough to extract the grammaticality judgments of interest in a given case. And of course, even with very good controls, things can still go wrong. In contrast, professional linguists, by virtue of their training and experience, know what kinds of extraneous effects to watch out for and have honed their skills with extremely many examples. Well-designed experiments may make it even more likely for them to successfully control for confounders; but, even when everything goes well, lay subjects may arguably do, at best, just as well as the experts. This is not to say that there is no epistemological advantage to using lay subjects instead of linguistic experts. On the contrary, experiments with subjects without any theoretical stakes in debates about grammar and without relevant grammatical presuppositions seem to be much better suited for shielding against theoretical bias. We do not think that performance factors are a priori and intrinsically more problematic than theoretical bias is. But we do think that both theoretical bias and performance factors are important errors and that the risks of both must be carefully weighed against each other, as there are both costs and benefits in using linguistic experts versus ordinary speakers. Sprouse and Almeida have also commented on the risk of theoretical bias in acceptability judgments. 
They argue that, if theoretical bias were a real concern, one would expect “sign reversals” between expert and naïve subject populations (Sprouse and Almeida 2017b). That is, if linguists’ judgments were biased, there should be many instances in which linguists judge a sentence acceptable whereas ordinary speakers do not, and vice versa. Again, Sprouse and Almeida find no evidence for this idea in their textbook study (no sign reversals) and very few instances in their journal study (1–3 percent).³ Sprouse and Almeida make another interesting point about the risk of theoretical bias: although syntacticians have constructed many substantially (or even radically) different kinds of syntactic theories, this divergence is “rarely based on different data sets” (Sprouse and Almeida 2012a: 631). Instead the two conclude: “whatever disagreements there are in linguistics literature, they appear

³ A referee for this chapter remarked (rightly, we think) that avoiding sign reversals may be too low a standard. In other disciplines, for instance in psychology, the fact that effects do not replicate or are just weaker than the original results is already a reason for concern. The equivalent in linguistics, we suppose, would be weaker acceptability judgments.

    


to obtain mostly at the level of interpreting, not establishing, the data” (see also Phillips 2009).
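The notion of a “sign reversal” discussed in this section can be made concrete with a short Python sketch. It is a deliberately simplified illustration: the function `classify`, the threshold `min_diff` (which stands in for a proper significance test), and the data are all hypothetical and are not drawn from Sprouse and Almeida’s studies.

```python
from collections import Counter


def classify(expert_direction, naive_mean_a, naive_mean_b, min_diff=0.5):
    """Compare an expert's claimed contrast with naive mean ratings.

    expert_direction: +1 if the expert judges sentence A more acceptable
                      than sentence B, -1 if less acceptable.
    min_diff: hypothetical threshold standing in for a significance test.
    """
    diff = naive_mean_a - naive_mean_b
    if abs(diff) < min_diff:
        return "no clear difference"
    naive_direction = 1 if diff > 0 else -1
    if naive_direction == expert_direction:
        return "converges"
    return "sign reversal"


# Hypothetical sentence pairs: (expert direction, naive mean A, naive mean B)
pairs = [
    (+1, 6.1, 2.3),
    (+1, 4.0, 3.8),
    (+1, 2.5, 5.0),
    (-1, 2.0, 6.0),
]

outcomes = Counter(classify(*pair) for pair in pairs)
print(outcomes)
```

On this simplified picture, a failure to replicate (“no clear difference”) is distinguished from a sign reversal, which is the stronger kind of disagreement that theoretical bias would be expected to produce.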

14.3.3 Richer data through XSyn?

It is generally accepted that acceptability judgments exhibit gradience (more vs. less acceptable) rather than categoricity (acceptable vs. unacceptable). It is also accepted that XSyn is a good means of revealing gradience. It is, however, controversial whether gradience in acceptability indicates real degrees of grammaticality, or whether acceptability judgments exhibit gradience despite a categorical grammar (see also Haider 2007; Fanselow 2007; Phillips 2009).⁴ Given the path linguistics has taken, it would seem that most linguists believe the second disjunct: although Chomsky himself initially thought that his theories could and should capture degrees of grammaticality (Chomsky 1957, 1965), and although there have been several attempts to develop grammars that do allow for grammatical gradience (Bard et al. 1996; Sorace and Keller 2005; Keller 2000; Featherston 2005a; Fanselow et al. 2006), gradience has not played a major role in “mainstream” theories of grammar.⁵ In order to detect gradience in acceptability judgments, non-categorical scales must be used. The seemingly most straightforward way to comply with this requirement is by using Likert scales (with n points). The use of such scales, however, is intrinsically problematic. First, there is a possibility that, for any chosen n, subjects’ intuitions distinguish more than n degrees of acceptability. Second, there is no guarantee that the distances between the n points are equal: for example subjects might use the scale difference between 1 and 2 in such a way that it implies a larger (psychological) distance than the scale distance between 3 and 4. In view of these difficulties, some have suggested applying so-called magnitude estimation to acceptability judgments (Bard et al. 1996). In magnitude estimation, which was developed in psychophysics, subjects are presented with a stimulus (e.g. a source of light of a certain brightness) to which they are asked to assign a standard (e.g. 100).
They are then asked to assess other stimuli of the same kind by comparison to the standard. For example, if a subject believes that a light source is twice as strong as the standard, it would be assigned the value 200. The advantage of magnitude estimation is apparent: unlike in Likert scales, here subjects can

⁴ It is interesting to note that even judgments about clearly categorical concepts, such as number oddity, yield stable gradience (see Armstrong et al. 1983).

⁵ In a recent survey carried out by one of us (Brøcker 2019; see also this volume), generative linguists were asked whether gradience results in acceptability judgment experiments could be due to a graded grammar or whether such results are more likely due to extra-grammatical factors. There was no significant difference in the frequency with which each option was chosen. This result seems somewhat at odds with the fact that categorical grammars are so prevalent in the literature. As the discussion in the rest of this section indicates, though, there may be good reasons for this prevalence.


choose the grain of the scales for themselves; and the distances between units are stably defined in relation to the standard. The indefinite grain of magnitude estimation presumably is much better suited for reflecting subjects’ own degrees of gradience (which may vary). Although magnitude estimation seems a powerful tool for probing acceptability judgments, critics have noted that the gain in sensitivity with magnitude estimation as compared to other tasks is effectively insubstantial, as the results of several studies using magnitude estimation have been shown to be representable equally well with Likert scales (Bader and Häussler 2010; Weskott and Fanselow 2008, 2011). More problematically, Sprouse (2011) has demonstrated that subjects’ ratio judgments fail to satisfy commutativity, a condition that an appropriate use of magnitude estimation requires, namely p*(q*X) = q*(p*X), where X is the standard and p and q are multiples relating the standard to other stimuli. Sprouse (2011: 285) speculates that “acceptability judgments may not have a true zero point representing the absence of all acceptability in the way that physical stimuli such as loudness have a true zero point representing the absence of all sound.” Thus magnitude estimation seems inapplicable to acceptability judgments in linguistics. Experimentalists who argue for the relevance of gradience for grammaticality will have to find ways to ameliorate these problems, provide arguments for why gradience may be used despite these shortcomings, or develop new, more appropriate methods for detecting gradience. Another problem in the works of some proponents of XSyn is the relation between acceptability judgments and grammaticality.
Featherston (2007), for example, distinguishes between three types of grammaticality judgments: judgments of “perceived well-formedness” (essentially, subjects’ acceptability judgments using magnitude estimations), “traditional binary grammaticality judgment[s]” (which he considers to reflect relative frequency in language use), and judgments according to some theoretical notion of grammaticality, which “is dependent on linguistic knowledge and related to particular assumptions about what a structure should be like” (Featherston 2007: 294–5, emphasis added). Featherston restricts his analysis to the first type of judgment. It is, however, not at all clear that judgments of the first type really are informative of grammar rather than just of acceptability. If they are just acceptability judgments, then results showing that subjects use graded judgments do not directly challenge traditional categorical grammars. In fact it is not obvious that this question can be settled empirically. Featherston suggests that one should heuristically exclude from grammar building only those acceptability judgments “which can be accounted for by known performance or processing factors” (Featherston 2007: 312; see also Keller 2000 and Schütze 1996).⁶ Yet Featherston’s suggestion would implausibly enlarge the set of grammaticality judgments, since the underlying psychological mechanisms for acceptability judgments (which do not reflect grammaticality) are often ill understood. Waiting for psychology to sort out these mechanisms would bring linguistics to a grinding halt. It thus seems indispensable to the practice of linguistics to use theoretical linguistic considerations in order to disambiguate acceptability from grammaticality.

For theoretical considerations to be used successfully in assessing acceptability judgments, they must have some normative force. Consider again HORSE: even though the sentence appears unacceptable, it is widely regarded as grammatical. In order to make grammaticality judgments that “correct” acceptability judgments, one needs normative grammatical theories: sentence X ought to be grammatical by the lights of well-confirmed grammatical theory T, in spite of X’s appearing unacceptable. But it is not clear that theories of grammatical gradience of such normative force exist. On the contrary, the theories of grammatical gradience that have hitherto been developed seem to lack it, since they take as input unfiltered acceptability judgments.⁷ Although the use of theoretical constraints in the assessment of the data can of course be problematic (see section 14.3.2), it need not be. There are situations in which theoretical bias can be methodologically positive, namely when the theory generating the bias is a theory that has independent empirical support. For example, when it was reported a few years ago that neutrinos had been measured traveling faster than light, it was prudent for physicists to exercise skepticism toward this result, given that it contradicted one of the best-confirmed theories in modern physics: Einstein’s special theory of relativity (Schindler 2013).

⁶ See Keller (2000: 29): “Given that no systematic performance explanation for gradience is available, we will work on the assumption that gradience is best analysed in terms of linguistic competence.” Schütze (1996: 6ff.) also recommends first getting a better grasp of performance factors, by building models of them, before using acceptability judgments in the construction of grammars.
⁷ Most theories of grammatical gradience are based on versions of optimality theory, in which a “competition” between candidate structures selects one candidate as optimal or grammatical when it best satisfies multiple grammatical constraints. In its most advanced form, namely linear optimality theory, this approach comes with a learning algorithm which estimates weights for the grammatical constraints it takes as input. These weights are computed on the basis of training sets that contain candidate structures associated with a grammaticality score. The algorithm then determines an optimal set of constraint weights for a given training set (see Keller 2000 and Sorace and Keller 2005 for details). The weights ideally represent grammatical gradience. The problem, though, is that, if the grammaticality scores used in the determination of the weights come from acceptability judgments (as they do in XSyn experiments), then gradience in the form of weights of grammatical constraints may reflect merely extra-grammatical processing constraints. Again, additional arguments are required for concluding that the gradience found in judgments actually reflects degrees of grammaticality. It should also be noted that there is a risk of “overfitting” the weights of the constraints to the training set used, so that the grammaticality models become poor predictors of “unseen” or new data, as appreciated by Keller (2000: 272).

Proponents of XSyn are often motivated by the idea of obtaining more theory-neutral, empirically driven ways of discerning grammaticality from acceptability. But it is questionable whether this can be done only from the bottom up, so to
speak, by first determining performance factors in order to determine grammaticality, as XSyn proponents have suggested. Already Chomsky (1965) expressed doubt that there could ever be any “operational criteria” for determining grammaticality and believed that sometimes ambiguous cases should be disambiguated by the accepted grammars (see also Schütze 1996: 22–3). In fact there are reasons to think that a “theory-neutral” and “bottom-up” approach, which determines performance factors before tackling the question of grammaticality, may set the bar too high: as we shall see in the next section, not even the model science of physics can be said to satisfy the conditions for such an approach. It is also true that, even when acceptability judgments are gathered in accordance with the principles of experimental syntax (non-linguists, large numbers), the results will still have to be ultimately analyzed by linguists. They are the ones who will have to pull apart grammaticality from acceptability data. Hence we should have no illusions about the possibility of eliminating the theoretical bias entirely with the use of experimental methods.
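
The commutativity condition on magnitude estimation discussed in this section (applying the multiples p and q to the standard X in either order should yield the same result) can be made concrete with a small sketch. The Python code below is purely illustrative and is not taken from Sprouse (2011): the two "judges" (`ratio_judge` and `biased_judge`), the bias value, and all numbers are hypothetical assumptions, chosen only to show how a constant additive distortion in a subject's responses would break commutativity, whereas an ideal ratio-scale response would not.

```python
from typing import Callable

# A "judge" maps (stimulus magnitude, requested multiple) to a response.
# Both judges below are hypothetical and exist only for illustration.
Judge = Callable[[float, float], float]


def ratio_judge(stimulus: float, multiple: float) -> float:
    """Idealized subject whose responses form a true ratio scale."""
    return multiple * stimulus


def biased_judge(stimulus: float, multiple: float) -> float:
    """Subject whose responses carry a constant additive bias."""
    bias = 10.0  # assumed value, not an empirical estimate
    return multiple * stimulus + bias


def commutativity_gap(judge: Judge, standard: float, p: float, q: float) -> float:
    """Absolute difference between p*(q*X) and q*(p*X) as produced by `judge`."""
    p_after_q = judge(judge(standard, q), p)
    q_after_p = judge(judge(standard, p), q)
    return abs(p_after_q - q_after_p)


if __name__ == "__main__":
    for name, judge in [("ratio", ratio_judge), ("biased", biased_judge)]:
        gap = commutativity_gap(judge, standard=100.0, p=2.0, q=3.0)
        print(f"{name} judge: gap = {gap}")
```

A gap of zero is what a ratio scale predicts. A nonzero gap for unequal p and q is the kind of pattern that would count against treating the responses as genuine ratio judgments.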

14.3.4 Is XSyn more scientific?

Even though XSyn’s call for systematic and controlled experiments seems prima facie much more scientific than the more informal ways of obtaining acceptability judgments, we have seen here that there are significant complications to such an approach. Not only is it more difficult to instruct lay subjects to generate acceptability judgments, but acceptability data will also have to be cleared of performance factors. Such considerations must be weighed against any possible advantage in guarding against theoretical bias, a threat on which XSyn has put much focus. Although XSyn proponents have emphasized the detection of gradience as a selling point of experimental methods, it is not at all clear how gradience (which undoubtedly exists) is best measured in a reliable way, as we saw in section 14.3.3. Lastly, it has been shown that informal methods produce judgments that seem highly representative of judgments obtained from ordinary speakers. In sum, the case that XSyn would be more scientific and would provide better data than informal methods seems weak.

But are informal methods scientific in the first place? The fact that linguists’ acceptability judgments are representative may suggest so (see section 14.3.1). Still, one may ask again why linguists’ judgments constitute reliable evidence for theories of grammars. One reason is surely a pedestrian one: the vetting that acceptability judgments receive at conference and seminar presentations and from reviewers in journals (Phillips 2009). But there are other aspects of using linguistic intuitions as evidence, and these aspects deserve to be highlighted in a discussion of scientificity. Chomsky himself has often compared the building of grammars on the basis of acceptability and grammaticality judgments to the scientific method introduced
by Galileo (Chomsky 1980; Chomsky and Saporta 1978). In his review of Chomsky’s talk about the “Galilean style” of linguistics, Botha (1982) identifies three elements: the construction of abstract models of grammar, the mathematical (or formal) nature of models, and the belief that abstract models have a higher degree of reality than “the ordinary world of sensation.” The last of these features is expressed through what Botha calls “epistemological tolerance” and Chomsky has described as “a willingness to set aside apparently refuting evidence” and “a readiness to tolerate unexplained phenomena or even as yet unexplained counterevidence” (Chomsky 1980: 9–10). Chomsky himself evokes Galileo’s struggle to prove that Earth is a moving planet (on the basis of a number of astronomical observations) at a time when no plausible theory for terrestrial physics on a moving Earth was yet available (Chomsky and Saporta 1978).⁸ There are several examples in which Chomsky and other linguists have used epistemological tolerance (Botha 1982; Riemer 2009; Behme 2013), but one important application is, arguably, acceptability judgments: although all acceptability judgments, potentially, are evidence of grammars, an acceptability judgment might say more about certain performance factors than about the grammar itself. Acceptability judgments that conflict with (theoretically driven) grammaticality judgments are therefore to be treated with caution. Again, the line one has to tread here is a thin one, but any theory development requires “breathing space” from evidence that only appears to refute it, but doesn’t actually do so (Lakatos 1970; Botha 1982; Feyerabend 1975). Sometimes one may even lack full explanations of why not all relevant phenomena can be accommodated by one’s theory. 
Galileo and Newton, for example, pioneered our modern understanding of physics by discovering fundamental principles of motion without having a theory of “fudge factors” such as friction and air resistance. Instead, the difference between the predictions derived from those principles and the data remained theoretically unaccounted for (see Koertge 1977). It would have been a tremendous loss to science had their theories been dismissed because they could not fully accommodate the observable phenomena (including the fudge factors). Even the fundamental principles of more recent theories in physics often serve primarily explanatory purposes, and need to be amended with “phenomenological” corrections in order to fully capture the phenomena (see Cartwright 1983). It seems ill advised to set higher standards for linguistics in the discovery of grammatical principles and to demand that linguists first present fully fledged theories of memory limitations and the like before deciding on the grammaticality status of a sentence.

⁸ For further examples, see chapter 6 in Schindler (2018).

Despite these reservations, there are undoubtedly further, seemingly more scientific benefits of XSyn. Some proponents of experimental methods argue
that formal–quantitative methods, in addition to generating reliable data, also provide us with information about the data. Ideally this meta-information allows us to infer whether a specific result is trustworthy or not. For example, Myers points out that formally collected quantitative results are usually reported with a measure of statistical significance, which, “in turn, is related to the probability of future replications” (Myers 2009: 409). That is, the result of a formal judgment experiment comes with a measure of the experiment’s reliability. This is not the case for informal judgment experiments. Yet such measures have limitations. In particular, they are, first and foremost, about acceptability judgments. That is, even with a perfectly reliable experiment and very stable and repeatable data, theoretical reasons can weigh in so strongly that the produced data have no bearing whatsoever on the grammatical theories in question (as for example in HORSE). The produced data would then be data about parsing, memory limitations, and the like. Another dimension that experimental methods can reveal and traditional informal methods cannot is effect size. Sprouse and Almeida (2013) show that most of the phenomena in the samples collected in their textbook study (Sprouse and Almeida 2012a) have very large effect sizes. Sprouse and Almeida suggest that this fact would legitimize the use of informal methods in syntax in general, as they take the phenomena they investigate to be highly representative of the ones in focus within syntactic research. Yet Gibson et al. (2013) deny this. Instead, they claim that “cutting edge” and “forefront” syntactic research often debates more exotic phenomena, not covered by the examples investigated by Sprouse and Almeida.⁹ They are not satisfied with the idea that Sprouse and Almeida have established once and for all that the effect sizes studied by linguists are big. 
They therefore demand that effect sizes always be checked experimentally, and not just estimated from the armchair. Even if a large majority of phenomena studied in syntax have large effect sizes, they argue, this does not help the researcher who investigates a new phenomenon. Only once a quantitative study has been done for a particular phenomenon can one confidently make assertions about the effect and sample sizes for that very phenomenon.

⁹ But cf. Phillips (2009), who has argued that in the vast majority of cases it is not the phenomena that are controversial, but rather their interpretations (as being grammatical or not).

Another potential benefit of using formal methods is that they constitute more regimented procedures for collecting and analyzing data. For instance, it is not standard practice for those who use the informal method to report how many colleagues they consulted for their judgments, how many agreed or disagreed with the judgment presented by the author, which lexicalizations they considered, whether they discarded any particular lexicalizations, and
why. Myers (2009) argues that this kind of transparency about the procedure that lies behind results produced with the informal method would allow readers to judge the trustworthiness of those results more easily (this is especially relevant in cases where the reader is not a native speaker of the language the judgments are about). Others point out that a scientific test should first and foremost give valuable insights, not necessarily be maximally methodologically rigorous or objective according to some standard (Grewendorf 2007). However, methodological rigor and insight can go hand in hand. As argued earlier, experimental methods can offer insights about the data. Whether those insights are worth the added effort might vary with how controversial the judgments are, as well as with the potential grammatical importance of the phenomenon in question.
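
To make the notion of effect size in this discussion more concrete, the following Python sketch computes a standardized mean difference (Cohen's d with a pooled standard deviation) for two sets of acceptability ratings. The ratings are invented for illustration and do not come from Sprouse and Almeida's studies or from any other experiment; Cohen's d is used here only as one common effect-size measure, and we do not claim it is the measure those authors report.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Invented 7-point ratings for two hypothetical sentence types.
RATINGS_ACCEPTABLE = [6.0, 7.0, 6.0, 5.0, 7.0, 6.0]
RATINGS_UNACCEPTABLE = [2.0, 3.0, 2.0, 1.0, 3.0, 2.0]


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference between two samples (pooled SD)."""
    n_a, n_b = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (
        n_a + n_b - 2
    )
    return (mean(a) - mean(b)) / sqrt(pooled_var)


if __name__ == "__main__":
    d = cohens_d(RATINGS_ACCEPTABLE, RATINGS_UNACCEPTABLE)
    print(f"Cohen's d for the invented ratings: {d:.2f}")
```

A value of d well above the conventional "large" threshold of 0.8 is the sort of result that would make a contrast detectable even with few informants. Smaller values are where formal sampling would plausibly matter more.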

14.4 Experimental philosophy: Common motivations

The practice of using one’s intuitions as evidence for theorizing can also be found in philosophy. In the so-called method of cases, philosophers consider hypothetical scenarios, make judgments about these scenarios, and draw from them conclusions for their theories about the mind, language, knowledge, ethics, and so on. These scenarios are often simply referred to as “cases” or thought experiments. Moreover, just like linguists, philosophers have started to systematically test the intuitions of non-philosophers (“the folk”) (Weinberg et al. 2001; Machery et al. 2004).

As an example of this method, consider Gödel cases. In these cases it is imagined that it wasn’t Kurt Gödel who proved the incompleteness theorem of arithmetic but rather a man called Schmidt. Schmidt died under mysterious circumstances and Gödel somehow got hold of the proof and successfully claimed credit for it. The common intuition is that the name “Gödel” refers to the person who got hold of the proof and successfully claimed credit for it, not to the person to whom the description “discovered the proof” actually applies, namely Schmidt. Such cases were first brought up by Kripke (1980) in his famous Naming and Necessity, which is widely viewed as making a convincing case for a “causal” theory of reference.

The method of cases is used in many areas of philosophy, for instance in the philosophy of mind, in epistemology, in moral philosophy, in philosophy of language, and in metaphysics. Famous thought experiments include Gettier cases, Mary’s Room, Searle’s Chinese Room, fake barn cases, trolley cases, split brain cases, and more (Brown and Fehige 2017). Because judgments made in cases such as these appear to be fairly immediate, they are often referred to as “intuitions,” although several philosophers prefer not to describe them in terms of their subjective phenomenology (Williamson 2007b, 2011; Machery 2017).
We shall here adopt Machery’s minimalist conception of intuitive
judgments as simply “case judgments,” that is, judgments made in cases (Machery 2017).¹⁰ Experimental philosophers, just as experimental syntacticians, have criticized philosophers’ reliance on their own intuitions and the tendency to ignore the intuitions of the folk. Several of the motivations of experimental philosophers mirror those of the experimental syntacticians. Machery and Stich (2012), for example, emphasize the risk of theoretical bias in philosophers’ practice of using their own intuitions to assess theories of linguistic reference in the Gödel cases; they are concerned that philosophers of the Kripkean persuasion—that is, the majority of analytic philosophers—are biased toward the kind of judgment that is consistent with the causal theory of reference. In one of the first XPhi studies, Machery et al. (2004) presented evidence that, in Gödel cases, Chinese undergraduate students (in an unspecified field) have intuitions that align to a surprising degree with descriptivist theories of reference (in the Gödel case, this would be tantamount to saying that “Gödel” refers to the person who actually discovered the proof, namely Schmidt). Machery and Stich (2012) see this study as demonstrating that philosophers cannot blindly rely on their own intuitions when constructing theories of reference; and they cite it as a good example of the kind of method that philosophers should follow. They conclude by formulating a dilemma for philosophers of language: either ordinary speakers’ intuitions matter for theories of reference or they do not. In the former case, theories of reference must be “substantially modified to accommodate the variation in reference determination” (Machery and Stich 2012: 509). In the latter case, they claim, philosophers’ intuitions are not sufficient to justify the assumption that “proper names have a semantic reference” rather than just “speaker’s reference,” that is, reference intended by the speaker in communication. 
In support of this claim, they cite Chomsky and other linguists and philosophers of language who have expressed skepticism about the existence of semantic reference (see Chomsky 2000). They also point out that the intuitions of philosophers, in contrast to the intuitions of experts in other fields, are not externally validated (e.g. in the way a doctor’s intuitive judgment that the collarbone is broken can be externally validated by an x-ray).

¹⁰ There have been many attempts to justify the reliability of case judgments theoretically: as a priori necessary judgments (Bealer 1998), as possibility judgments (Malmgren 2011), as counterfactual judgments (Williamson 2007b), or as quasi-perceptual mental states (Bengson 2015; Chudnoff 2011, 2013b). Unlike in linguistics, however, in philosophical reasoning there is no agreed upon or even widely shared view of why the use of case judgments as evidence might be justified. Some philosophers have even denied that intuitions play any significant role in the practice of philosophy (Cappelen 2012).

Machery and Stich explicitly appeal to XSyn to support their conclusion and to motivate the XPhi approach more generally. Apart from the risk of theoretical bias, whose avoidance is also a central motivation of experimental syntacticians, Machery and Stich mention the risk that the traditional method of cases ignores
the “diversity of intuitions”; and they compare this risk to the one of ignoring “dialectical variation” in linguistic intuitions among non-linguists. In philosophy, one could be interested for example in possible differences among the intuitions of subjects from different cultural backgrounds, among the intuitions of men and women, and so on. Such differences are arguably better investigated by experimental means, and not from the armchair. However, Machery and Stich’s analogy between possible differences of this kind and dialectal variation in syntax does not seem entirely apt. First, in contrast to the traditional practice in philosophy, syntacticians have investigated idiolects with traditional, informal methods. Second, contrary to how Machery and Stich make it sound, detecting dialectal variation has not played a major role in XSyn. In fact one of the critiques launched against experimental syntacticians by the traditionalists is that using large samples of speakers and averaging across the results puts one at risk of overlooking individual variation (Den Dikken et al. 2007; Fanselow 2007). Although dialectal variation is not a concern of current XSyn, there is another motivation that drives proponents of XSyn, and it seems exploitable for Machery and Stich’s purposes, at least in principle. This motivation concerns gradience in acceptability judgments. The detection of gradience in the judgments of ordinary speakers, just like the detection of diversity of philosophical intuitions, requires the systematic investigation of larger samples of judgments. Finally, the proponents of XPhi, just like the champions of XSyn, have claimed that the methods of XPhi are more scientific than the informal ones, and therefore also more reliable.
Alexander and Weinberg, for example, write that “philosophers need to continue to improve the methods used to study philosophical cognition, combining survey methods with more advanced statistical methods and analyses, and supplementing survey methods with a wider variety of methods from the social and cognitive sciences” and that “too many questions pertinent to evaluating the trustworthiness of epistemic intuitions can only be addressed properly with some substantial reliance on scientific methods” (Alexander and Weinberg 2014: 138 and 141, emphasis added). In sum, the four motivations driving XSyn can also be found in XPhi: greater reliability, less theoretical bias, higher sensitivity and richer data, and overall a better, more scientific methodology. Already in section 14.3 we saw that none of the motivations of XSyn would justify an all-round claim to superior methodology of XSyn. In what follows we will try to draw the appropriate lessons for XPhi by assessing the analogies that have been made between XPhi and XSyn.

14.5 Lessons for XPhi from XSyn

Machery and Stich (2012: 495) have argued that “philosophers should emulate linguists, who are increasingly replacing the traditional informal reliance on their
own and their colleagues’ intuitions with systematic experimental study of ordinary speakers’ intuitions.” In the face of the more accurate picture of the controversial value of experimental methods in syntax we presented here, XSyn cannot serve as a model for the promotion of XPhi, at least not unconditionally.

14.5.1 Better reliability of data gathered by XPhi?

Unlike in linguistics, where the debate has centered on the question of whether formal methods are more reliable than informal ones, in philosophy the debate has taken a rather different shape. Although XPhi practitioners assume that their method is more reliable and constitutes a more scientific way of collecting case judgments, they do not claim that the data obtained by these methods are more reliable than the case judgments of philosophers. On the contrary, they have argued on the basis of the results obtained by XPhi methods (mostly with subjects without philosophical training) that case judgments per se are unreliable (Alexander and Weinberg 2007, 2014; Machery 2017). More specifically, they have argued that intuitive judgments found in the folk vary with factors that seem extraneous or irrelevant to the task at hand, such as demographic variables (e.g. culture, gender) or presentation effects (e.g. order of presentation) (see Machery 2017). Several XPhi proponents have argued that case judgments should be used only with extreme caution—if at all (Weinberg et al. 2001; Alexander and Weinberg 2007, 2014; Machery 2017).

Proponents of the traditional method have questioned XPhi’s skeptical conclusions by criticizing the use of subjects they deem unqualified, namely subjects without philosophical training. Proponents of this “expertise defense,” as it has come to be known, have demanded that experimental philosophers provide proof that not only the folk, but also professional philosophers are subject to such extraneous effects (Hales 2006; Williamson 2007b, 2011; Ludwig 2007; Devitt 2006c, 2011a; Horvath 2010). XPhi proponents, in turn, have questioned the very idea that philosophers possess expertise for making case judgments and argued that philosophers’ judgments would be subject to extraneous factors, even if philosophers did possess an expertise for making them (Weinberg et al. 2010; Machery 2017).
Experimental philosophers have also asked why philosophers rather than the folk should be trusted in cases of disagreement (Machery 2017). Machery et al. (2004) believe that the expertise defense “smacks of narcissism in the extreme.”¹¹

¹¹ Although the equivalent of the expertise defense in linguistics (namely that professional linguists are better subjects for making acceptability judgments than ordinary speakers are) has not been entertained by anybody in the debate about XSyn, it has been mentioned occasionally as a possible position (Gibson and Fedorenko 2013; Sprouse 2015).

In attempts to break this stalemate, experimental philosophers have started to design experiments with professional philosophers as subjects. These experiments have presented evidence that even professional philosophers are subject to extraneous effects (Schwitzgebel and Cushman 2012, 2015; Tobia et al. 2013; Schulz et al. 2011). But, because these experiments have been carried out mostly in the field of moral philosophy and because intuitive case judgments may not form a single kind, but a motley bunch, and may be reliable to different degrees (Nado 2014), the possibility that philosophers are better judges of cases in other areas of philosophy remains a live option. Lastly, in support of traditional judgments in epistemology, and particularly in Gettier cases, it has been shown experimentally that folk intuitions actually do converge with those of philosophers when the folk are guided so as to comprehend the different steps involved in making a case judgment (Nagel et al. 2013; Turri 2013).¹² This would seem to suggest that the reasons for previous results that indicate non-standard responses in the folk may have to be sought in extraneous factors such as an imperfect understanding of the vignette.

14.5.2 Better validity of XPhi?

Is the risk of theoretical bias a good reason for adopting experimental methods in philosophy? As in the case of XSyn, we believe that there is a risk not only of theoretical bias in the “traditional” method, but also of performance errors when laypeople are used. That is, just as in XSyn, there is a risk that naïve subjects would not understand the task instructions properly or would make mistakes that result from not having frequent exposure to philosophical thought experiments. The problem of the speaker’s reference (discussed by Machery and Stich 2012) is just one example: instead of understanding that the task is about the (fixed) reference of proper names, subjects might take the task to be about the person they intend to refer to. But, as we mentioned in section 14.5.1, experimental philosophers have identified a whole range of extraneous variables according to which intuitive judgments of the folk vary. Again, even though some studies show that philosophers’ intuitions vary in this problematic way as well, there is still a lot of work to be done to show that philosophical case judgments are unreliable in general, for both the folk and the philosophers.

¹² Even outright critics of the armchair method have experimentally confirmed standard philosophical judgments in some cases (Machery et al. 2017).

As we explained earlier, we do not believe that the risk of theoretical bias is obviously more severe than the risk of performance errors. On the contrary, it seems that theoretical bias is more easily controlled for within the standard practice of philosophers. Philosophers are known for holding often radically
different views and for defending them ferociously. It thus seems unlikely that their theories would bias their judgments in such a way that these judgments would converge because of the theories philosophers hold. Indeed, we think that something analogous to the case of syntax holds: philosophers tend to agree on many judgments in thought experiments despite the fact that there is such a diversity of philosophical views. For example, there is a plethora of views regarding the possibility of strong artificial intelligence in Searle’s famous Chinese Room argument.¹³ Yet all parties to the debate seem to agree with the judgment that the man in the room does not understand Chinese, even though they might disagree about whether the entire system of the room, or all variations of the room, exhibit understanding. Although there exists no systematic survey among philosophers that would show that there is a consensus in the judgments of important thought experiments,¹⁴ even critics of the “method of cases” concede that there are very robust case judgments, which are widely shared in the philosophical community (Machery 2017).¹⁵ In sum, just as in linguistics, in philosophy, too, performance errors must be a concern for those who advocate the use of non-philosophers, in their desire to forgo the risk of theoretical bias when using experts. And theoretical bias seems no great actual risk, given that philosophers with radically different views often share the same intuitions.

¹³ In Searle’s Chinese Room “argument,” it is imagined that a man without any knowledge of the Chinese language sits in a closed room and receives strings of Chinese symbols from one end and outputs other strings of Chinese symbols at the other end, on the basis of symbol manipulations detailed in a big rule book of Chinese. Even though the outputs would appear to be perfectly well formed (and meaningful) to a Chinese speaker located outside the room, the man in the room does not understand Chinese. Searle used this thought experiment to argue against Strong AI: although machines may be capable of perfectly simulating (linguistic) intelligence through syntactical manipulations, they lack the semantics required for true intelligence.
¹⁴ See, however, the recent survey by one of us (Schindler and Saint-Germier n.d.).
¹⁵ The survey by Bourget and Chalmers (2014) provides data for philosophers’ judgments on some thought experiments, including the Zombie argument and trolley cases. These seem to be less stable.

14.5.3 Richer data through XPhi?

Although Machery and Stich (and, earlier, Machery et al. 2004) advertise the detection of cultural differences as good reason for adopting experimental methods, as mentioned, much of the XPhi proponents’ work has followed a different agenda. This agenda is largely “negative,” in the sense that XPhi proponents have used their experiments to argue for the unreliability of case judgments. Thus, many of the effects detected by XPhi seem to be of a kind that is generally deemed irrelevant for philosophical purposes, by proponents of experimental and traditional methods alike (Alexander and Weinberg 2007). In this sense, the effects in question are akin to gradience in linguistic acceptability
judgments that are not generally agreed to be of relevance to theories of grammar (although some proponents of XSyn disagree). But there is also work in a more positive vein, which is interested in the psychological mechanisms that underlie case judgments (Knobe and Nichols 2008) and in better understanding the (confounding) factors that might contribute to our making judgments in a particular way (Mortensen and Nagel 2016; Alexander and Weinberg 2014). This use of experimental methods in philosophy, we think, may help improve the evidential base of philosophy and our understanding of intuitive judgments. The goal of this approach is not so much to replace the traditional armchair method as to understand how it can be put to use and when it is safe to apply it; and it shows how both formal and informal methods can be used alongside each other.

14.5.4 Is XPhi more scientific? Just as in linguistics, it seems that methods should be chosen for the purposes they serve best. Whether the best method is the traditional “armchair” method or XPhi may have no unequivocal answer. At the very least, the answer depends on the purpose that is being pursued. If the purpose is to obtain case judgments that are least subject to performance factors, then, just as in linguistics, armchair methods seem to be better suited (as the proponents of the expertise defense have argued). On the other hand, if one is interested in what the folk think (as Machery and colleagues have been arguing we should be), then XPhi does seem the method of choice.

14.6 Conclusion In this chapter we argued that, although XSyn and XPhi share many motivations (better reliability, better validity, richer data, more scientificity or objectivity), claims about the intrinsic superiority of these experimental approaches to collecting intuitions in these two fields are not justified. Although there is a case to be made for experimental methods being better suited for controlling theoretical bias, there are other errors (specifically, performance errors) for whose control the use of experts seems more advantageous. As we have suggested in section 14.3, this is prima facie compatible with the idea that in some cases experimental methods may be helpful in uncovering and countering some forms of performance error. Formal methods should thus not be used blindly, and any method should be assessed for what it can achieve for the purpose at hand.

Acknowledgments We would like to thank two anonymous readers and Anna Drożdżowicz for their input.

References Abrusán, Márta, and Kriszta Szendrői (2013). “Experimenting with the King of France: Topics, Verifiability, and Definite Descriptions.” Semantics and Pragmatics 6(10): 1–43. Adams, Marilyn Jager (1990). Beginning to Read: Thinking and Learning about Print. Cambridge, MA: MIT Press. Adger, David (2003). Core Syntax: A Minimalist Approach. Oxford: Oxford University Press. Adleberg, Toni, Morgan Thompson, and Eddy Nahmias (2014). “Do Men and Women Have Different Philosophical Intuitions? Further Data.” Philosophical Psychology 28(5): 615–41. doi: 10.1080/09515089.2013.878834. Adrián, José Antonio, Jesús Alegria, and José Morais (1995). “Metaphonological Abilities of Spanish Illiterate Adults.” International Journal of Psychology 30(3): 329–53. Alexander, Joshua, and Jonathan M. Weinberg (2007). “Analytic Epistemology and Experimental Philosophy.” Philosophy Compass 2(1): 56–80. Alexander, Joshua, and Jonathan M. Weinberg (2014). “The ‘Unreliability’ of Epistemic Intuitions,” in Edouard Machery and Elizabeth O’Neill (eds.), Current Controversies in Experimental Philosophy. New York: Routledge, 128–45. Alexopoulou, Theodora, and Frank Keller (2007). “Locality, Cyclicity, and Resumption: At the Interface between the Grammar and the Human Sentence Processor.” Language 83: 110–60. Anderson, John R. (1980). Cognitive Psychology and Its Implications. San Francisco, CA: W. H. Freeman. Armstrong, Sharon Lee, Lila R. Gleitman, and Henry Gleitman (1983). “What Some Concepts Might Not Be.” Cognition 13: 263–308. Arppe, Antti, and Juhani Järvikivi (2007). “Every Method Counts: Combining Corpus-Based and Experimental Evidence in the Study of Synonymy.” Corpus Linguistics and Linguistic Theory 3(2): 131–59. Bader, Markus, and Jana Häussler (2010). “Toward a Model of Grammaticality Judgments.” Journal of Linguistics 46(2): 273–330. Bard, Ellen Gurman, Dan Robertson, and Antonella Sorace (1996). “Magnitude Estimation of Linguistic Acceptability.” Language 72(1): 32–68.
Bayne, Tim (2009). “Perception and the Reach of Phenomenal Content.” Philosophical Quarterly, 59(236): 385–404. Bayne, Tim, and Michelle Montague (2011). “Cognitive Phenomenology: An Introduction,” in T. Bayne and M. Montague (eds.), Cognitive Phenomenology. Oxford: Oxford University Press, 1–34. Bealer, George (1998). “Intuition and the Autonomy of Philosophy,” in M. R. DePaul, and W. Ramsey (eds.), Rethinking Intuition: The Psychology of Intuition and Its Role in Philosophical Inquiry. Lanham, MD: Rowman & Littlefield, 201–39. Behme, Christina (2013). “Noam Chomsky: The Science of Language: Interviews with James McGilvray.” Philosophy in Review 33(2): 100–3. Bengson, John (2015). “The Intellectual Given.” Mind, 124(495): 707–60. Benz, Anton, and Nicholle Gotzner (2014). “Embedded Implicatures Revisited: Issues with the Truth-Value Judgment Paradigm,” in Judith Degen, Michael Franke and Noah

D. Goodman (eds.), Proceedings of the Formal and Experimental Pragmatics Workshop (ESSLLI). Tübingen: ESSLLI, 1–6. Bertelson, Paul, Beatrice de Gelder, Leda V. Tfouni, and José Morais (1989). “Metaphonological Abilities of Adult Illiterates: New Evidence of Heterogeneity.” European Journal of Cognitive Psychology 1(3): 239–50. Bever, Thomas G. (1970). “The Cognitive Basis for Linguistic Structures,” in R. Hayes (ed.), Cognition and Language Development. New York: Wiley & Sons, 277–360. Bhatt, Rajesh, and Roumyana Pancheva (2004). “Late Merger of Degree Clauses.” Linguistic Inquiry 35: 1–45. Biber, Douglas (1988). Variation across Speech and Writing. Cambridge: Cambridge University Press. Bloomfield, Leonard (1933). Language. New York: Holt. Bock, J. Kathryn (1986). “Syntactic Persistence in Language Production.” Cognitive Psychology 18: 355–87. Bond, Zinny (2008). “Slips of the Ear,” in D. Pisoni and R. Remez (eds.), Handbook of Speech Perception. Malden, MA: Blackwell, 290–310. BonJour, Laurence (1980). “Externalist Theories of Empirical Knowledge.” Midwest Studies in Philosophy 5(1): 53–74. Bošković, Željko, and Howard Lasnik (2003). “On the Distribution of Null Complementizers.” Linguistic Inquiry 34: 527–46. Botha, Rudolf P. (1982). “On ‘the Galilean Style’ of Linguistic Inquiry.” Lingua 58(1): 1–50. Bourget, David, and David J. Chalmers (2014). “What Do Philosophers Believe?” Philosophical Studies 170(3): 465–500. Branigan, Holly P., and Martin J. Pickering (2017). “An Experimental Approach to Linguistic Representation.” Behavioral and Brain Sciences 40. https://doi.org/10.1017/S0140525X16002028. Bresnan, Joan W. (2007). “Is Syntactic Knowledge Probabilistic? Experiments with the English Dative Alternation,” in Sam Featherston and Wolfgang Sternefeld (eds.), Roots: Linguistics in Search of Its Evidential Base. Berlin: Mouton de Gruyter, 75–96. Bresnan, Joan W., Anna Cueni, Tatiana Nikitina, and Harald Baayen (2007). “Predicting the Dative Alternation,” in G. Bouma, I. Kraemer, and J. Zwarts (eds.), Cognitive Foundations of Interpretation. Amsterdam: Royal Netherlands Academy of Science, 69–94. Brøcker, Karen (2019). Justifying the Evidential Use of Intuitive Judgements in Linguistics. PhD thesis, Aarhus University. Brogaard, Berit (2013). “Phenomenal Seemings and Sensible Dogmatism,” in C. Tucker (ed.), Seemings and Justification: New Essays on Dogmatism and Phenomenal Conservatism. New York: Oxford University Press, 270–89. Brogaard, Berit (2018). “In Defense of Hearing Meanings.” Synthese 195(7): 2967–83. Brouwer, Harm, Matthew W. Crocker, Noortje J. Venhuizen, and John C. J. Hoeks (2017). “A Neurocomputational Model of the N400 and the P600 in Language Processing.” Cognitive Science 41: 1318–52. Brown, James Robert, and Yiftach Fehige (2017). “Thought Experiments,” in Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/sum2017/entries/thought-experiment. Bruening, Benjamin (2010). “Ditransitive Asymmetries and a Theory of Idiom Formation.” Linguistic Inquiry 41: 519–62. Buckwalter, Wesley, and Stephen Stich (2014). “Gender and Philosophical Intuition,” in Joshua Knobe and Shaun Nichols (eds.), Experimental Philosophy, vol. 2. New York: Oxford University Press, 307–46.



Bybee, Joan L., and Joanne Scheibman (1999). “The Effect of Usage on Degrees of Constituency: The Reduction of don’t in English.” Linguistics 37: 575–96. [Reprinted in Bybee, Joan (2007). Frequency and the Organization of Language. Oxford: Oxford University Press, 294–312.] Cappelen, Herman (2012). Philosophy without Intuitions. Oxford: Oxford University Press. Carroll, John M., Thomas G. Bever, and Chava R. Pollack (1981). “The Non-Uniqueness of Linguistic Intuitions.” Language 57: 368–83. Carruthers, Peter (2006). The Architecture of the Mind. Oxford: Oxford University Press. Carston, Robyn (2010). “Metaphor: Ad hoc Concepts, Literal Meaning and Mental Images.” Proceedings of the Aristotelian Society 110: 295–321. Cartwright, Nancy (1983). How the Laws of Physics Lie. Oxford: Oxford University Press. Chesi, Cristiano, and Andrea Moro (2015). “The Subtle Dependency between Competence and Performance,” in Á. Gallego and D. Ott (eds.), 50 Years Later: Reflections on Chomsky’s Aspects. Cambridge, MA: MIT Working Papers in Linguistics, 33–45. Chomsky, Noam (1957). Syntactic Structures. The Hague: Mouton. Chomsky, Noam (1961). “Some Methodological Remarks on Generative Grammar.” Word 17: 219–39. [Reprinted as Chomsky, N. (1964). “Degrees of Grammaticalness,” in J. Fodor and J. J. Katz (eds.), The Structure of Language. Englewood Cliffs, NJ: Prentice Hall, 384–89.] Chomsky, Noam (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. Chomsky, Noam (1973). “Conditions on Transformations,” in Stephen Anderson and Paul Kiparsky (eds.), A Festschrift for Morris Halle. New York: Holt, Rinehart & Winston, 232–86. Chomsky, Noam (1975). The Logical Structure of Linguistic Theory. New York: Plenum. Chomsky, Noam (1980). Rules and Representations. New York: NYU Press. Chomsky, Noam (1981). Lectures on Government and Binding: The Pisa Lectures. Dordrecht: Foris Publications. Chomsky, Noam (1986a). Barriers. Cambridge, MA: MIT Press. Chomsky, Noam (1986b). Knowledge of Language: Its Nature, Origin, and Use. New York: Praeger. Chomsky, Noam (1991). “Linguistics and Adjacent Fields: A Personal View,” in A. Kasher (ed.), The Chomskyan Turn: Linguistics, Philosophy, Mathematics and Psychology. Oxford: Blackwell, 3–25. Chomsky, Noam (2000). New Horizons in the Study of Language and Mind. Cambridge: Cambridge University Press. Chomsky, Noam, and Morris Halle (1968). The Sound Pattern of English. Cambridge, MA: MIT Press. Chomsky, Noam, and Howard Lasnik (1977). “Filters and Control.” Linguistic Inquiry 8: 425–504. Chomsky, Noam, and George A. Miller (1963). “Introduction to the Formal Analysis of Natural Languages,” in R. D. Luce, R. R. Bush, and E. Galanter (eds.), Handbook of Mathematical Psychology, vol. 2. New York: Wiley, 269–321. Chomsky, Noam, and Sol Saporta (1978). “An Interview with Noam Chomsky,” in Noam Chomsky, Working Papers in Linguistics, vol. 4. Seattle: Department of Linguistics, University of Washington, 301–19. Christiansen, Morten H., and Maryellen C. MacDonald (2009). “A Usage-Based Approach to Recursion in Sentence Processing.” Language Learning 59: 126–61. Chudnoff, Elijah (2011). “The Nature of Intuitive Justification.” Philosophical Studies 153(2): 313–33.

Chudnoff, Elijah (2013a). Intuition. Oxford: Oxford University Press. Chudnoff, Elijah (2013b). “Intuitive Knowledge.” Philosophical Studies 162(2): 359–78. Clark, Herbert H., and Susan E. Haviland (1974). “Psychological Processes as Linguistic Explanation,” in David Cohen (ed.), Explaining Linguistic Phenomena. Washington, DC: Hemisphere Publishing Corporation, 91–124. Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences (2nd edn.). Hillsdale, NJ: Lawrence Erlbaum. Cohnitz, Daniel, and Jussi Haukioja (2015). “Intuitions in Philosophical Semantics.” Erkenntnis 80(3): 617–41. Collins, John (2004). “Faculty Disputes.” Mind & Language 19: 503–33. Collins, John (2006). “Between a Rock and a Hard Place: A Dialogue on the Philosophy and Methodology of Generative Linguistics.” Croatian Journal of Philosophy 6: 469–503. Collins, John (2008a). Chomsky: A Guide for the Perplexed. London: Continuum. Collins, John (2008b). “Knowledge of Language Redux.” Croatian Journal of Philosophy 8: 3–43. Collins, John (2017). “The Linguistic Status of Context Sensitivity,” in Bob Hale, Crispin Wright, and Alexander Miller (eds.), A Companion to the Philosophy of Language (2nd edn.). Oxford: Wiley Blackwell, 151–73. Collins, John (2020). Linguistic Pragmatism and Weather Reporting. Oxford: Oxford University Press. Costantini, Francesco (2010). “On Infinitives and Floating Quantification.” Linguistic Inquiry 41: 487–96. Coulson, Seana (2004). “Electrophysiology and Pragmatic Language Comprehension,” in Ira A. Noveck and Dan Sperber (eds.), Experimental Pragmatics. London: Palgrave Macmillan, 187–206. Cowart, Wayne (1997). Experimental Syntax: Applying Objective Methods to Sentence Judgments. Newbury Park, CA: SAGE. Cowart, Wayne, and Helen S. Cairns (1987). “Evidence for an Anaphoric Mechanism within Syntactic Processing: Some Reference Relations Defy Semantic and Pragmatic Constraints.” Memory & Cognition 15: 318–31. Culbertson, Jennifer, and Steven Gross (2009).
“Are Linguists Better Subjects?” British Journal for the Philosophy of Science 60(4): 721–36. Cushman, Fiery (2015). “Punishment in Humans: From Intuitions to Institutions.” Philosophy Compass 10: 117–33. Dąbrowska, Ewa (2010). “Naive vs. Expert Competence: An Empirical Study of Speaker Intuitions.” Linguistic Review 27: 1–23. Dahl, Östen (1979). “Is Linguistics Empirical?” in T. Perry (ed.), Evidence and Argumentation in Linguistics. Berlin: Walter de Gruyter, 133–45. De Villiers, P. A., and de Villiers, J. G. (1972). “Early Judgments of Semantic and Syntactic Acceptability by Children.” Journal of Psycholinguistic Research 1(4): 299–310. De Villiers, Jill G., and Peter A. De Villiers (1974). “Competence and Performance in Child Language: Are Children Really Competent to Judge?” Journal of Child Language 1(1): 11–22. Degen, Judith, and Goodman, Noah D. (2014). “Lost Your Marbles? The Puzzle of Dependent Measures in Experimental Pragmatics,” in Paul Bello, Marcello Guarini, Marjorie McShane and Brian Scassellati (eds.), Proceedings of the 36th Annual Conference of the Cognitive Science Society, vol. 1. Quebec: Cognitive Science Society, Inc., 397–402.



Den Dikken, Marcel, Judy B. Bernstein, Christina Tortora, and Raffaella Zanuttini (2007). “Data and Grammar: Means and Individuals.” Theoretical Linguistics 33(3): 335–52. DePaul, Michael, and William Ramsey, eds. (1998). Rethinking Intuition: The Psychology of Intuition and Its Role in Philosophical Inquiry. Lanham, MD: Rowman & Littlefield. Devitt, Michael (2006a). “Defending Ignorance of Language: Responses to the Dubrovnik papers.” Croatian Journal of Philosophy 6: 571–606. Devitt, Michael (2006b). Ignorance of Language. Oxford: Clarendon. Devitt, Michael (2006c). “Intuitions in Linguistics.” British Journal for the Philosophy of Science 57: 481–513. Devitt, Michael (2008a). “Explanation and Reality in Linguistics.” Croatian Journal of Philosophy 8(23): 203–31. Devitt, Michael (2008b). “Methodology in the Philosophy of Linguistics.” Australasian Journal of Philosophy 86: 671–84. Devitt, Michael (2010a). “Linguistic Intuitions Revisited.” British Journal for the Philosophy of Science 61(4): 833–65. Devitt, Michael (2010b). “What ‘Intuitions’ Are Linguistic Evidence?” Erkenntnis 73: 251–64. Devitt, Michael (2011a). “Experimental Semantics.” Philosophy and Phenomenological Research 82(2): 418–35. Devitt, Michael (2011b). “No Place for the a priori,” in M. Schaffer and M. Veber (eds.), What Place for the a priori? Chicago, IL: Open Court, 9–32. Devitt, Michael (2011c). “Whither Experimental Semantics?” Theoria 72 (2011): 5–36. Devitt, Michael (2012). “The Role of Intuitions in the Philosophy of Language,” in D. G. Fara and G. Russell (eds.), The Routledge Companion to the Philosophy of Language. New York: Routledge, 554–65. Devitt, Michael (2013). “Is There a Place for Truth-Conditional Pragmatics?” Teorema 32 (2): 85–102. Devitt, Michael (2014a). “Linguistic Intuitions: In Defense of ‘Ordinarism.’ ” European Journal of Analytic Philosophy 10(2): 7–20. Devitt, Michael (2014b). “Linguistic Intuitions Are Not the ‘Voice of Competence,’ ” in M. 
Haug (ed.), Philosophical Methodology: The Armchair or the Laboratory? London: Routledge, 268–93. Devitt, Michael (2015). “Testing Theories of Reference,” in Jussi Haukioja (ed.), Advances in Experimental Philosophy of Language. London: Bloomsbury Academic, 31–63. Devitt, Michael and Nicolas Porot (2018). “The Reference of Proper Names: Testing Usage and Intuitions.” Cognitive Science 42(5): 1552–85. doi: 10.1111/cogs.12609. Devitt, Michael, and Kim Sterelny (1987). Language and Reality: An Introduction to the Philosophy of Language. Cambridge, MA: MIT Press. Dillon, Brian, Andrew Nevins, Allison Austin, and Colin Phillips (2012). “Syntactic and Semantic Predictors of Tense in Hindi: An ERP Investigation.” Language and Cognitive Processes 27: 313–44. Dodd, Jordan (2014). “Realism and Anti-Realism about Experiences of Understanding.” Philosophical Studies 168(3): 745–67. Domaneschi, Filippo, Massimiliano Vignolo, and Simona Di Paola (2017). “Testing the Causal Theory of Reference.” Cognition 161: 1–9. Dorr, Cian, and John Hawthorne (2014). “Semantic Plasticity and Speech Reports.” Philosophical Review 123(3): 281–338. Dretske, Fred (1986). “Misrepresentation,” in R. Bogdan (ed.), Belief: Form, Content, and Function. New York: Oxford University Press, 17–36.

Drożdżowicz, Anna (2016). “Speakers’ Intuitions about Meaning Provide Empirical Evidence: Towards Experimental Pragmatics,” in Martin Hinton (ed.), Evidence, Experiment and Argument in Linguistics and Philosophy of Language. New York: Peter Lang, 65–90. Drożdżowicz, Anna (2018). “Speakers’ Intuitive Judgements about Meaning: The Voice of Performance View.” Review of Philosophy and Psychology 9(1): 177–95. Duhem, P. (1954 [1906]). The Aim and Structure of Physical Theory, trans. Philip P. Wiener. Princeton, NJ: Princeton University Press. Edelman, Shimon, and Morten H. Christiansen (2003). “How Seriously Should We Take Minimalist Syntax?” Trends in Cognitive Sciences 7: 60–1. Fanselow, Gisbert (2007). “Carrots—Perfect as Vegetables, but Please Not as a Main Dish.” Theoretical Linguistics 33(3): 353–67. Fanselow, Gisbert, Caroline Féry, Matthias Schlesewsky, and Ralf Vogel (eds.) (2006). Gradience in Grammar: Generative Perspectives. Oxford: Oxford University Press. Fanselow, Gisbert, and Stefan Frisch (2006). “Effects of Processing Difficulty on Judgments of Acceptability,” in G. Fanselow, C. Féry, M. Schlesewsky and R. Vogel (eds.), Gradience in Grammar: Generative Perspectives. Oxford: Oxford University Press, 291–316. Featherston, Sam (2005a). “The Decathlon Model of Empirical Syntax,” in Stephan Kepser and Marga Reis (eds.), Studies in Generative Grammar: Linguistic Evidence: Empirical, Theoretical and Computational Perspectives. Berlin: Mouton de Gruyter, 187–208. Featherston, Sam (2005b). “Universals and Grammaticality: Wh-Constraints in German and English.” Linguistics 43(4): 667–711. Featherston, Sam (2007). “Data in Generative Grammar: The Stick and the Carrot.” Theoretical Linguistics 33(3): 269–318. Featherston, Sam (2009). “A Scale for Measuring Well-Formedness: Why Linguistics Needs Boiling and Freezing Points,” in Sam Featherston and Susanne Winkler (eds.), The Fruits of Empirical Linguistics, vol. 1: Process. Berlin: Mouton de Gruyter, 47–74.
Featherston, Sam (2017). “Data and Interpretation: Why Better Judgements Are Important, but Better Theory Is Perhaps More Important.” Paper presented at the Linguistic Intuitions, Evidence, and Expertise workshop, October 25–7, 2017, Aarhus, Denmark. Featherstone, Cara, Catriona Morrison, Mitch Waterman, and Lucy MacGregor (2013). “Syntax, Semantics or Neither? A Case for Resolution in the Interpretation of N500 and P600 Responses to Harmonic Incongruities.” PLoS ONE 8: e76600. doi: 10.1371/journal.pone.0076600. Fernández, Eva M., and Helen S. Cairns (2011). Fundamentals of Psycholinguistics. Oxford: Wiley Blackwell. Ferreira, Fernanda (2005). “Psycholinguistics, Formal Grammars, and Cognitive Science.” Linguistic Review 22: 365–80. Ferreira, Fernanda, Kiel Christianson, and Andrew Hollingworth (2001). “Misinterpretation of Garden-Path Sentences: Implications for Models of Re-Analysis.” Journal of Psycholinguistic Research 30: 3–20. Feyerabend, Paul (1975). Against Method. London: Verso. Fiengo, Robert (2003). “Linguistic Intuitions.” Philosophical Forum 34(3/4): 253–66. Von Fintel, Kai (2004). “Would You Believe It? The King of France Is Back! (Presuppositions and Truth-Value Intuitions),” in M. Reimer and A. Bezuidenhout (eds.), Descriptions and Beyond. Oxford: Oxford University Press, 315–41.



Fischer, Eugen, and John Collins (2015). “Rationalism and Naturalism in the Age of Experimental Philosophy,” in Eugen Fischer and John Collins (eds.), Experimental Philosophy, Rationalism, and Naturalism: Rethinking the Philosophical Method. London: Routledge, 3–33. Fitzgerald, Gareth (2010). “Linguistic Intuitions.” British Journal for the Philosophy of Science 61(1): 123–60. Fodor, Janet (1978). “Parsing Strategies and Constraints on Transformations.” Linguistic Inquiry 9(3): 427–73. Fodor, Janet, and Ivan Sag (1982). “Referential and Quantificational Indefinites.” Linguistics and Philosophy 5: 355–98. Fodor, Jerry A. (1975). The Language of Thought. New York: Crowell. Fodor, Jerry A. (1983). The Modularity of Mind: An Essay on Faculty Psychology. Cambridge, MA: MIT Press. Fox, Barbara A., and Sandra A. Thompson (1990). “A Discourse Explanation of the Grammar of Relative Clauses in English Conversation.” Language 66: 297–316. Fox, Danny (2002). “Antecedent-Contained Deletion and the Copy Theory of Movement.” Linguistic Inquiry 33: 63–96. Frank, Stefan L., Thijs Trompenaars, and Shravan Vasishth (2016). “Cross-Linguistic Differences in Processing Double-Embedded Relative Clauses: Working-Memory Constraints or Language Statistics?” Cognitive Science 40: 554–78. Frazier, Lyn (2008). “Processing Ellipsis: A Processing Solution to the Undergeneration Problem?” in C. B. Chang and H. J. Haynie (eds.), Proceedings of the 26th West Coast Conference on Formal Linguistics. Somerville, MA: Cascadilla Proceedings Project, 21–32. Fricker, Elisabeth (2003). “Understanding and Knowledge of What Is Said,” in A. Barber (ed.), Epistemology of Language. Oxford: Oxford University Press, 25–66. Garnsey, Susan, Michael Tanenhaus, and Robert Chapman (1989). “Evoked Potentials in the Study of Sentence Comprehension.” Journal of Memory and Language 29: 181–200. Genone, James, and Tania Lombrozo (2012).
“Concept Possession, Experimental Semantics, and Hybrid Theories of Reference.” Philosophical Psychology 25: 717–42. Gerbrich, Hannah, Vivian Schreier, and Sam Featherston (2019). “Standard Items for English Judgement Studies: Syntax and Semantics,” in Sam Featherston, Robin Hörnig, Sophie von Wietersheim, and Susanne Winkler (eds.), Experiments in Focus: Information Structure and Processing. Berlin: Mouton de Gruyter, 305–28. Gerken, LouAnn, and Thomas G. Bever (1986). “Linguistic Intuitions Are the Result of Interactions between Perceptual Processes and Linguistic Universals.” Cognitive Science 10: 457–76. Geurts, Bart, and Nausicaa Pouscoulous (2009). “Embedded Implicatures?!?” Semantics and Pragmatics 2(4). http://dx.doi.org/10.3765/sp.2.4. Gibson, Edward (1991). A Computational Theory of Human Linguistic Processing: Memory Limitations and Processing Breakdown. PhD thesis, Carnegie Mellon University, Pittsburgh, PA. Gibson, Edward (1998). “Linguistic Complexity: Locality of Syntactic Dependencies.” Cognition 68(1): 1–76. Gibson, Edward, and Evelina Fedorenko (2010). “Weak Quantitative Standards in Linguistics Research.” Trends in Cognitive Sciences 14(6): 233–4. Gibson, Edward, and Evelina Fedorenko (2013). “The Need for Quantitative Methods in Syntax and Semantics Research.” Language and Cognitive Processes 28(1/2): 88–124.

Gibson, Edward, Steven T. Piantadosi, and Evelina Fedorenko (2013). “Quantitative Methods in Syntax/Semantics Research: A Response to Sprouse and Almeida (2013).” Language and Cognitive Processes 28(3): 229–40. Gibson, Edward, Steve Piantadosi, and Kristina Fedorenko (2011). “Using Mechanical Turk to Obtain and Analyze English Acceptability Judgments.” Language and Linguistics Compass 5: 509–24. Gibson, Edward, and James Thomas (1999). “Memory Limitations and Structural Forgetting: The Perception of Complex Ungrammatical Sentences as Grammatical.” Language and Cognitive Processes 14: 225–48. Gleitman, Henry, and Lila Gleitman (1979). “Language Use and Language Judgment,” in C. Fillmore, D. Kemler, and W. Wang (eds.), Individual Differences in Language Ability and Language Behavior. New York: Academic Press, 103–26. Goldberg, Sanford (2010). Relying on Others: An Essay in Epistemology. Oxford: Oxford University Press. Goldman, Alvin (2011). “Reliabilism,” in Edward N. Zalta (ed.), Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/spr2011/entries/reliabilism. [Article first published in 2008.] Goldrick, Matthew (2011). “Utilizing Psychological Realism to Advance Phonological Theory,” in J. Goldsmith, J. Riggle, and A. Yu (eds.), Handbook of Phonological Theory (2nd edn.). Oxford: Wiley Blackwell, 631–60. Gombert, Jean (1994). “How Do Illiterate Adults React to Metalinguistic Training?” Annals of Dyslexia 44: 250–69. Gopnik, Alison, and Eric Schwitzgebel (1998). “Whose Concepts Are They Anyway? The Role of Philosophical Intuition in Empirical Psychology,” in M. R. DePaul and W. Ramsey (eds.), Rethinking Intuition: The Psychology of Intuition and Its Role in Philosophical Inquiry. Oxford: Rowman & Littlefield, 75–91. Gopnik, Alison, and Henry M. Wellman (1992). “Why the Child’s Theory of Mind Really Is a Theory.” Mind and Language 7: 145–71. Gordon, Peter, and Randall Hendrick (1997). 
“Intuitive Knowledge of Linguistic Co-Reference.” Cognition 62: 325–70. Greenbaum, Sidney (1976). “Contextual Influences on Acceptability Judgments.” International Journal of Psycholinguistics 6: 5–11. Gregory, Emma, Michael McCloskey, Zoe Ovans, and Barbara Landau (2016). “Declarative Memory and Skill-Related Knowledge: Evidence from a Case Study of Amnesia and Implications for Theories of Memory.” Cognitive Neuropsychology 33: 220–40. Grewendorf, Günther (2007). “Empirical Evidence and Theoretical Reasoning in Generative Grammar.” Theoretical Linguistics 33(3): 383. Grice, H. Paul (1989). Studies in the Way of Words. Cambridge, MA: Harvard University Press. Gross, Steven, and Jennifer Culbertson (2011). “Revisited Linguistic Intuitions.” British Journal for the Philosophy of Science 62(3): 639–56. Hachmann, Wibke, Lars Konieczny, and Daniel Müller (2009). “Individual Differences in the Processing of Complex Sentences,” in Niels Taatgen and Hedderik van Rijn (eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society, 309–14. Haddican, Bill (2007). “The Structural Deficiency of Verbal Pro-Forms.” Linguistic Inquiry 38: 539–47. Haegeman, Liliane (1994). Introduction to Government and Binding Theory (2nd edn.). Oxford: Blackwell.



Hahne, Anja, and Angela D. Friederici (1999). “Electrophysiological Evidence for Two Steps in Syntactic Analysis: Early Automatic and Late Controlled Processes.” Journal of Cognitive Neuroscience 11: 194–205. Haider, Hubert (2007). “As a Matter of Facts: Comments on Featherston’s Sticks and Carrots.” Theoretical Linguistics 33: 381–95. Hakes, David (1980). The Development of Metalinguistic Abilities in Children. New York: Springer-Verlag. Hales, Steven D. (2006). Relativism and the Foundations of Philosophy. Cambridge, MA: MIT Press. Harris, Zelig (1954). “Distributional Structure.” Word 10: 775–93. Haug, Matthew, ed. (2014). Philosophical Methodology: The Armchair or the Laboratory? London: Routledge. Häussler, Jana, and Markus Bader (2015). “An Interference Account of the Missing-VP Effect.” Frontiers in Psychology 6. https://doi.org/10.3389/fpsyg.2015.00766. Häussler, Jana, and Tom Juzek (2016). “Detecting and Discouraging Non-Cooperative Behavior in Online Experiments Using an Acceptability Judgement Task,” in H. Christ, D. Klenovsak, L. Sönning and V. Werner (eds.), Methods and Linguistic Theories. Bamberg: University of Bamberg Press, 73–99. Häussler, Jana, and Tom Juzek (2017). “Hot Topics Surrounding Acceptability Judgement Tasks,” in Sam Featherston, Robin Hörnig, Reinhild Steinberg, Birgit Umbreit, and Jennifer Wallis (eds.), Proceedings of Linguistic Evidence 2016: Empirical, Theoretical, and Computational Perspectives. University of Tübingen. https://publikationen.uni-tuebingen.de/xmlui/handle/10900/77066. Häussler, Jana, Tom Juzek, and Thomas Wasow (2016). “To Be Grammatical or Not to Be Grammatical: Is That the Question? Evidence for Gradience.” Poster session presented at the Annual Meeting of the Linguistic Society of America, Washington, DC. Hazout, Ilan (2004). “The Syntax of Existential Constructions.” Linguistic Inquiry 35: 393–430. Hickok, Gregory (2012).
“Computational Neuroanatomy of Speech Production.” Nature Reviews Neuroscience 13(2): 135–45. Higginbotham, James (1989). “Knowledge of Reference,” in Alexander George (ed.), Reflections on Chomsky. Oxford: Blackwell, 153–74. Hofmeister, Philip, T. Florian Jaeger, Inbal Arnon, Ivan A. Sag, and Neal Snider (2013). “The Source Ambiguity Problem: Distinguishing Effects of Grammar and Processing on Acceptability Judgments.” Language and Cognitive Processes 28: 48–87. Hofmeister, Philip, and Ivan Sag (2010). “Cognitive Constraints and Island Effects.” Language 86(2): 366–415. Hofmeister, Philip, Laura Staum Casasanto, and Ivan A. Sag (2014). “Processing Effects in Linguistic Judgment Data: (Super-)Additivity and Reading Span Scores.” Language and Cognition 6(1): 111–45. Horvath, Joachim (2010). “How (Not) to React to Experimental Philosophy.” Philosophical Psychology 23(4): 447–80. Hudson, Richard (1996). “The Difficulty of (So-Called) Self-Embedded Structures.” UCL Working Papers in Linguistics 8: 283–314. Hunter, David (1998). “Understanding and Belief.” Philosophy and Phenomenological Research 58(3): 559–80. Jackendoff, Ray (1987). Consciousness and Computation. Cambridge, MA: MIT Press. Jutronić, Dunja (2012). “Je li jezična sposobnost izvor jezičnih intuicija?” [“Is Language Competence the Source of Linguistic Intuitions?”], in Snježana Prijić-Samaržija and

Petar Bojanić (eds.), Nenad Miščević: Sva lica filozofije [Nenad Miščević: All Faces of Philosophy]. Belgrade: Institut za filozofiju i društvenu teoriju, 129–43. Jutronić, Dunja (2014). “Which Are the Data That Competence Provides for Linguistic Intuitions?” European Journal of Analytic Philosophy 10(2): 119–43. Jutronić, Dunja (2018). “Intuitions Once Again! Object-Level vs. Meta-Level.” Croatian Journal of Philosophy 17(53): 283–91. Juzek, Tom S. (2016). Acceptability Judgement Tasks and Grammatical Theory. DPhil thesis, University of Oxford. Juzek, Tom S., and Jana Häussler (2019). “Semantic Influences on Syntactic Acceptability Ratings,” in A. Gattnar, R. Hörnig, M. Störzer and S. Featherston (eds.), Proceedings of Linguistic Evidence 2018. Tübingen: University of Tübingen. https://publikationen.uni-tuebingen.de/xmlui/handle/10900/87132. Karanth, Prathibha, and M. G. Suchitra (1993). “Literacy Acquisition and Grammaticality Judgments in Children,” in R. Scholes (ed.), Literacy and Language Analysis. New York: Routledge, 143–56. Karlsson, Fred (2007a). “Constraints on Multiple Center-Embedding of Clauses.” Journal of Linguistics 43(2): 365–92. Karlsson, Fred (2007b). “Constraints on Multiple Initial Embedding of Clauses.” International Journal of Corpus Linguistics 12: 107–18. Katz, Jerrold (1974). “Where Things Now Stand with the Analytic–Synthetic Distinction.” Synthese 28: 283–319. Katz, Jerrold (1981). Language and Other Abstract Objects. New York: Rowman & Littlefield. Katz, Jerrold (1984). “An Outline of Platonist Grammar,” in Thomas G. Bever, John M. Carroll, and Lance A. Miller (eds.), Talking Minds. Cambridge, MA: MIT Press, 17–48. Kayne, Richard (1981). “On Certain Differences between French and English.” Linguistic Inquiry 12: 349–71. Keenan, Edward L., and Bernard Comrie (1977). “Noun Phrase Accessibility and Universal Grammar.” Linguistic Inquiry 8: 63–99. Keller, Frank (2000).
Gradience in Grammar: Experimental and Computational Aspects of Degrees of Grammaticality. PhD thesis, University of Edinburgh. Kempen, Gerard, and Karin Harbusch (2008). “Comparing Linguistic Judgments and Corpus Frequencies as Windows on Grammatical Competence: A Study of Argument Linearization in German Clauses,” in A. Steube (ed.), The Discourse Potential of Underspecified Structures. Berlin: de Gruyter, 179–92. Klein, Colin (2007). “An Imperative Theory of Pain.” Journal of Philosophy 104: 517–32. Kluender, Robert, and Marta Kutas (1993). “Subjacency as a Processing Phenomenon.” Language and Cognitive Processes 8: 573–633. Knobe, Joshua, and Shaun Nichols (2008). “An Experimental Philosophy Manifesto,” in Joshua Knobe and Shaun Nichols (eds.), Experimental Philosophy. New York: Oxford University Press, 3–14. Koertge, Noretta (1977). “Galileo and the Problem of Accidents.” Journal of the History of Ideas 38(3): 389–408. Koksvik, Ole (2017). “The Phenomenology of Intuition.” Philosophy Compass 12(1). https://doi.org/10.1111/phc3.12387. Kolinsky, R., L. Cary, and J. Morais (1987). “Awareness of Words as Phonological Entities: The Role of Literacy.” Applied Psycholinguistics 8: 223–32.




Koriat, Asher (2007). “Metacognition and Consciousness,” in P. D. Zelazo, M. Moscovitch and E. Thompson (eds.), The Cambridge Handbook of Consciousness. Cambridge: Cambridge University Press, 289–326.
Kratzer, Angelika (1998). “Scope or Pseudo-Scope? Are There Wide-Scope Indefinites?” in Sue Rothstein (ed.), Events in Grammar. Dordrecht: Kluwer, 163–96.
Kripke, Saul (1980). Naming and Necessity. Cambridge, MA: Harvard University Press.
Kuhn, Thomas S. (1977). The Essential Tension: Selected Studies in Scientific Tradition and Change. Chicago, IL: University of Chicago Press.
Kurvers, Jeanne, Ton Vallen, and Roeland van Hout (2006). “Discovering Features of Language: Metalinguistic Awareness of Adult Illiterates,” in Ineke van de Craats, Jeanne Kurvers, and Martha Young-Scholten (eds.), Low-Educated Second Language and Literacy Acquisition: Proceedings of the Inaugural Symposium Tilburg 2005. Utrecht: LOT, 69–88.
Kutas, Marta, Katherine DeLong, and Nathaniel Smith (2011). “A Look around at What Lies Ahead: Prediction and Predictability in Language Processing,” in M. Bar (ed.), Predictions in the Brain. Oxford: Oxford University Press, 190–207.
Labov, William (1972). “Where Do Grammars Stop?” in Roger W. Shuy (ed.), Sociolinguistics: Current Trends and Prospects. Washington: Georgetown University School of Languages and Linguistics, 43–88.
Labov, William (1996). “When Intuitions Fail,” in Lisa McNair, Kora Singer, Lise Dolbrin, and Michelle Aucon (eds.), Papers from the Parasession on Theory and Data in Linguistics. Chicago, IL: Chicago Linguistics Society, 77–106.
Lackey, Jennifer (2008). Learning from Words. Oxford: Oxford University Press.
Lakatos, Imre (1970). “Falsification and the Methodology of Scientific Research Programmes.” Criticism and the Growth of Knowledge 4: 91–196.
Lam, Barry (2010). “Are Cantonese Speakers Really Descriptivists? Revisiting Cross-Cultural Semantics.” Cognition 115(2): 320–9.
Laming, Donald (1997). The Measurement of Sensation. London: Oxford University Press.
Landau, Idan (2010). “The Explicit Syntax of Implicit Arguments.” Linguistic Inquiry 41: 357–88.
Langsford, Steven, Amy Perfors, Andrew T. Hendrickson, Lauren A. Kennedy, and Danielle J. Navarro (2018). “Quantifying Sentence Acceptability Measures: Reliability, Bias, and Variability.” Glossa: A Journal of General Linguistics 3(1), 37. http://doi.org/10.5334/gjgl.396.
Lau, Jey Han, Alexander Clark, and Shalom Lappin (2017). “Grammaticality, Acceptability, and Probability: A Probabilistic View of Linguistic Knowledge.” Cognitive Science 41: 1202–41.
Lebeaux, David (1988). Language Acquisition and the Form of Grammar. PhD thesis, UMass, Amherst.
Lehmann, Christian (2004). “Data in Linguistics.” Linguistic Review 21(3/4): 175–210.
Levelt, Willem (1983). “Monitoring and Self-Repair in Speech.” Cognition 14: 41–104.
Levelt, Willem (1993). Speaking: From Intention to Articulation. Cambridge, MA: MIT Press.
Levin, Beth (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL: University of Chicago Press.
Lewis, Richard L. (1996). “Interference in Short-Term Memory: The Magical Number Two (or Three) in Sentence Processing.” Journal of Psycholinguistic Research 25(1): 93–115.
Lewis, Shevaun, and Colin Phillips (2015). “Aligning Grammatical Theories and Language Processing Models.” Journal of Psycholinguistic Research 44: 27–46.




Lidz, Jeffrey, and Annie Gagliardi (2015). “How Nature Meets Nurture: Universal Grammar and Statistical Learning.” Annual Review of Linguistics 1(12): 333–53.
Linebarger, Marcia C., Myrna F. Schwartz, and Eleanor M. Saffran (1983). “Sensitivity to Grammatical Structure in So-Called Agrammatic Aphasics.” Cognition 13: 361–92.
Linzen, Tal, and Yohei Oseki (2018). “The Reliability of Acceptability Judgments across Languages.” Glossa: A Journal of General Linguistics 3(1), 100. http://doi.org/10.5334/gjgl.528.
López, Luis (2001). “On the (Non)Complementarity of θ-Theory and Checking Theory.” Linguistic Inquiry 32: 694–716.
Luck, Steven J. (2014). An Introduction to the Event-Related Potential Technique. Cambridge, MA: MIT Press.
Ludlow, Peter (2011). The Philosophy of Generative Linguistics. Oxford: Oxford University Press.
Ludwig, Kirk (2007). “The Epistemology of Thought Experiments: First Person versus Third Person Approaches.” Midwest Studies in Philosophy 31(1): 128–59.
Luka, Barbara (2005). “A Cognitively Plausible Model of Linguistic Intuitions,” in S. S. Mufwene, E. Francis, and R. Wheeler (eds.), Polymorphous Linguistics: Jim McCawley’s Legacy. Cambridge, MA: MIT Press, 479–502.
Lukatela, Katerina, Claudia Carello, Donald Shankweiler, and Isabelle Liberman (1995). “Phonological Awareness in Illiterates: Observations from Serbo-Croatian.” Applied Psycholinguistics 16(4): 463–88.
Luria, Aleksandr (1976). Cognitive Development: Its Cultural and Social Foundations. Cambridge, MA: Harvard University Press.
Mach, Ernst (2014). The Analysis of Sensations. Chicago, IL: Open Court.
Machery, Edouard (2017). Philosophy within Its Proper Bounds. Oxford: Oxford University Press.
Machery, Edouard, Ron Mallon, Shaun Nichols, and Stephen P. Stich (2004). “Semantics, Cross-Cultural Style.” Cognition 92(3): B1–B12.
Machery, Edouard, Christopher Y. Olivola, and Molly De Blanc (2009). “Linguistic and Metalinguistic Intuitions in the Philosophy of Language.” Analysis 69(4): 689–94.
Machery, Edouard, and Stephen P. Stich (2012). “The Role of Experiment in the Philosophy of Language,” in G. Russell and D. G. Fara (eds.), The Routledge Companion to Philosophy of Language. New York: Routledge, 495–512.
Machery, Edouard, Stephen Stich, David Rose, Amita Chatterjee, Kaori Karasawa, Noel Struchiner, Smita Sirker, Naoki Usui, and Takaaki Hashimoto (2017). “Gettier across Cultures 1.” Noûs 51(3): 645–64.
Mahowald, Kyle, Peter Graff, Jeremy Hartman, and Edward Gibson (2016). “SNAP Judgments: A Small N Acceptability Paradigm (SNAP) for Linguistic Acceptability Judgments.” Language 92: 619–35.
Malmgren, Anna-Sara (2011). “Rationalism and the Content of Intuitive Judgements.” Mind 120(478): 263–327.
Manning, Christopher D. (2002). “Probabilistic Syntax,” in Rens Bod, Jennifer Hay, and Stefanie Jannedy (eds.), Probabilistic Linguistics. Cambridge, MA: MIT Press, 289–341.
Marantz, Alec (2005). “Generative Linguistics within the Cognitive Neuroscience of Language.” The Linguistic Review 22: 429–45.
Martí, Genoveva (2008). “Against Semantic Multi-Culturalism.” Analysis 69(1): 42–8.
Martin, Roger (2001). “Null Case and the Distribution of PRO.” Linguistic Inquiry 32: 141–66.
Matthews, Robert J. (2006). “Could Competent Speakers Really Be Ignorant of Their Language?” Croatian Journal of Philosophy 6: 457–67.




Matthews, Robert J. (n.d.). “Linguistic Intuition: An Exercise in Linguistic Competence.” Unpublished ms.
Maynes, Jeffrey (2012). “Linguistic Intuition and Calibration.” Linguistics and Philosophy 35(5): 443–60.
Maynes, Jeffrey (2017). “On the Stakes of Experimental Philosophy.” Teorema: Revista internacional de filosofía 36(3): 45–60.
Maynes, Jeffrey, and Steven Gross (2013). “Linguistic Intuitions.” Philosophy Compass 8(8): 714–30.
McKinsey, Michael (1987). “Apriorism in the Philosophy of Language.” Philosophical Studies 52: 1–32.
McLeod, Peter, and Zoltan Dienes (1996). “Do Fielders Know Where to Go to Catch the Ball or Only How to Get There?” Journal of Experimental Psychology: Human Perception and Performance 22: 531–43.
Mehl, Matthias R., Simine Vazire, Nairán Ramirez-Esparza, Richard B. Slatcher, and James W. Pennebaker (2007). “Are Women Really More Talkative Than Men?” Science 317(5834). doi: 10.1126/science.1139940.
Miller, George A., and Noam Chomsky (1963). “Finitary Models of Language Users,” in R. D. Luce, R. R. Bush, and E. Galanter (eds.), Handbook of Mathematical Psychology, vol. 2. New York: Wiley, 419–92.
Miller, Jim, and Regina Weinert (1998). Spontaneous Spoken Language: Syntax and Discourse. Oxford: Clarendon.
Miščević, Nenad (2006). “Intuitions: The Discrete Voice of Competence.” Croatian Journal of Philosophy 6: 523–48.
Miščević, Nenad (2009). “Competent Voices: A Theory of Intuitions.” http://oddelki.ff.uni-mb.si/filozofija/en/festschrift.
Miščević, Nenad (2012). “Odgovor Dunja Jutronić” [“Answer to Dunja Jutronić”], in Snježana Prijić-Samaržija and Petar Bojanić (eds.), Nenad Miščević: Sva lica filozofije [Nenad Miščević: All Faces of Philosophy]. Belgrade: Institut za filozofiju i društvenu teoriju, 194–8.
Miščević, Nenad (2014a). “Reply to Dunja Jutronić.” European Journal of Analytic Philosophy 10(2): 145–53.
Miščević, Nenad (2014b). “Reply to Michael Devitt.” European Journal of Analytic Philosophy 10(2): 21–30.
Miščević, Nenad (2018). “Intuitions: Epistemology and Metaphysics of Language.” Croatian Journal of Philosophy 17(53): 253–76.
Momma, Shota, and Colin Phillips (2018). “The Relationship between Parsing and Generation.” Annual Review of Linguistics 4: 233–54.
Montalbetti, Mario (1984). After Binding. PhD thesis, MIT, Cambridge, MA.
Montero, Barbara (2016). Thought in Action. Oxford: Oxford University Press.
Morais, José, Luz Cary, Jésus Alegria, and Paul Bertelson (1979). “Does Awareness of Speech as a Sequence of Phones Arise Spontaneously?” Cognition 7(4): 323–31.
Mortensen, Kaija, and Jennifer Nagel (2016). “Armchair-Friendly Experimental Philosophy,” in Justin Sytsma and Wesley Buckwalter (eds.), A Companion to Experimental Philosophy. Oxford: Wiley Blackwell, 53–70.
Munro, Robert, Steven Bethard, Victor Kuperman, Vicky Tzuyin Lai, Robin Melnick, Christopher Potts, Tyler Schnoebelen, and Harry Tily (2010). “Crowdsourcing and Language Studies: The New Generation of Linguistic Data,” in Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk. Stroudsburg, PA: Association for Computational Linguistics, 122–30.




Myers, James (2009). “Syntactic Judgment Experiments.” Language and Linguistics Compass 3: 406–23.
Nado, Jennifer (2014). “Why Intuition?” Philosophy and Phenomenological Research 89(1): 15–41.
Nagel, Jennifer, Valerie San Juan, and Raymond A. Mar (2013). “Lay Denial of Knowledge for Justified True Beliefs.” Cognition 129(3): 652–61.
Nagel, Jennifer (2012). “Intuitions and Experiments: A Defense of the Case Method in Epistemology.” Philosophy and Phenomenological Research 85(3): 495–527.
Neale, Stephen (2004). “This, That, and the Other,” in A. Bezuidenhout and M. Reimer (eds.), Descriptions and Beyond. Oxford: Clarendon, 68–82.
Neville, Helen, Janet L. Nicol, Andrew Barss, Kenneth I. Forster, and Merrill F. Garrett (1991). “Syntactically Based Sentence Processing Classes: Evidence from Event-Related Brain Potentials.” Journal of Cognitive Neuroscience 3: 151–65.
Newmeyer, Frederick (1983). Grammatical Theory: Its Limits and Its Possibilities. Chicago, IL: University of Chicago Press.
Newmeyer, Frederick (2003). “Grammar Is Grammar and Usage Is Usage.” Language 79: 682–707.
Newmeyer, Frederick (2007). “Commentary on Sam Featherston, ‘Data in generative grammar: The stick and the carrot.’” Theoretical Linguistics 33: 395–9.
Noveck, Ira, and Dan Sperber (2007). “The Why and How of Experimental Pragmatics: The Case of Scalar Inferences,” in N. Burton-Roberts (ed.), Advances in Pragmatics. Basingstoke: Palgrave Macmillan, 184–212.
Nozari, Nazbanou, Gary Dell, and Myrna Schwartz (2011). “Is Comprehension the Basis for Error Detection? A Conflict-Based Theory of Error Detection in Speech Production.” Cognitive Psychology 63: 1–33.
Nozari, Nazbanou, and Jared Novick (2017). “Monitoring and Control in Language Production.” Current Directions in Psychological Science 26: 403–10.
O’Callaghan, Casey (2011). “Against Hearing Meanings.” Philosophical Quarterly 61(245): 783–807.
O’Callaghan, Casey (2015). “Speech Perception,” in M. Matthen (ed.), Oxford Handbook of the Philosophy of Perception. Oxford: Oxford University Press, 475–94.
Oomen, Claudy, Albert Postma, and Herman Kolk (2005). “Speech Monitoring in Aphasia: Error Detection and Repair Behavior in a Patient with Broca’s Aphasia,” in R. J. Hartsuiker, R. Bastiaanse, A. Postma, and F. Wijnen (eds.), Phonological Encoding and Monitoring in Normal and Pathological Speech. New York: Psychology Press, 209–25.
Open Science Collaboration (2015). “Estimating the Reproducibility of Psychological Science.” Science 349(6251). doi: 10.1126/science.aac4716.
Overgaard, Morten (2011). “Visual Experience and Blindsight: A Methodological Review.” Experimental Brain Research 209(4): 473–9.
Pallier, Christophe, Anne-Dominique Devauchelle, and Stanislas Dehaene (2011). “Cortical Representation of the Constituent Structure of Sentences.” Proceedings of the National Academy of Sciences 108: 2252–7.
Pappas, George (2017). “Internalist vs. Externalist Conceptions of Epistemic Justification,” in E. Zalta (ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2017/entries/justep-intext.
Parker, Dan, and Colin Phillips (2016). “Negative Polarity Illusions and the Format of Hierarchical Encodings in Memory.” Cognition 157: 321–39.




Patel, Aniruddh, John Iversen, Marlies Wassenaar, and Peter Hagoort (2008). “Musical Syntactic Processing in Agrammatic Broca’s Aphasia.” Aphasiology 22: 776–89.
Peacocke, Christopher (1992). A Study of Concepts. Cambridge, MA: MIT Press.
Peacocke, Christopher (2004). The Realm of Reason. Oxford: Oxford University Press.
Pearlmutter, Neal J., Susan M. Garnsey, and Kathryn Bock (1999). “Agreement Processes in Sentence Comprehension.” Journal of Memory and Language 41: 427–56.
Pereplyotchik, David (2017). Psychosyntax: The Nature of Grammar and Its Place in the Mind. Springer (eBook).
Pesetsky, David (1987). “Wh-in-situ: Movement and Unselective Binding,” in Eric Reuland and Alice ter Meulen (eds.), The Representation of (In)Definiteness. Cambridge, MA: MIT Press, 98–129.
Phelan, Mark (2014). “Experimental Pragmatics: An Introduction for Philosophers.” Philosophy Compass 9(1): 66–79.
Phillips, Colin (2009). “Should We Impeach Armchair Linguists?” in S. Iwasaki, H. Hoji, P. M. Clancy, and S.-O. Sohn (eds.), Japanese and Korean Linguistics, vol. 17. Stanford, CA: CSLI Publications, 49–64.
Phillips, Colin (2013). “On the Nature of Island Constraints. I: Language Processing and Reductionist Accounts,” in Jon Sprouse and Norbert Hornstein (eds.), Experimental Syntax and Island Effects. Cambridge: Cambridge University Press, 64–108.
Phillips, Colin, Matthew W. Wagers, and Ellen F. Lau (2011). “Grammatical Illusions and Selective Fallibility in Real-Time Language Comprehension,” in J. Runner (ed.), Experiments at the Interfaces. Bingley: Emerald, 147–80.
Pickering, Martin, and V. Ferreira (2008). “Structural Priming: A Critical Review.” Psychological Bulletin 134(3): 427–59.
Pickering, Martin, and Simon Garrod (2013). “An Integrated Theory of Language Production and Comprehension.” Behavioral and Brain Sciences 36(4): 329–47.
Pickering, Martin John, and Simon Garrod (2014). “Self-, Other-, and Joint Monitoring Using Forward Models.” Frontiers in Human Neuroscience 8(132). http://eprints.gla.ac.uk/93233.
Pietroski, Paul (2002). “Small Verbs, Complex Events: Analyticity without Synonymy,” in L. Antony and N. Hornstein (eds.), Chomsky and His Critics. Oxford: Blackwell, 179–214.
Pietroski, Paul (2008). “Think of the Children.” Australasian Journal of Philosophy 86: 657–69.
Plakias, Alexandra (2015). “Experimental Philosophy,” in Oxford Online Handbooks of Philosophy. doi: 10.1093/oxfordhb/9780199935314.013.17.
Plantinga, Alvin (1993). Warrant and Proper Function. New York: Oxford University Press.
Pollard, Carl, and Ivan A. Sag (1994). Head-Driven Phrase Structure Grammar. Chicago, IL: University of Chicago Press.
Postma, A. (2000). “Detection of Errors during Speech Production: A Review of Speech Monitoring Models.” Cognition 77(2): 97–132.
Poulton, Eustace (1989). Bias in Quantifying Judgements. Hove: Erlbaum.
Prinz, Jesse (2011). “The Sensory Basis of Cognitive Phenomenology,” in T. Bayne and M. Montague (eds.), Cognitive Phenomenology. Oxford: Oxford University Press, 174–96.
Pust, Joel (2000). Intuitions as Evidence. New York: Routledge.
Putnam, Hilary (1979). “Is Semantics Possible?” Metaphilosophy 1: 187–201.
Pylyshyn, Z. (2006). Seeing and Visualizing: It’s Not What You Think. Cambridge, MA: MIT Press.




Quine, Willard Van Orman (1951). “Two Dogmas of Empiricism,” in W. V. O. Quine, From a Logical Point of View (2nd edn.). Cambridge, MA: Harvard University Press, 20–46.
Quine, Willard Van Orman (1976 [1954]). “Carnap and Logical Truth,” in W. V. O. Quine, Ways of Paradox and Other Essays (2nd edn.). Cambridge, MA: Harvard University Press, 107–32.
Rabagliati, Hugh, and Alexander Robertson (2017). “How Do Children Learn to Avoid Referential Ambiguity? Insights from Eye-Tracking.” Journal of Memory and Language 94: 15–27.
Ramachandra, Vijayachandra, and Prathibha Karanth (2007). “The Role of Literacy in the Conceptualization of Words: Data from Kannada-Speaking Children and Non-Literate Adults.” Reading and Writing 20(3): 173–99.
Rattan, Gurpreet (2006). “The Knowledge in Language.” Croatian Journal of Philosophy 6: 505–21.
Recanati, François (2004). Literal Meaning. Cambridge: Cambridge University Press.
Recanati, François (2010). Truth-Conditional Pragmatics. Oxford: Oxford University Press.
Recanati, François (2013). “Reply to Devitt.” Teorema 32(2): 103–8.
Reich, Peter A. (1969). “The Finiteness of Natural Language.” Language 45: 831–43.
Reinhart, Tanya (1983). Anaphora and Semantic Interpretation. Chicago, IL: University of Chicago Press.
Reinhart, Tanya (1997). “Quantifier Scope: How Labor Is Divided between QR and Choice Functions.” Linguistics and Philosophy 20: 335–97.
Reisberg, Daniel (1999). “Learning,” in Robert A. Wilson and Frank C. Keil (eds.), The MIT Encyclopedia of the Cognitive Sciences. Cambridge, MA: MIT Press, 460–1.
Remez, Robert E., Philip E. Rubin, David B. Pisoni, and Thomas D. Carrell (1981). “Speech Perception without Traditional Speech Cues.” Science 212(4497): 947–9.
Rey, Georges (1998). “A Naturalistic A Priori.” Philosophical Studies 92: 25–43.
Rey, Georges (2003). “Representational Content and a Chomskyan Linguistics,” in Alex Barber (ed.), Epistemology of Language. Oxford: Oxford University Press, 140–86.
Rey, Georges (2006). “Conventions, Intuitions and Linguistic Inexistents: A Reply to Devitt.” Croatian Journal of Philosophy 6: 549–69.
Rey, Georges (2014a). “Innate and Learned: Carey, Mad Dog Nativism, and the Poverty of Stimuli and Analogies (Yet Again).” Mind & Language 29(2): 109–32.
Rey, Georges (2014b). “The Possibility of a Naturalistic Cartesianism Regarding Intuitions and Introspection,” in Matthew Haug (ed.), Philosophical Methodology: The Armchair or the Laboratory? London: Routledge, 243–67.
Rey, Georges (forthcoming-a). “Explanation First! The Priority of Scientific Over ‘Commonsense’ Metaphysics,” in A. Bianchi (ed.), Language and Reality from a Naturalistic Perspective: Themes from Michael Devitt. New York: Springer.
Rey, Georges (forthcoming-b). Representation of Language: Philosophical Issues in a Chomskyan Linguistics. Oxford: Oxford University Press.
Richards, Norvin (2004). “Against Bans on Lowering.” Linguistic Inquiry 35: 453–63.
Riemer, Nick (2009). “Grammaticality as Evidence and as Prediction in a Galilean Linguistics.” Language Sciences 31(5): 612–33.
Rizzi, Luigi (1990). Relativized Minimality. Cambridge, MA: MIT Press.
Ryan, Ellen Bouchard, and George W. Ledger (1984). “Learning to Attend to Sentence Structure: Links Between Metalinguistic Development and Reading,” in John Downing and Renate Valtin (eds.), Language Awareness and Learning to Read. New York: Springer-Verlag, 149–71.




Sachs, Jacqueline (1967). “Recognition Memory for Syntactic and Semantic Aspects of Connected Discourse.” Perception and Psychophysics 2: 437–42.
Santana, Carlos (2016). “What Is Language?” Ergo: An Open Access Journal of Philosophy 3(19). http://dx.doi.org/10.3998/ergo.12405314.0003.019.
Santana, Carlos (2018). “Why Not All Evidence Is Scientific Evidence.” Episteme 15(2): 209–27.
Savage, C. Wade (1970). The Measurement of Sensation. Berkeley: University of California Press.
Schindler, Samuel (2013). “Theory-Laden Experimentation.” Studies in History and Philosophy of Science, Part A 44(1): 89–101.
Schindler, Samuel (2018). Theoretical Virtues in Science: Uncovering Reality through Theory. Cambridge: Cambridge University Press.
Schindler, Samuel, and Pierre Saint-Germier (n.d.). “Putting Philosophical Expertise to the Test.” Paper accepted for publication (American Philosophical Association in Vancouver, 2019).
Schnall, Simone, Jonathan Haidt, Gerald L. Clore, and Alexander H. Jordan (2008). “Disgust as Embodied Moral Judgment.” Personality and Social Psychology Bulletin 34: 1096–109.
Schulz, Eric, Edward T. Cokely, and Adam Feltz (2011). “Persistent Bias in Expert Judgments about Free Will and Moral Responsibility: A Test of the Expertise Defense.” Consciousness and Cognition 20(4): 1722–31.
Schütze, Carson T. (1996). The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology. Chicago, IL: University of Chicago Press. [Reprinted in 2016 by Language Science Press in Berlin.]
Schütze, Carson T. (2009). “Web Searches Should Supplement Judgements, Not Supplant Them.” Zeitschrift für Sprachwissenschaft 28: 151–6.
Schütze, Carson T. (2011). “Linguistic Evidence and Grammatical Theory.” Wiley Interdisciplinary Reviews: Cognitive Science 2: 206–21.
Schütze, Carson T., and Jon Sprouse (2013). “Judgment Data,” in Robert J. Podesva and Devyani Sharma (eds.), Research Methods in Linguistics. Cambridge: Cambridge University Press, 27–50.
Schwitzgebel, Eric, and Fiery Cushman (2012). “Expertise in Moral Reasoning? Order Effects on Moral Judgment in Professional Philosophers and Non-Philosophers.” Mind & Language 27(2): 135–53.
Schwitzgebel, Eric, and Fiery Cushman (2015). “Philosophers’ Biased Judgments Persist despite Training, Expertise and Reflection.” Cognition 141: 127–37.
Seyedsayamdost, Hamid (2015a). “On Gender and Philosophical Intuition: Failure of Replication and Other Negative Results.” Philosophical Psychology 28(5): 642–73.
Seyedsayamdost, Hamid (2015b). “On Normativity and Epistemic Intuitions: Failure of Replication.” Episteme 12(1): 95–116.
Shadmehr, Reza, Maurice Smith, and John Krakauer (2010). “Error Correction, Sensory Prediction, and Adaptation in Motor Control.” Annual Review of Neuroscience 33: 89–108.
Shea, Nicholas (2007). “Consumers Need Information: Supplementing Teleosemantics with an Input Condition.” Philosophy and Phenomenological Research 75: 404–35.
Shea, Nicholas (2012). “Reward Prediction Error Signals Are Meta-Representational.” Noûs 48: 314–41.
Siegel, Susanna (2006). “Which Properties Are Represented in Perception?” in T. S. Gendler and J. Hawthorne (eds.), Perceptual Experience. Oxford: Oxford University Press, 481–503.




Silbert, Lauren J., Christopher J. Honey, Erez Simony, David Poeppel, and Uri Hasson (2014). “Coupled Neural Systems Underlie the Production and Comprehension of Naturalistic Narrative Speech.” Proceedings of the National Academy of Sciences 111(43): E4687–E4696.
Simon, Herbert A. (1992). “What Is an ‘Explanation’ of Behaviour?” Psychological Science 3(3): 150–61.
Smith, Barry C. (2006). “Why We Still Need Knowledge of Language.” Croatian Journal of Philosophy 6: 431–56.
Smith, Neil (2004 [1999]). Chomsky: Ideas and Ideals (2nd edn.). Cambridge: Cambridge University Press.
Smith, Neil (2014). “Philosophical and Empirical Approaches to Language,” in M. Haug (ed.), Philosophical Methodology: The Armchair or the Laboratory? London: Routledge, 294–317.
Smith, Neil, and Nicholas Allott (2016). Chomsky: Ideas and Ideals. Cambridge: Cambridge University Press.
Soames, Scott (1984). “Linguistics and Psychology.” Linguistics and Philosophy 7: 155–79.
Soames, Scott (1985). “Semantics and Psychology,” in Jerrold Katz (ed.), Philosophy of Linguistics. Oxford: Oxford University Press, 204–26.
Song, Sanghoun, Jae-Woong Choe, and Eunjeong Oh (2014). “FAQ: Do Non-Linguists Share the Same Intuition as Linguists?” Language Research 50: 357–86.
Sontag, Katrin (2007). “Parallel Worlds: Fieldwork with Elves, Icelanders and Academics.” MA dissertation, University of Iceland.
Sorace, Antonella, and Frank Keller (2005). “Gradience in Linguistic Data.” Lingua 115(11): 1497–1524.
Sperber, Dan, and Deirdre Wilson (2002). “Pragmatics, Modularity and Mindreading.” Mind & Language 17: 2–23.
Sprouse, Jon (2007). “Continuous Acceptability, Categorical Grammaticality, and Experimental Syntax.” Biolinguistics 1: 118–29.
Sprouse, Jon (2008). “The Differential Sensitivity of Acceptability to Processing Effects.” Linguistic Inquiry 39(4): 686–94.
Sprouse, Jon (2009). “Revisiting Satiation.” Linguistic Inquiry 40(2): 329–41.
Sprouse, Jon (2011). “A Test of the Cognitive Assumptions of Magnitude Estimation: Commutativity Does Not Hold for Acceptability Judgments.” Language 87(2): 274–88.
Sprouse, Jon (2013). “Acceptability Judgments,” in M. Aronoff (ed.), Oxford Bibliographies Online: Linguistics. Oxford: Oxford University Press. http://www.socsci.uci.edu/~jsprouse/papers/Acceptability.Judgments.OUP.pdf.
Sprouse, Jon (2015). “Three Open Questions in Experimental Syntax.” Linguistics Vanguard 1(1): 89–100.
Sprouse, Jon (2018). “Acceptability Judgments and Grammaticality: Prospects and Challenges,” in N. Hornstein, C. Yang, and P. Patel-Grosz (eds.), Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Berlin: Mouton de Gruyter, 195–224.
Sprouse, Jon, and Diogo Almeida (2012a). “Assessing the Reliability of Textbook Data in Syntax: Adger’s Core Syntax.” Journal of Linguistics 48(3): 609–52.
Sprouse, Jon, and Diogo Almeida (2012b). “The Role of Experimental Syntax in an Integrated Science,” in Kleanthes Grohmann and Cedric Boeckx (eds.), The Cambridge Handbook of Biolinguistics. Cambridge: Cambridge University Press, 181–202.
Sprouse, Jon, and Diogo Almeida (2012c). “Power in Acceptability Judgment Experiments and the Reliability of Data in Syntax.” Unpublished manuscript. https://ling.auf.net/lingbuzz/001520.




Sprouse, Jon, and Diogo Almeida (2013). “The Empirical Status of Data in Syntax: A Reply to Gibson and Fedorenko.” Language and Cognitive Processes 28: 222–8.
Sprouse, Jon, and Diogo Almeida (2017a). “Design Sensitivity and Statistical Power in Acceptability Judgment Experiments.” Glossa: A Journal of General Linguistics 2(1), 14. http://doi.org/10.5334/gjgl.236.
Sprouse, Jon, and Diogo Almeida (2017b). “Setting the Empirical Record Straight: Acceptability Judgments Appear to Be Reliable, Robust, and Replicable.” Behavioral and Brain Sciences 40. https://doi.org/10.1017/S0140525X17000590.
Sprouse, Jon, and Norbert Hornstein (eds.) (2013). Experimental Syntax and Island Effects. Cambridge: Cambridge University Press.
Sprouse, Jon, and Carson T. Schütze (2020). “Grammar and the Use of Data,” in Bas Aarts, Jill Bowie, and Gergana Popova (eds.), The Oxford Handbook of English Grammar. Oxford: Oxford University Press, 40–58.
Sprouse, Jon, Carson T. Schütze, and Diogo Almeida (2013). “A Comparison of Informal and Formal Acceptability Judgments Using a Random Sample from Linguistic Inquiry 2001–2010.” Lingua 134: 219–48.
Stabler, Edward P. (1991). “Avoid the Pedestrian’s Paradox,” in Robert C. Berwick, Steven P. Abney, and Carol Tenny (eds.), Principle-Based Parsing. Dordrecht: Springer, 199–237.
Stanley, Jason, and Zoltan Szabó (2000). “On Quantifier Domain Restriction.” Mind & Language 15(2/3): 219–61.
Stevens, Jon Scott, Anton Benz, Sebastian Reuße, Ronja Laarmann-Quante, and Ralf Klabunde (2014). “Indirect Answers as Potential Solutions to Decision Problems,” in Verena Rieser and Philippe Muller (eds.), Proceedings of the 18th Workshop on the Semantics and Pragmatics of Dialogue. Edinburgh: SEMDIAL, 145–54.
Stevens, Stanley (1975). Psychophysics: Introduction to Its Perceptual, Neural and Social Prospects. New York: John Wiley.
Stich, Stephen P. (1996). Deconstructing the Mind. New York: Oxford University Press.
Sulzby, Elizabeth, and William H. Teale (1991). “Emergent Literacy,” in R. Barr, M. L. Kamil, P. B. Mosenthal, and P. D. Pearson (eds.), Handbook of Reading Research, vol. 2. Mahwah, NJ: Lawrence Erlbaum Associates, 727–58.
Swain, Stacey, Joshua Alexander, and Jonathan M. Weinberg (2008). “The Instability of Philosophical Intuitions: Running Hot and Cold on Truetemp.” Philosophy and Phenomenological Research 76(1): 138–55.
Sytsma, Justin, Jonathan Livengood, Ryoji Sato, and Mineki Oguchi (2015). “Reference in the Land of the Rising Sun: A Cross-Cultural Study on the Reference of Proper Names.” Review of Philosophy and Psychology 6(2): 213–30.
Szucs, Denes, and John P. A. Ioannidis (2017). “Empirical Assessment of Published Effect Sizes and Power in the Recent Cognitive Neuroscience and Psychology Literature.” PLoS Biology 15(3): e2000797. https://doi.org/10.1371/journal.pbio.2000797.
Textor, Mark (2009). “Devitt on the Epistemic Authority of Linguistic Intuitions.” Erkenntnis 71: 395–405.
Thompson, Sandra A., and Paul J. Hopper (2001). “Transitivity, Clause Structure, and Argument Structure: Evidence from Conversation,” in Joan L. Bybee and Paul Hopper (eds.), Frequency and the Emergence of Linguistic Structure. Amsterdam: John Benjamins, 27–60.
Tian, Xing, and David Poeppel (2010). “Mental Imagery of Speech and Movement Implicates the Dynamics of Internal Forward Models.” Frontiers in Psychology 1. doi: 10.3389/fpsyg.2010.00166.




Tobia, Kevin, Wesley Buckwalter, and Stephen Stich (2013). “Moral Intuitions: Are Philosophers Experts?” Philosophical Psychology 26(5): 629–38.
Tolchinsky, Liliana (2004). “Childhood Conceptions of Literacy,” in T. Nunes and P. Bryant (eds.), Handbook of Children’s Literacy. Dordrecht: Kluwer, 11–30.
Tomasello, Michael (ed.) (1998). The New Psychology of Language: Cognitive and Functional Approaches to Language Structure. Mahwah, NJ: Lawrence Erlbaum.
Tucker, Chris (2013). “Seemings and Justification: An Introduction,” in Chris Tucker (ed.), Seemings and Justification: New Essays on Dogmatism and Phenomenal Conservatism. New York: Oxford University Press, 1–29.
Turri, John (2013). “A Conspicuous Art: Putting Gettier to the Test.” Philosophers’ Imprint 13(10): 1–16.
Vasishth, Shravan, Sven Brüssow, Richard L. Lewis, and Heiner Drenhaus (2008). “Processing Polarity: How the Ungrammatical Intrudes on the Grammatical.” Cognitive Science 32: 685–712.
Vasishth, Shravan, Katja Suckow, Richard L. Lewis, and Sabine Kern (2010). “Short-Term Forgetting in Sentence Comprehension: Crosslinguistic Evidence from Verb-Final Structures.” Language and Cognitive Processes 25: 533–67.
Vigliocco, G., and D. Vinson (2003). “Speech Production,” in L. Nadel (ed.), Encyclopedia of Cognitive Science, vol. 4. London: Nature Publishing Group, 182–9.
Wagers, Matthew, Ellen Lau, and Colin Phillips (2009). “Agreement Attraction in Comprehension: Representations and Processes.” Journal of Memory and Language 61: 206–37.
Ward, Paul, and A. Mark Williams (2003). “Perceptual and Cognitive Skill Development in Soccer: The Multidimensional Nature of Expert Performance.” Journal of Sport & Exercise Psychology 25: 93–111.
Wasow, Thomas (2009). “Gradient Data and Gradient Grammars,” in M. Elliott, J. Kirby, O. Sawada, E. Staraki, and S. Yoon (eds.), Proceedings of the 43rd Annual Meeting of the Chicago Linguistics Society. Chicago, IL: Chicago Linguistic Society, 255–71.
Wasow, Thomas, and Jennifer Arnold (2005). “Intuitions in Linguistic Argumentation.” Lingua 115(11): 1481–96. Weinberg, Jonathan M. (2009). “On Doing Better, Experimental-Style.” Philosophical Studies 145(3): 455–64. Weinberg, Jonathan M. (2016). “Intuitions,” in H. Cappelen, T. Gendler, and J. P. Hawthorne (eds.), The Oxford Handbook of Philosophical Methodology. Oxford: Oxford University Press, 287–308. Weinberg, Jonathan M., Shaun Nichols, and Stephen Stich (2001). “Normativity and Epistemic Intuitions.” Philosophical Topics 29(1/2): 429–60. Weinberg, Jonathan M., Chad Gonnerman, Cameron Buckner, and Joshua Alexander (2010). “Are Philosophers Expert Intuiters?” Philosophical Psychology 23(3): 331–55. Wells, Justine B., Morten H. Christiansen, David S. Race, Daniel J. Acheson, and Maryellen C. MacDonald (2009). “Experience and Sentence Processing: Statistical Learning and Relative Clause Comprehension.” Cognitive Psychology 58: 250–71. Wellwood, Alexis, Roumyana Pancheva, Valentine Hacquard, and Colin Phillips (2018). “The Anatomy of a Comparative Illusion.” Journal of Semantics 35: 543–83. Wendt, Dorothea, Thomas Brand, and Birger Kollmeier (2014). “An Eye-Tracking Paradigm for Analyzing the Processing Time of Sentences with Different Linguistic Complexities.” PLoS One 9(6): e100186. https://doi.org/10.1371/journal.pone.0100186. Weskott, Thomas, and Gisbert Fanselow (2008). “Variance and Informativity in Different Measures of Linguistic Acceptability,” in Natasha Abner and Jason Bishop (eds.),




Proceedings of the 27th West Coast Conference on Formal Linguistics (WCCFL). Somerville, MA: Cascadilla Press, 431–9. Weskott, Thomas, and Gisbert Fanselow (2011). “On the Informativity of Different Measures of Linguistic Acceptability.” Language 87(2): 249–73. Wexler, Kenneth, and Peter Culicover (1980). Formal Principles of Language Acquisition. Cambridge, MA: MIT Press. Wheatley, Thalia, and Jonathan Haidt (2005). “Hypnotically Induced Disgust Makes Moral Judgments More Severe.” Psychological Science 16: 780–4. Williamson, Timothy (2007a). “On Being Justified in One’s Head,” in Mark Timmons, John Greco, and Alfred R. Mele (eds.), Rationality and the Good: Critical Essays on the Ethics and Epistemology of Robert Audi. Oxford: Oxford University Press, 106–22. Williamson, Timothy (2007b). The Philosophy of Philosophy. Oxford: Blackwell. Williamson, Timothy (2011). “Philosophical Expertise and the Burden of Proof.” Metaphilosophy 42(3): 215–29. Wilson, Deirdre, and Dan Sperber (2012). Meaning and Relevance. Cambridge: Cambridge University Press. Witzel, Christoph (forthcoming). “Misconceptions about Colour Categories.” Review of Philosophy and Psychology. Wolpert, Daniel (1997). “Computational Approaches to Motor Control.” Trends in Cognitive Sciences 1: 209–16.

Index

acceptability judgments 1, 3, 6–9, 13–14, 16, 18, 21–3, 26–8, 31, 35, 59, 67, 69, 72–3, 77, 79, 80 note, 82 table, 83–5, 96, 99, 102, 115 note, 124, 133, 135, 139, 145, 169, 173–4, 182–3, 195, 201, 215–32, 224 figure, 231 figure, 233–51, 254, 257–66, 269, 270 note, see also intuitions/intuitive judgments source ambiguity problem 238–9, 253 theories of 215–17, 222, 232 aggregation effects (in experiments) 247, 249–50, 253–4 Almeida, D. 7, 165, 167, 168 figure, 189, 194 note, 223, 225–6, 226 figure, 258, 260, 266 ambiguity (semantic or syntactic) 20, 32, 37–8, 96–100, 133, 190, 196, 206, 239–40, 243, 245, 247, 253, 263–4 Anderson, John R. 58 note argument structure (of a sentence) 151–3 armchair methods see also intuitions/intuitive judgments: armchair judgments; traditional method in linguistics 6–7, 9, 119–20, 125–6, 134, 166, 172, 175, 178–9 in philosophy 109, 112, 114, 121, 266, 269, 273 Arnold, Jennifer 41, 56, 130, 142, 223, 236 note, 256–7 bias 1, 189 note, 197, 222, 240, 257 confirmation bias 13, 17, 133–4 scale bias 247, 250, 253–4 theoretical 8–9, 149, 216, 223, 225, 232, 241–2, 257, 259–60, 263–4, 268–9, 271–3 Bloomfield, Leonard 130 Bock, J. Kathryn 48 Bond, Zinny 48 Cairns, H. 38, 48 Carroll, Lewis 47, 104 Cartesian 34, 36, 72 access 75, 77, see also direct access cataphors 154 center embedding 25, 214, 239–40 central processor see language processing Chomsky, Noam 6, 13, 15 note, 16, 25, 27, 38, 43, 47, 70, 72 note, 93, 97, 101–2, 104, 129–30,

135, 163, 166, 176, 178, 180, 217, 233, 235–6, 239, 242–3, 261, 264–5, 268 Chomskyan/Chomskian linguistics 1–2, 33, 35–6, 38, 42, 48 note, 51, 52 note, 59–61, 63, 70, 156, 255, see also formal linguistics; generative linguistics Collins, John 5, 15 note, 33 note, 51 note, 52 note, 69–70, 91, 94, 129, 260 competence (grammatical) 1–5, 25, 29–30, 33, 36, 38, 41, 52–6, 58–62, 66–7, 69–70, 72–80, 82 table, 83–4, 90, 92, 99, 108, 118 note, 131, 134–9, 141, 158–60, 163, 235, 237 note, 238–9, 263 note, see also Voice of Competence privileged access 15–19, 25, 34–5, 52, 73 comprehension 14, 18–20, 27–8, 30–2, 36–7, 67, 110, 115–18, 135, 137, 158, 221, 229 conversational data see corpora co-reference 13–14, 28, 39–40, 48, 52, 67 corpora 149, 151–63 Cowart, Wayne 6, 48, 149, 217, 235, 251, 255–7, 259 Culbertson, Jennifer 14, 17–18, 52 note, 59 note, 69 note, 72 note, 73, 131 decathlon model (grammar) 237–8, 253 deduction 16, 70, 74, 79–80, 82 table, 83, 85, 133–4 descriptive adequacy 110, 119–20, 177 Devitt, Michael 2–4, 14–19, 21, 25–32, 33–49, 51–67, 69–77, 82–3, 95, 109, 112, 118 note, 120–1, 125–6, 132–7, 236, 241, 270 diacritics 235, 242–6, 248, 250, 253 Dienes, Zoltan 64 direct access 5, 64, 74, 80, 82 table, 85, 103, see also Cartesian: access Domaneschi, Filippo 57 note effect size (statistical) 79 note, 139, 176, 183, 226–8, 231–2, 233 note, 258, 266 E-language 16, 33 note, 35 epistemological tolerance 265 error signals see language processing: error signals etiology of intuitions see intuitions/intuitive judgments: etiology of




experience 2–3, 17, 25 note, 28 note, 34 note, 41, 47, 53 note, 54, 58, 60, 137, 140, 160, 219, 238, 260 auditory-parsing 42 conscious 109, 115 firsthand 157 linguistic 133, 135 perceptual 5, 37, 109, 114–15, 123–4 personal 180 reflections on 78, 82 table, 84 speaker’s 1, 55, 72–3 experiment(s) 7–9, 17 note, 37, 39, 46, 48, 108, 139–41, 144, 149–50, 158, 169–70, 172–3, 175, 176 table, 179, 181 table, 183–4, 186, 190–202, 204, 207, 212–13, 223, 225, 228–9, 235, 239, 244–53, 247 figure, 249 figure, 252 figure, 255–6, 259–60, 261 note, 263 note, 264, 266–7, 271–2 experimental method/approach 8–9, 120, 126, 138, 141, 166–8, 182, 196, 255–8, 262, 264, 269, 270–3 data 7, 150, 163, 166, 179, 182, 185 extraneous factors 80 note, 257, 260, 270–1 judgments 166 note, 167–8, 172, 178–9, 183–4, 187, 233, 245, 271 linguistics 230 logic 8, 219–20, 229, 232 philosophy 5, 110, 112, 120–1, 255, 267–8, 270–1 pragmatics 120 semantics 120 studies 125, 183, 185, 232, 257, 270 syntax/syntacticians 6–7, 9, 179, 255–6, 258 note, 264–9 validity 257, 259, 271, 273 variation 169, 177 table, 184, 186, 198–9, 235 expertise 17–18, 25, 55, 133 expertise defense (philosophy) 270, 273 experts 2, 54, 56, 85, 112, 233, 246, 259–60, 268, 272–3 extra-grammatical factors 233–4, 236, 240, 242, 245, 247, 252–3, 261 note, 263 note extraneous factors see experimental method: extraneous factors Featherston, Sam 6–7, 9, 31, 167–9, 177, 187, 223, 233–4, 237–8, 244–6, 245 table, 251, 253–4, 256–7, 259, 261–2 Fedorenko, Evelina 142, 223, 235, 257, 259, 270 note Fernández, Eva M. 38, 48 Ferreira V. 48, 223, 257 Fitzgerald, Gareth 51 note, 52 note, 59 note, 75, 95, 135–6

Fodor, Janet 96 Fodor, Jerry A. 15, 37–8, 40 note, 41 formal linguistics 1, 7, 77, 84, 151, see also generative linguistics; Chomskyan linguistics frequency (of linguistic usage) 6, 78–81, 152, 156–7, 161–2, 222, 227–8, 238, 249, 259, 261–2 fruitfulness (methodological) 3, 131–2, 141 Gagliardi, A. 43 Galilean linguistics 265 garden-path sentences/effect 47–8, 99–100, 136, 196, 205, 209, 240, 246–7, 259–60 Garnsey, Susan 48 generative linguistics 7, 36, 69–73, 75–85, 89, 129–30, 132, 237 note, 243, 261, see also formal linguistics; Chomskyan linguistics generative grammar 56, 149–51, 173 generative grammarians 56, 149–50, 163 generative syntax 69, 92, 129, 132, 144, 215–17 transformational grammar 161 Gibson, Edward 142, 194 note, 214 note, 223, 235, 238–40, 244, 247–8, 256–7, 259, 266, 270 note Gleitman, Henry 130 Gleitman, Lila 130 gradience: in grammar 1, 8, 233–4, 245–7, 249–54, 257, 262–3 in judgments 1, 6, 8–9, 237–8, 245–6, 247 figure, 249 figure, 251, 252 figure, 254, 257, 261–4, 269, 272 the puzzle of 9, 233–4, 252–4 theories of 220 grammar 3–4, 6, 8–9, 36, 44, 63 note, 86, 116, 137, 156, 160–2, 175, 190, 194, 207 note, 218, 235, 239, 252, 265, see also gradience: in grammar; mentalist view; rules; universal grammar autonomy/independence of, thesis 104 building/constructing/development 7, 165–6, 168, 178–80, 186–7, 262–4 categorical 245, 253–4, 261–2 constraints 15–16, 18, 20, 23–4, 27, 38, 104, 162, 177, 182, 185 grammatical illusions 247–8 I-/as related to I-language 35, 42, 46, 159 models of 161, 167, 175, 182, 187, 237, 242, 253, 265 reasoning about 233, 236, 238, 240–4 probabilistic accounts of 238 psychological conception of 59

 speaker’s theory of 73, 84, 133 theories of grammar 2, 56, 69, 73, 79, 84–5, 104, 133, 149–51, 159, 163–4, 173, 255, 259–61, 263–4, 273 grammatical competence see competence grammaticality 2, 5, 7, 16, 18–19, 21, 24–6, 27 note, 28–9, 53–4, 70, 72–3, 77, 80 note, 82 table, 90, 102–3, 105–7, 130, 144, 173–4, 177, 199, 233–8, 240–8, 250, 253, 257 grammaticality judgment 16, 18, 25–6, 27 note, 69 note, 70, 72–3, 79, 84, 102, 138–9, 141, 196, 235, 237 note, 246, 260–5 Gross, Steven 2–4, 14, 16–21, 27, 51, 52 note, 57, 59–60, 65–8, 69 note, 72 note, 73–4, 95, 119 note, 124, 131, 135–6, 139–40, 190, 243, 254 Halle, M. 38, 43 Harris, Zelig 130 I-language 16, 33–5, 42, 63, 159 indefinite DPs 96–8 indexicals 37 introspection 8, 22, 25, 46, 123, 136, 149–58, 161–9, 218–19, 234, 236 introspective data 149–58, 161–4 intuitions/intuitive judgments 25, 35–6, 38, 45, 51, 53, 55–7, 59, 62, 64, 66, 69–77, 79–80, 83–4, 89, 93, 101, 103, 139, 236, 273, see also acceptability judgments; experimental method; gradience; grammaticality judgment; reliability (of intuitions) a priori 31, 34 note, 57 note, 129, 145, 268 note armchair judgments 1, 7, 9, 114, 166–8, 172, 175, 177–80, 186–7 as “case judgments” 268 cause of see Modest Explanation; Voice of Competence Chomskyan view of see Chomskyan linguistics contrasted with linguistic usage 52, 58 about co-reference 28 etiology of 1–4, 14–19, 21, 23, 26–30, 34, 53, 55, 67, 72, 80, 131–7, 139–40 as evidence 1–9, 15–16, 18, 26, 40, 73, 77, 255, 264 of folk/laypeople 2–3, 54, 56, 99, 112, 121, 125, 133, 137, 140, 142, 233, 242, 245–6, 259–60, 267–8, 270–1, 273 informational content of/message of 2–5, 28, 36–7, 39, 46–9, 52, 57, 62, 64, 66, 70, 72–4, 76–7, 80, 84


of linguists 1, 6–8, 13–14, 17–18, 21, 25, 34, 52, 112, 133, 166, 190–1, 194, 240, 258–9, 268 metalinguistic 18, 22, 25–6, 28, 52–3, 58–60, 65, 67, 95, 98–9, 112, 134–9, 137–42 nature of 5, 9, 52, 58–9, 98–9, 110, 115 non-evidential role of 6, 142–5 perceptual view on 5, 8, 109, 113–14 of philosophers 1, 270 philosophical 9, 129 note, 269 about reference (semantic) 57, 121, 126, 268, 271 semantic/about meaning 1, 5, 33, 89, 94–5, 100–1, 103, 109, 111, 115, 120, 138 source ambiguity problem 238–9, 253 syntactic 1, 5–6, 58, 65, 72, 76, 78–80, 84–6, 89, 93–4, 96, 101–4, 105 note, 109 note, 115 note, 118 theory ladenness of 15, 17–18, 36, 53–5, 59 note, 241 in thought experiments 272 validity 9, 17 note, 178, 215–17, 221–2, 228–9, 232 variation of 125–6, 149, 156, 221, 268–9, 272 and the visual analogy 40, 58, 62, 114–15, 124 islands 93, 96–7, 101, 105–7, 182, 186, 230–1, 231 figure Jackendoff, Ray 40 note justification (of intuitions) 1–8, 22, 69, 71, 83, 108, 110, 113–15, 117–19, 122–5, 127, 131–2, 134, 166, 179, 241, 268–9 reliabilist strategy 5, 109–10, 112, 115–16, 118–22, 124–6 Jutronić, Dunja 51 note, 53 note Labov, William 13, 130, 142, 159, 168 language acquisition 14–16, 217, 222 language processing 27 note, 47, 63, 149, 214, 217–18, 228 difficulty/problem 23, 238, 240 error signals 3, 13–14, 18–19, 21–7, 29–31, 57–8, 66–7, 136, 218, 229 language system 47, 57–8, 61, 65–6, 116, 118 memory limitations 2, 18, 30, 65, 70, 73, 75, 80 note, 218, 236, 239, 245, 247–8, 259, 265–6 monitoring mechanisms/processes 3, 14, 19–26, 29–30, 60, 66, 110, 115–18, 136 phenomenology of 66 central processor 3–4, 36, 38–46, 52–3, 55, 57–8, 61–6, 74, 76, 204, 209–10 rules 76 and structural descriptions 3–4, 17, 21, 38–43, 45–47, 49, 57–8, 60–6 subcentral processing 52, 65–6




language production 2, 4–5, 20, 37–8, 56, 60, 108–10, 116–18, 125, 134–5, 137, 158, 182, 222, 237–8 late merge 102 Lidz, Jeffrey 43 Likert scales 189, 226 figure, 261–2 linear optimality theory see optimality theory linguistic competence see competence; Voice of Competence linguistic data 3, 5, 7–8, 53 note, 177, 195 linguistic usage 30, 48 note, 52, 56–7 note, 59, 66–7, 137, 141–2, 144, 150, 157, 160, 243 contrasted with linguistic intuitions 52, 58 literacy 136, 138–9 Ludlow, Peter 51 note, 52 note, 70, 95–6 Machery, Edouard 9, 109, 112, 121, 125–6, 241, 255, 267–73 magnitude estimation 167, 172–6, 176 table, 176 figure, 180, 180 table, 181 table, 181 figure, 189, 224, 224 figure, 226, 226 figure, 261–2 Marr, D. 40 note Matthews, Robert J. 14, 51 note Maynes, Jeffrey 2–3, 16–21, 27, 59, 74, 124, 135–6, 144 note McKinsey, Michael 57 note McLeod, Peter 64 meaning 5, 33, 44 note, 47–8, 69 note, 103, 109–26, 150, 169, 174, 185, 196, 201–7, 211, 221, 272 memory limitations see language processing mentalist view/psychological account/mentalist conception of grammar 14–19, 25, 59, 70–2, 75, 77, 81–5, 82 table, 130, 134, 238 grammar and logicese 106 internal grammar 35 mental grammar 3–4, 70, 73, 79, 83, 85, 149 non-mentalist view of 72, 75 metaphilosophy 255 method of cases 121, 255, 267–8, 271–2 minimal pair 176, 179 note, 197 note, 199, 200, 202, 203 note, 207, 213, 219, 256 Miščević, Nenad 51 note, 53 missing-VP effect 238, 247 Modest Explanation (ME) 4, 51, 53–7, 59–60, 62, 64, 66, 68 module/modular 15–17, 19 note, 23, 25, 31, 36, 38, 39 note, 40 note, 42–3, 45, 52, 56, 64, 65 note, 73, 102, 116, 237 Momma, Shota 38 note monitoring mechanisms/processes see language processing: monitoring mechanisms

naturalized 37 non-conceptual content 62–3 non-conceptual structural descriptions (NCSD) 4, 41–2, 45–6, 49, 62–4 non-mentalist conception of linguistics 72, 75 Occamist considerations 56, 58 optimality theory (grammar) 26, 253, 263 note paleontologist’s intuitions 54, 58 parser/parsing 2, 4, 8, 17–18, 20–1, 23–4, 27 note, 29, 36, 38–9, 45, 47–8, 59, 63 note, 64, 74, 159, 173, 190, 193 note, 195, 204–6, 208–13, 235, 257 Peacocke, Christopher 24, 41–2 perception 8, 20, 24, 28 note, 31, 33, 35, 37–8, 40–3, 46–9, 54–6, 58, 62–4, 109, 113–15, 123–4, 136, 218–19, 268 note, see also intuitions: perceptual view on blindsight 123 metalinguistic 57, 60 perceptual judgments 24, 53–5, 58, 109, 115 perceptual processing 40, 46 Pereplyotchik, David 29 note, 60 note performance factors 1, 9, 70, 72, 75, 77, 80–1, 83–5, 90, 235, 241, 245, 247, 253–4, 259–60, 262, 263 note, 264–5, 273 phenomenology 21, 42, 60, 66, 112, 114–15, 119, 121, 124, 267 presentational 113 Phillips, Colin 38 note, 166 note, 182, 218, 222, 233, 247–8, 258, 261, 264, 266 note philosophy of language 5, 109–11, 120–1, 267 phonology/phonological 30, 33, 37–8, 40, 45, 47–9, 62, 116–17, 139, 141, 161, 218, 221, 228 Pickering, Martin 20, 48, 117, 223 Pietroski, Paul 21, 30, 51 note, 70, 109 note, 115 note pragmatics 5, 29 note, 91, 109, 120, 218, 228 experimental see experimental method/ approach: pragmatics probabilistic grammar see grammar: probabilistic accounts of processing see language processing production see language production psychological conception of grammar see grammar: psychological conception of psychology 2, 5, 14, 16, 31, 41, 43, 64, 130, 139–40, 169, 223, 227, 232, 256, 260 note, 263 Pylyshyn, Z. 40 note

quantitative methods see statistical analyses Quine, Willard Van Orman 33, 34 note, 143 Rattan, Gurpreet 51 note Recanati, François 91, 111, 120 reference (semantic) 57, 121, 126, 245, 268, 271 causal theory of 267–8 descriptive theory of 268 speaker’s reference 271 Reisberg, Daniel 58 note reliabilist strategy see justification (of intuitions) reliability (of intuitions/intuitive judgments) 5, 9, 15, 17, 25, 33, 38, 52, 57, 66, 77–9, 83–4, 109–10, 112, 113 note, 116–18, 119 note, 121–7, 131–2, 134, 141–2, 150, 161, 172, 175, 202, 214 note, 216, 222–5, 229, 232, 256–8, 264, 266, 268 note, 269–70, 271 representation/representational content 15 note, 16, 20, 22–4, 34, 38–41, 43–5, 47, 48 note, 52, 57, 60, 64–7, 70–1, 74, 76, 117, 218 metalinguistic 55 Rey, Georges 3–4, 19 note, 27 note, 33 note, 34 note, 43 note, 51–3, 55, 57–68, 69–70, 74 note, 83, 110, 118, 124–5, 135–6 rules (grammatical) 34, 38–9, 42, 44, 63 note, 64, 69–72, 74, 76, 154, 178, 180, 186–7, see also language processing embodied 15–16, 24, 27, 30, 34, 39, 55–6, 70, 72, 76–7, 82 table, 83, 86 mental 74, 86 structure 72, 75–7, 81–3, 82 table, 86 violation of 18, 21, 24, 25, 181, 187, 195, 231, 233 note, 237 scale bias see bias: scale scale effect (in experiments) 247, 251–4 Schütze, Carson T. 6–8, 14, 16, 32, 72 note, 73, 95, 133, 142, 149, 165–7, 217, 219, 233, 235, 237, 241, 243, 255–6, 259, 262–4 scientific evidence 129–31, 133–4, 137, 142, 145 semantics 5, 29 note, 33, 48, 89–93, 95, 98, 100–1, 103–4, 107, 111, 120, 129–30, 132, 144–5, 218, 221, 228, 257, 272 note, see also reference (semantic); intuitions: semantic theory 5, 89, 92, 108 vacuous quantification 93, 105–7 experimental see experimental method: semantics sensitivity 9, 49, 62, 123, 216, 222, 226–7, 257, 262, 269


sluicing 154 Smith, Barry C. 51 note Smith, Neil 2, 37 note, 59 source ambiguity problem see acceptability judgments speaker’s reference see reference (semantic): speaker’s reference Sprouse, Jon 7–8, 14, 95, 105 note, 143, 165, 167, 168 figure standard linguistic entities (SLE) 38, 43 note statistical analyses/methods/test techniques 6, 13, 78, 134, 156, 167, 201, 230–1, 234–5, 255–6, 266, 269, see also experimental method Stich, Stephen P. 9, 57 note, 112 note, 121, 241, 255, 268–9, 271–2 structural descriptions see language processing structure rules see rules: structure syntactic intuitions see intuitions/intuitive judgements: syntactic syntacticians 7–8, 17, 124, 130, 140, 143–4, 157, 179, 182, 184, 194 note, 215–22, 225–6, 228–32, 233–4, 244, 255, 260, 268–9 syntactic theory 1, 5–6, 33, 89–90, 92, 108, 133, 140–1, 144, 151, 163, 179, 193, 202, 215, 217–18, 220–1, 229–30, 232, 260 syntax 6, 8–9, 14, 33, 37, 39, 41, 47–8, 63, 65, 69, 89–92, 95, 100, 103–7, 129–32, 140, 144–5, 151, 163, 165–6, 172–3, 185–6, 215, 217–21, 223, 225, 228–30, 255–6, 258, 264, 266, 269–70 Textor, Mark 51, 53, 59 note, 69–70, 109 note, 115 note theoretical bias see bias: theoretical theories of syntax see syntactic theories thought experiments see method of cases traditional method 9, 125, 172, 256–8, 266, 268–70, 272–3, see also armchair methods universal grammar 92, 159 understanding test 54–5, 59 usage data/facts 56–9, 66–7, 141–2, 160 usage-oriented linguistics 150 vacuous quantification see semantics validity see intuitions/intuitive judgments: validity; experimental method/approach: validity variation in intuitions see intuitions/intuitive judgments: variation of variation (experimental control) see experimental method/approach: variation




Vigliocco, G. 44 note Vinson, D. 44 note visual analogy 40, 58, 62 Voice of Competence (VoC) 2–4, 15–19, 23, 25–6, 28–31, 33–6, 41, 45–6, 49, 51–3, 55–68, 69–84, 82 table, 118 note, 134–41 “non-standard version” 70, 76 “standard version” 70, 76, 82

Wasow, Thomas 56, 130, 142, 223, 236 note, 241, 245, 256–7 well-formedness 1, 69, 93, 104, 158, 169, 173, 174 figure, 175–7, 180, 182–6, 219, 245 figure cardinal 169–70, 169 figure, 170 figure, 171 figure, 183, 185–7 perceived 182, 262 wh-movement 105, 107, 153