Language Production and Interpretation: Linguistics Meets Cognition
Current Research in the Semantics/Pragmatics Interface

Series Editors
Klaus von Heusinger
Ken Turner

Editorial Board
Nicholas Asher, Université Paul Sabatier, France
Johan van der Auwera, University of Antwerp, Belgium
Betty Birner, Northern Illinois University, USA
Claudia Casadio, Università degli studi G. d’Annunzio Chieti Pescara, Italy
Ariel Cohen, Ben Gurion University, Israel
Marcelo Dascal, Tel Aviv University, Israel
Paul Dekker, University of Amsterdam, the Netherlands
Regine Eckardt, University of Göttingen, Germany
Markus Egg, Humboldt University Berlin, Germany
Donka Farkas, University of California, Santa Cruz, USA
Bruce Fraser, Boston University, USA
Thorstein Fretheim, Norwegian University of Science and Technology, Norway
Brendan Gillon, McGill University, Canada
Jeroen Groenendijk, University of Amsterdam, the Netherlands
Yueguo Gu, Chinese Academy of Social Sciences, PRC
Larry Horn, Yale University, USA
Yan Huang, University of Auckland, New Zealand
Asa Kasher, Tel Aviv University, Israel
Manfred Krifka, Humboldt University, Germany
Susumu Kubo, Matsuyama University, Japan
Chungmin Lee, Seoul National University, South Korea
Stephen Levinson, Max Planck Institute for Psycholinguistics, the Netherlands
Claudia Maienborn, University of Tübingen, Germany
Tony McEnery, Lancaster University, UK
Alice ter Meulen, University of Geneva, Switzerland
François Nemo, University of Orléans, France
Peter Pelyvas, University of Debrecen, Hungary
Jaroslav Peregrin, Czech Academy of Sciences and University of Hradec Králové, Czech Republic
Allan Ramsay, University of Manchester, UK
Rob van der Sandt, Radboud University Nijmegen, the Netherlands
Kjell Johan Sæbø, University of Oslo, Norway
Robert Stalnaker, Massachusetts Institute of Technology, USA
Martin Stokhof, University of Amsterdam, the Netherlands
Gregory Ward, Northwestern University, USA
Henk Zeevat, University of Amsterdam, the Netherlands
Thomas Ede Zimmermann, University of Frankfurt, Germany
VOLUME 30
The titles published in this series are listed at brill.com/crispi
Language Production and Interpretation: Linguistics Meets Cognition

By Henk Zeevat
LEIDEN • BOSTON 2014
Library of Congress Cataloging-in-Publication Data

Zeevat, Henk, 1952–
Language production and interpretation : linguistics meets cognition / By Henk Zeevat.
pages cm. – (Current Research in the Semantics/Pragmatics Interface ; Volume 30)
Includes bibliographical references.
ISBN 978-90-04-25289-9 (hardback : alk. paper) – ISBN 978-90-04-25290-5 (e-book)
1. Computational linguistics. 2. Psycholinguistics. 3. Cognition. 4. Semantics. 5. Pragmatics. 6. Grammar, Comparative and general–Syntax. I. Title.
P98.Z33 2014
006.3'5–dc23
2013043773
This publication has been typeset in the multilingual “Brill” typeface. With over 5,100 characters covering Latin, IPA, Greek, and Cyrillic, this typeface is especially suitable for use in the humanities. For more information, please see www.brill.com/brill-typeface. ISSN 1472-7870 ISBN 978-90-04-25289-9 (hardback) ISBN 978-90-04-25290-5 (e-book) Copyright 2014 by Koninklijke Brill NV, Leiden, The Netherlands. Koninklijke Brill NV incorporates the imprints Brill, Brill Nijhoff, Global Oriental and Hotei Publishing. All rights reserved. No part of this publication may be reproduced, translated, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission from the publisher. Authorization to photocopy items for internal or personal use is granted by Koninklijke Brill NV provided that the appropriate fees are paid directly to The Copyright Clearance Center, 222 Rosewood Drive, Suite 910, Danvers, MA 01923, USA. Fees are subject to change. This book is printed on acid-free paper.
CONTENTS
Acknowledgments  ix
Preface  xiii

1 Introduction  1
  1.1 Aristotelian Competence Grammars  2
    1.1.1 Against ACG: Ambiguity  5
    1.1.2 Against ACG: Time Complexity  8
    1.1.3 Against ACG: The Gap between Production and Interpretation  9
  1.2 Production Grammar  11
    1.2.1 The Primacy of Production  12
  1.3 Strategies for Coordination  17
  1.4 Bayesian Interpretation  19
    1.4.1 Simulated Production in Interpretation  24
    1.4.2 Mirror Neurons  26
  1.5 Conclusion  27
  1.6 The Other Chapters  32

2 Syntax  37
  2.1 Optimality Theory  39
    2.1.1 Reversing Production  42
  2.2 Optimality-Theoretic Syntax  45
    2.2.1 Optimality-Theoretic Syntax for Word Order in Dutch  45
    2.2.2 Provisional German  51
    2.2.3 Provisional English  52
  2.3 The Production Algorithm  55
    2.3.1 Procedural Interpretation of the Constraints  57
  2.4 Higher Level Generation  61
  2.5 Other Issues  64
    2.5.1 More Dutch  64
    2.5.2 A Worked Example  66
    2.5.3 Incremental Syntax Checking in Interpretation  67
    2.5.4 Quantification  69
  2.6 Conclusion  71

3 Self-Monitoring  75
  3.1 Optional Discourse Markers  77
    3.1.1 General Self-Monitoring  80
  3.2 Word Order Freezing  85
  3.3 Pronouns and Ellipsis  90
  3.4 Differential Case Marking  93
  3.5 A Case for Phonological Self-Monitoring?  96
  3.6 Conclusion  100

4 Interpretation  105
  4.1 The Interpretation Algorithm  109
  4.2 Vision and Pragmatics  120
    4.2.1 Vision  120
    4.2.2 Other Cues  121
    4.2.3 Pragmatics  123
    4.2.4 Clark Buys Some Nails  131
    4.2.5 Scalar Implicatures  133
    4.2.6 Relevance Implicatures  134
  4.3 Conclusion  135

5 Mental Representation  141
  5.1 From Links to Representation Structures  142
  5.2 Logic  148
    5.2.1 Logical Operators  149
  5.3 Mental Representations in Philosophy  152
  5.4 Belief  158
  5.5 Definiteness  165
  5.6 Comparison with Discourse Semantics  177
    5.6.1 From Contexts into Discourse Representation Theory  180
  5.7 Conclusion  182

6 Final Remarks  185
  6.1 Rounding Off  185
  6.2 Computational Linguistics  186
  6.3 Pragmatics  193
  6.4 Semantic Compositionality  196
  6.5 LFG 3.0 and PrOT 2.0  197
  6.6 Language Evolution  202
  6.7 Conceptual Glue  204

Bibliography  209
Index  217
ACKNOWLEDGMENTS
It is arbitrary where to start the history of this book. I place it in a conversation with Jo Calder in 1987. The question came up why we did not do our parsing with the deterministic techniques we were using for pronoun resolution in order to gain speed, an impressive time difference on the computers of that time. Jo told me to keep quiet about this idea. The time was obviously not right. I certainly did not know how to defend the idea apart from pointing to its efficiency. I just hope that now, the time is right.

Important at that same time were talks by and discussions with Robert Dale on natural language generation, which convinced me that pronouns should be looked at from the production perspective in order to get a comprehensible theoretical picture. Maybe all of grammar should be like that. Later I wrote a paper on idiomatic expressions to make that point: there is production blocking of non-idiomatic alternatives, but no interpretational blocking from idiom (Zeevat, 1995). Years later, in the 1990s, Jerry Hobbs, during a meal of potato chips in London (Ontario), convinced me that just production was not good enough, since there are cases, even with pronouns, that essentially need the interpretation perspective. An example would be the impossibility of “John and Bill had a meal. He had spaghetti.”1

Things really got going with optimality theory (OT) and especially with Reinhard Blutner’s ideas on bidirectionality and a talk by Paul Smolensky about the dangers of naive bidirectionality in optimality theory. Reinhard provided a good way to think about these things (and this book still consistently uses his notion of bidirectionality as a filter) and Smolensky gave the challenges that a proper theory of bidirectionality has to face. In Zeevat (2000), I developed a notion of optimality theoretic pragmatics where the pragmatics had essentially the function of meeting these challenges. It took me years to realise that this was indeed an interesting way to do pragmatics and not a quaint artefact. The realisation came when teaching a course in Berlin on rhetorical relations. Such relations suddenly made sense in this pragmatic approach. Very helpful was a discussion with Paul Smolensky about syntactic rat-rad cases which led to a precise formulation of the puzzle.
1 A monitoring effect, in the theory of this book. See chapter 3.
The new ideas of Uwe Reyle, Hans Kamp, and Antje Rossdeutscher about combining DRSs by generalised presupposition resolution (Reyle et al., 2007) were also important, especially the discussions with Uwe about applying this approach to all of semantic combination. The minor deviations from DRT orthodoxy in Chapter 5 Mental Representation are natural and well-motivated. That chapter’s formalism emerged from working on Chapter 4 Interpretation with the aim of deriving normal DRSs, which can however only be done by way of the proposed formalism.

Finally, in late 2008, Remko Scha pointed out the similarities between my pragmatics and Bayesian interpretation, and the present book basically develops this view. It explains why it is a good idea to do syntactic parsing in the way one does pronoun resolution, why there is proper blocking only in production, why interpretation needs to be bidirectional and has to include simulated production, and why the production direction is the most fruitful one for finding linguistic generalisations. And it clarifies what the pragmatic hearer is after: the most probable interpretation of the utterance. Speaker adaptation to this strategy is then a necessary assumption needed to explain why normally coordination is reached. This gives bidirectionality also in speaking: the speaker needs to make sure—within the expressive possibilities of the language—that the hearer will get her right, by running a simulated interpretation. Bayesian interpretation is shared not only with accounts of perception but also with interpretation by abduction, so that the whole enterprise ends up as a more radical reconstruction of Hobbs’s abductive pragmatics.

The above describes the intellectual history of this book, and thanks to all of you for the wonderful discussions. Reinhard deserves special thanks for disagreeing with me all the way, thus forcing me to keep thinking and improving. And Remko as well for extended comments on the whole book and continuous encouragement. Many thanks to all who commented on earlier versions of this work, especially David Beaver, Hanyung Lee, Scott Grimm, Gerd Jäger, Hans-Martin Gärtner, Helen de Hoop, Petra Hendriks, Lotte Hogeweg, Edgar Onea, and Cathrine Fabricius-Hansen. Many thanks to Edgar Onea for inviting me to give a course on the unfinished material at the Courant Centre in Göttingen and to the audience on this occasion for their helpful questions. Many thanks to Anna Pilatova for being a wonderful editor in the last phase of writing. Many thanks to two anonymous reviewers and the series editor Klaus von Heusinger for many very useful suggestions.
Many thanks to the Centre for Advanced Study in Oslo for their generous support during the year 2010–2011 when I was a fellow at the project Meaning and Understanding across Languages led by Cathrine Fabricius-Hansen and where the first drafts of what became this book were written.
PREFACE
When as a young natural language semanticist with a philosophical background I started working in computational linguistics and artificial intelligence, there were only small adaptation problems. Typically, my new colleagues knew everything they needed to know or felt they should know. Current academic culture is different: people work hard at rather small aspects of language-related problems and pay much less attention to neighbouring disciplines.

This book is unusual in trying to address the broader picture—the whole of language production and interpretation—and in taking only a very loose disciplinary allegiance. This follows from my conviction that any real contribution to the philosophy of language, natural language semantics and pragmatics, theoretical linguistics, historical linguistics, typology, computational linguistics, artificial intelligence, cognitive science, or psychology of language that bears on any of the other listed fields should be interpretable in those other fields and be consistent with any proper insights obtained in the fields it is relevant to. And any theorising on language from one perspective that is interpretable in another field should respect the main results of the other field.

To use an example close to my research interests, a semantic account of a particle should be consistent with theories about how and why particles are recruited by natural languages, be in harmony with the typological study of particles, allow a computational treatment in which its contribution to meaning in the context can be computed, and yield a syntactic treatment in which its insertion and the position of its insertion can be explained. The theory of production and interpretation pursued in this book should be good semantics, pragmatics, and grammar, but it should also be sound psychological speculation, and be consistent with philosophical accounts of meaning, use, and thought. It should be relevant to implementation if not downright implementable, consistent with theories about comparable problems in AI (e.g., computer vision), and take into account historical linguistics and typology (especially in semantics and syntax, it is still not common enough to ask how the postulated meanings can have emerged in the evolution of a language or how well they fit into a typological perspective).

Needless to say, these are ambitions I can live up to only to a limited degree, being a philosopher and semanticist by training, a computational
linguist, syntactician, and AI person by later experience, and having even weaker acquaintance relations with cognitive science, historical linguistics, typology, and psychology. But I have done my best and hope that the book may help readers from any of the above-mentioned fields to broaden their perspective on the two central processes in language (production and interpretation) and help convince them of the need to focus on what should be the core issue, namely, how coordination on meaning is reached in verbal communication.

This book is an attempt to give a comprehensive account of the coordination problem for language, i.e. to offer an explanation of the fact that human verbal communication normally leads to coordination on meaning and that it does so effortlessly and in very little time. In functioning verbal communication, a speaker intends to achieve some effect on the hearer, and chooses and articulates an utterance with the aim of achieving that effect for the benefit of a hearer who interprets it. The hearer’s interpretation contains a hypothesis about what the speaker wanted to achieve, and communication succeeds if the hearer’s hypothesis matches what the speaker intended. Human verbal communication normally works, which is surprising given the systematic underdetermination of meaning by form in natural languages. All attempts to give formal models of aspects of language seem to have contributed to further establishing this underdetermination. This holds for all reasonable models of speech, morphology, syntax, lexicology, semantics, pragmatics, pronoun resolution, rhetorical structure, and information structure. If the hearer were to pick an arbitrary ‘legal’ meaning for a form and the speaker an arbitrary correct form for a meaning, parity would be a rare exception indeed. Just how rare depends on the average number of possible interpretations per utterance. If this were close to 1 for a standard utterance, say 1.2, the probability of full coordination would be 1/1.2 ≈ 0.83, and that would approach human performance. But the combined fields mentioned above lead to far higher degrees of ambiguity.

So how do speakers and hearers manage? For the hearer, there clearly exists a best strategy: she must not make an arbitrary choice but rather choose the interpretation that is most probable given the utterance, the context, and everything else she knows. Any other strategy would increase the likelihood of her being wrong. This strategy has a correlate for the speaker: she must choose an utterance for which the most probable interpretation is the one she intends. If both parties can execute these strategies, communication will work. It is difficult to imagine another pair of strategies that would result in coordination for languages like our own, especially if it also must have naturally emerged in the history of communication. The hearer
strategy is the best independently of any speaker strategy. It is simply the rational strategy to employ in any interpretation process, linguistic or other. The speaker strategy is then just a rational adaptation to the rational hearer strategy, the best way to increase her chances to get what she wants in making her utterance. Specifying the strategy on this level of generality is, however, only a small part of solving the problem. How does the hearer know what is the most probable interpretation and how does he manage to find a set of interpretations that will contain it? And how does the speaker notice that a particular form does not have the intended meaning as its most probable interpretation and how does she select an utterance for which the hearer will select the intended reading? The chapters 2–5 try to give satisfactory answers to these questions. In all four cases, the answer comes in the form of a formalism that is implementable. The formalism of Chapter 2 Syntax is a grammar that can be implemented as a quasi-deterministic map from semantic structure to utterances in a particular language. The formalism in Chapter 3 Self-Monitoring is a filter on this map defined in terms of the prior probabilities assumed in Chapter 4 Interpretation. That chapter defines an algorithm for hierarchical cue selection and integration including a syntactic filter defined in terms of the grammar of Chapter 2 Syntax. The algorithm—comparable to algorithms in computer vision—can be interpreted from a linguistic perspective as an incremental parser for a stochastic free categorial grammar. It defines an update (or downdate) of the given linguistic and non-linguistic context from the cues provided by an input utterance. Chapter 5 Mental Representation provides the foundations of the formalism that represents contexts and contextual updates. By defining formalisms for grammar and semantic representation, the book is about classical formal grammar for natural languages. As such the full account should meet the normal requirements of explaining which strings of words are well-formed and what they mean, independently of whether these data are obtained as speaker intuitions, from a corpus, by elicitation or by comprehension experiments. The formalisms are however designed with a view of capturing human language production and interpretation and human representation of information. That means that they can also be refuted, confirmed and refined by empirical data about the psychology of these processes and about their neural implementation. The proposal should also be interpretable as a programme for further development of systems with linguistic abilities that are comparable to our own.
chapter one INTRODUCTION
The aim of this book is to give an account of production and interpretation of natural language utterances that would be linguistically, psychologically, and computationally plausible. The central goal will be to explain the fact that human speakers trust they will reach coordination with their hearer on the intentions that motivated them to speak. This trust is compatible only with success rates for coordination in standard conversation well over 0.9: failure should be the exception—while being quite possible.

In this chapter, an argument will be presented for a theory of production and interpretation where the role of competence grammar is restricted to one aspect of production, the linguistic rules that constrain the mapping from a speaker intention to its verbal expression. Interpretation is understood as analogous to interpretation in computer vision, a mapping from the utterance to the interpretation that maximises the product of its prior probability in the context and the probability that it will be produced given the interpretation, a probability defined by the competence grammar. By Bayes’ theorem, that interpretation is the most probable interpretation of the utterance.
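Stated compactly (a standard way of writing this out, not the book’s own notation), with u the utterance, i a candidate interpretation, and c the context:

\[
i^{*} \;=\; \operatorname*{arg\,max}_{i} P(i \mid u, c)
      \;=\; \operatorname*{arg\,max}_{i} \frac{P(u \mid i, c)\, P(i \mid c)}{P(u \mid c)}
      \;=\; \operatorname*{arg\,max}_{i} P(u \mid i, c)\, P(i \mid c)
\]

Here P(i | c) is the prior probability of the interpretation in the context, P(u | i, c) is the probability, constrained by the production grammar, that the interpretation would be expressed by this utterance, and the denominator can be dropped because it does not depend on i.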
If, as speakers seem to assume and as is a necessary assumption for the evolutionary emergence of natural languages, coordination between speaker and hearer is standardly achieved, it is not sufficient that the hearer follows a strategy of probability maximisation in the face of the massive underdetermination of meaning by form in natural languages.1 It follows that for coordination, one needs to assume that speakers have adapted to the hearer strategy and design their utterances so that their most probable interpretation is in fact the interpretation the speaker intends.

The arguments against an account of interpretation based on symbolic Aristotelian Competence Grammars as defined in section 1.1 are that they cannot help in explaining coordination on meaning (section 1.1.1), that the

1 This property sets them apart from the formal languages in logic and computer science that have been designed in such a way that the form completely determines the meaning in terms of model theory or the process that should be executed.
parsing algorithms to which they give rise do not have the linear time complexity that seems characteristic of human performance (section 1.1.2) and that they would predict that whatever a speaker can understand, she could also produce, a prediction in conflict with the production-comprehension gap found in empirical studies (section 1.1.3). In favour of the Bayesian account of interpretation, three arguments are given. The first are the arguments for an architecture of a grammar in which the grammar maps meanings to forms: linguistic generalisations are better captured when the grammar tries to map meanings on forms and can employ the Elsewhere Principle to explain blocking (section 1.2.1). Bayesian interpretation can immediately use such a grammar and does not need a similar rule-based interpretation grammar. Further, there is considerable evidence for simulated production in human language interpretation processes (section 1.4.1). Mirror neurons can also be interpreted as achieving a simulation of motor movements as a component of Bayesian interpretation processes and thus as embodying a successful perceptual strategy that would also be employed in human language understanding (section 1.4.2). Bayesian interpretation can also be read into Grice’s concept of nonnatural meaning and in Liberman’s motor theory of speech perception. In fact, the intuition behind these theories is very much the appeal of Bayesian interpretation. The connection with Grice also makes it clear that Bayesian interpretation is closely connected with pragmatics. We will in fact argue with Hobbs et al. (1990) that pragmatic interpretation is a side effect of interpretation as finding the best explanation of the utterance as in Hobbs’ interpretation by abduction or in Bayesian interpretation. 1.1. Aristotelian Competence Grammars The dominant view in linguistics is that an account of a given language should be given as an Aristotelian Competence Grammar (ACG) of that language. A competence grammar would be an account of linguistic competence only, i.e., its aim would be to explain the linguistic knowledge a competent speaker of a language applies in producing and interpreting utterances and it would not concern itself with processes involved in language use. An Aristotelian grammar for a particular language is a grammar which characterises the language as a relation between the expressions of the language and their meanings.2 2
On the other hand, it would be an advantage for linguistic grammars if they were able to make direct predictions about observable behaviour such as human utterance production and interpretation because if that were so, their empirical basis would reach beyond the standard acceptability judgements and semantic intuitions about utterances that—according to some—provide direct access to linguistic competence.3 If grammars produce generalisations about actual language production and interpretation and about the communication process in which they result, they come to share their object of study with related disciplines such as psycholinguistics, artificial intelligence, computational linguistics, and neurolinguistics. That would increase the relevance of linguistics. It is one of the main claims of this book that formal grammar is quite possible but only if it can be interpreted as a contribution to understanding the processes studied by psycholinguistics, neurolinguistics, computational linguistics, and artificial intelligence, and in particular, if it is consistent with established findings in these related disciplines. Moreover, as will be argued in Chapter 3 Self-Monitoring, many formal properties of actual utterances—in some languages even all formal properties of utterances—cannot be understood just on the basis of competence grammar. And finally, an ACG unavoidably determines a theory of production and interpretation. It will be argued below that given the massive ambiguity inherent in the utterances of human languages, this property of natural languages—which must be reflected in ACGs if they are to be empirically adequate—stands in the way of a proper account of communicative coordination based on account of production and interpretation inherent in ACGs. From a logical perspective, an ACG would be an axiomatisation4 of the Aristotelian relation R.5 Production and interpretation can be defined in
2 Calling this kind of grammar Aristotelian can be based on a brief discussion in chapter 16 of De Interpretatione (Minio-Paluello, 1963) where Aristotle discusses the relation between the expressions of individual languages and (universal) mental representations. The notion of competence grammar goes back to Chomsky (1957) and is not attributable to Aristotle. The proposal made in this book does not conflict with Aristotle and arguably the problems noted below apply to the combination of the two notions.
3 As noted below in section 1.4.1, simulation in interpretation could provide an account of acceptability judgements that would challenge their status as a direct window on competence.
4 The axiomatisation would be a theory T such that (1).
(1) T ⊧ R(F, M) iff F is an expression of the language that expresses the meaning M
5 While in natural languages meaning, interpretation, and production are highly context-dependent, this merely leads to another parameter and therefore a three-place relation. A grammar must now characterise a relation R(F, M, C) where F is a form, M a meaning, and C a context. A particular context C still gives an Aristotelian relation R_C(F, M) that the grammar must characterise.
terms of the axiomatisation as algorithms that in the case of production find a form F for a given meaning M such that R(F, M) is provable and in the case of interpretation find a meaning M for a given form F such that R(F, M) is provable. The literature on parsing and generating must then be seen as aiming at finding algorithms which efficiently construct meanings M for forms F or inversely while guaranteeing that R(F, M) is valid under the axiomatisation. And the literature on grammar formalisms must then be interpreted as attempts to provide axiomatisations which optimally support parsing and generation as well as the formulation of linguistic generalisations.

All existing proposals of grammar formalisms can be interpreted as Aristotelian grammars in the sense just indicated—though as we shall conclude later, they need not necessarily be interpreted this way. If one sees deep structure as a kind of semantic representation, early transformational grammar (Chomsky, 1957) is an ACG and the same can be said for all of the subsequent formalisms where the role of deep structure is taken over by the logical form. It holds for functional formalisms such as systemic grammar (Mellish, 1988) and functional grammar (Dik, 1989) but also for all proposals in the tradition of unification grammar, such as Generalised Phrase Structure Grammar (Gazdar et al., 1985), Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994), and Lexical Functional Grammar (Kaplan and Bresnan, 1982). It also holds for Montague Grammar (Montague, 1974) and the many versions of categorial grammar (Steedman and Baldridge, 2011; Geach, 1962; Ajdukiewicz, 1935; Lambek, 1958) as well as for optimality theoretic formalisms such as production optimality theoretic syntax (Grimshaw, 1997) and the by now numerous versions of bidirectional optimality theories (Smolensky, 1996; Blutner, 2000; Beaver and Lee, 2003).

In the next three subsections, three arguments will be given against the tenability of an ACG account of production and interpretation. These arguments will take as their starting points the phenomena of ambiguity (section 1.1.1), time complexity (section 1.1.2), and the gap between production and interpretation (section 1.1.3). The conclusion will be that ACGs cannot be maintained but, at the same time, that a solid case can be made for rule-based production grammars, as an interpretation of the ACGs. The problem
is therefore not so much with the substance of formal grammars but with their interpretation as a direction-neutral characterisation of the relation between forms and meanings. As production grammars, they give important constraints on the production process and on a Bayesian conception of natural language interpretation. But an account of coordination on meaning, proper production, and proper interpretation needs additional resources, over and above grammar. These are the prior probabilities of interpretations in a context, the same probabilities that are also needed in vision and other models of perception, but also a system of hierarchically organised stochastic cues specific to language, which should be comparable to similar systems of cues in computer vision. These additional resources will allow hearers to find and select the most probable interpretation of an utterance in a context and speakers to make sure that the intended meaning of their utterance in fact is the most probable interpretation. Proper production and proper interpretation are therefore essentially dependent on each other.

The arguments against ACG will be followed by a conceptual introduction to the alternative proposal sketched in the last paragraph. Section 1.3 defines the strategies for the speaker and the hearer which lead to coordination: hearers maximise probabilities, while speakers make sure that hearers get the right interpretation that way. Section 1.4 argues for Bayesian interpretation in natural language interpretation as the best way to implement the hearer strategy.

1.1.1. Against ACG: Ambiguity

Aristotelian grammars have a fundamental problem in accounting for production and interpretation: if R(F, M) is not a one-to-one relation and especially if one form can have multiple meanings, Aristotelian grammar cannot account for communication, that is, for the fact that usually, the hearer achieves a correct understanding of what the speaker wanted to say. If linguistic knowledge gives a set of meanings for an utterance with cardinality n and the hearer makes a random choice between the meanings, one predicts that the hearer gets it right with a probability of 1/n. If the average value of n is above 1, verbal communication is unreliable, and for values of 2 and higher, it becomes so unreliable that it would be best to give up on verbal communication altogether. One should note, however, that admission of very low reliability runs counter to the idea that the emergence of verbal communication should be explained as an evolutionary process in which it would have to be selected because of its use to the verbal communicators. If
linguistic communication were unreliable to such an extent, it cannot have been selected, and if it had emerged before it became unreliable, it would have been abandoned.

The claim that the relation R(F, M) assigns many meanings to one form has been disputed by Paul Boersma in a solution to the rat-rad problem. This problem was found by Hale and Reiss (1998) as a counterexample to the bidirectional phonology of Smolensky (1996). In Dutch and German, voiced final obstruents are devoiced. The reverse mapping from surface [t] to underlying /d/ is a faithfulness violation in interpretation, which gives the prediction that [rat] is always interpreted as /rat/. This conflicts with the fact that in Dutch and German [rat] is properly ambiguous. Boersma (2001) proposes to remedy the problem by adding a system of context-based interpretation constraints that can outrank faithfulness. Boersma assumes that speakers learn a set of rules which always selects one of the words /rad/ and /rat/ for the Dutch sound pattern [rat] based on the context. If this were indeed a general approach to ambiguity, ambiguity would be entirely eliminated by grammatical means and one would obtain one-to-one relations R_C(F, M).

The following example was designed to cast doubt on this pattern of explanation for the example case. Its aim is to show that no rules such as Boersma envisages can be learnt, that special assumptions about contexts are required since, under natural ways of understanding the notion of context, disambiguation can go both ways in the same context, and finally, that a non-grammatical explanation of ambiguity resolution is more plausible. The Dutch sentence (or sentences) (2) could be accompanied by pictures of (i) a rat in a hamster wheel or (ii) a hamster wheel within a hamster wheel, by x-rays of (iii) a pregnant rat, (iv) a rat who swallowed a small hamster wheel, or (v) a pregnant rat contained in a hamster wheel, or by (vi) no picture at all. In all six cases, this gives a context of the utterance.
(2)
Kijk, daar is een rad/rat in een rad/rat.
‘Look, there is a wheel/rat in a wheel/rat.’
In the first case, the context forces the resolution rat-wheel, in the second the resolution wheel-wheel, in the third the resolution rat-rat, and in the fourth case, the resolution wheel-rat. The fifth case is ambiguous between rat-rat and rat-wheel, while the sixth case does not offer a resolution: it must be seen as a case of failed communication involving a four-way ambiguous sentence without any clue for resolution.
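In the spirit of the probability maximisation discussed just below, the six cases can be mimicked by giving each picture its own prior plausibilities for the four readings and letting the hearer pick the most plausible one. A toy sketch, with invented numbers standing in for real priors:

```python
readings = ["rat-wheel", "wheel-wheel", "rat-rat", "wheel-rat"]

# Hypothetical prior plausibility of each state of affairs, given the picture.
contexts = {
    "rat in a hamster wheel":  {"rat-wheel": 0.90, "wheel-wheel": 0.02, "rat-rat": 0.04, "wheel-rat": 0.04},
    "wheel inside a wheel":    {"rat-wheel": 0.03, "wheel-wheel": 0.90, "rat-rat": 0.03, "wheel-rat": 0.04},
    "x-ray of a pregnant rat": {"rat-wheel": 0.04, "wheel-wheel": 0.02, "rat-rat": 0.90, "wheel-rat": 0.04},
    "no picture":              {r: 0.25 for r in readings},  # flat prior: nothing to choose by
}

for picture, prior in contexts.items():
    best = max(readings, key=prior.get)
    # With a flat prior the 'choice' is arbitrary: that is the failed case (vi).
    print(f"{picture}: {best} (p = {prior[best]:.2f})")
```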
This example shows that contexts indeed can disambiguate but also that they may fail to do so. It shows that the same context is consistent with different resolutions of the same ambiguity. Moreover, the example contexts are so rare that disambiguation cannot be attributed to learning a rule. In these cases, the disambiguation process should rather be seen as a kind of inference where the appropriate resolutions provide a way in which the sentence can make sense in the context. And this seems to be a general pattern: ambiguities are resolved by discarding the readings that do not fit into the context or readings that fit there less well. This latter notion can be understood as probability maximisation whereby bad fit means low probability. The best reading is the one that is most likely given the context. Natural language utterances have a high degree of local ambiguity arising on different levels of analysis. On the level of speech recognition, 4 or 5 phonemes are often possible at a given point in the speech signal. The rat-rad problem is an example of phonological ambiguity. Even in the morphologically poor English there is a high degree of morphological ambiguity. Words—if we go by standard lexica—have a high number of usages. Since the advent of formal grammars, ever more syntactic ambiguities have been uncovered. Montague grammar added quantifier (operator) scope ambiguities and, finally, the study of discourse describes different ways in which anaphora can be resolved and discourse relations constructed between the clauses in a text. One can try to estimate the number of readings for standard sentences as the product of the various ambiguities counted on a one-word basis. This results in numbers such as presented in table (3). For my—quite disputable—estimates, the average word length is taken to be 6 letters, the standard sentence length is 10 words. (3)
    phonetic        5^6 = 7776
    phonological    1.1
    morphological   1.1
    syntactic       1.2
    lexical         2.5
    semantic        1.1
    pragmatic       1.3
The product of the non-phonetic factors is about 5.2, and this puts the ambiguity predicted for an n-word sentence at 5.2^n. In this estimate, a 10-word sentence gives rise to an average of 5.2^10 (≈ 14,455,510) ‘readings’.
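A quick check of this arithmetic, using the rough per-word estimates of table (3):

```python
# Per-word ambiguity factors from table (3), excluding the phonetic one.
factors = [1.1, 1.1, 1.2, 2.5, 1.1, 1.3]

per_word = 1.0
for f in factors:
    per_word *= f

print(round(per_word, 2))   # 5.19, i.e. "about 5.2" per word
print(int(5.2 ** 10))       # 14455510, the predicted readings for a 10-word sentence
```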
This figure can be considerably reduced by low-level techniques (such as looking up words in the lexicon for phonetic and phonological ambiguities or using a finite state approximation to syntax), and it is not easy to specify the number of ‘interesting’ readings that remain after these techniques have been applied, especially in a language-independent way. It is clear however that ambiguity is not fully eliminated. The argument however does not require astronomical numbers, just a number slightly above 1 for per-word ambiguity. For 1.1, a 10-word sentence has about 3 readings, for 1.2 about 6. In both cases, the probability that the hearer gets it wrong by just picking a reading exceeds the probability that she gets it right.

The ambiguity of verbal utterances is a decisive counterargument against the idea that an Aristotelian Competence Grammar (ACG) could provide an adequate account of production and interpretation. Aristotelian Competence Grammars cannot explain the very property that makes verbal communication useful: the fact that the hearer standardly comes to understand what the speaker wanted to say.

1.1.2. Against ACG: Time Complexity

Ambiguity is not the only argument against an account of production and interpretation based on Aristotelian Competence Grammars. A crucial step in interpretation is parsing, and grammar-based natural language parsing has been studied extensively following the work of Chomsky, Gazdar, and Shieber. Chomsky (1957) showed that natural language cannot be handled by a finite state grammar, Pullum and Gazdar (1982) showed that for the arguments considered by Chomsky for going beyond context-free grammar, context-free grammars suffice, and finally, Shieber (1985) gave an example of a phenomenon (cross-serial dependencies combined with case marking and considerable freedom in word order allow a homomorphism to a^n b^m c^n d^m) that is a formal counterexample to all natural languages being context-free, based on the Zürich variety of Swiss German. Human language understanding, however, is so quick that it is only compatible with a parser with linear time complexity. But the time complexity of parsers for context-free ambiguous grammars and for the ambiguous linear grammars proposed by Gazdar (1988) is not linear,6 making it impossible that human language
6 An exception is the proposal of Marcus (1978). It is far from obvious, however, that the look-ahead of 2 words assumed in Marcus’s parser to obtain determinism is universally sufficient or even sufficient for English. But it is a proper reaction to the problem that context-free parsing does not normally lead to the linear time complexity that seems characteristic of human interpretation.
understanding could be based on parsing an ambiguous context-free or linear grammar. This argument can be strengthened by looking at what goes in grammars for visual languages, grammars developed for dealing with categories of interpretable 2-D graphics, such as business graphics, flow diagrams, maps etc. Marriott and Meyer (1997) presents an overview of the complexity of recognition problems for visual grammars, coming to the surprising result that for any non-trivial visual grammar formalism, the recognition problem is unrestricted. The jump in time complexity is due to an explosion of the adjacency relation in 2-D graphics with respect to adjacency in strings. This means that symbolic parsing with such formalisms is not feasible and cannot contribute to recognising graphical depictions, which is a disappointing result for visual grammars. Linear processing for vision is intuitive: we seem to see ‘at a glance’. It must be concluded that classical symbolic parsing for visual languages cannot be a component of human vision and that the Bayesian approaches to vision in automatic vision have a far greater relevance in explaining human vision. And it does not make much sense at all that the human brain would have evolved symbolic parsing to deal with natural language understanding when it already had mechanisms for vision available. Crucially, such mechanisms rely on predictive information about what one is going to see, or in the case of language, on what one is going to hear. In both cases, this prediction should be understood not in the sense of predicting the visual signal or the words of the sentence but in terms of predicting the states of affairs one is going to see or going to hear about. There is therefore little prospect for a cognitive role of symbolic parsing within natural language interpretation. The Bayesian methods considered below eliminate the need for grammar-based parsing and have the additional advantage that they already implement the most rational hearer strategy of going for the most probable interpretation, a crucial part in explaining coordination. 1.1.3. Against ACG: The Gap between Production and Interpretation There is a third equally decisive argument against grammar-based production and interpretation, one based on an argument due to Clark and Hecht (1983). ACG production and interpretation predicts that what can be produced can also be interpreted, and that what can be interpreted, can be produced. That, however, is in direct conflict with data about what humans can produce and interpret both during the acquisition stage and in later life.
These data show that what can be produced is a small proper subset of what can be understood. In the face of these data, ACG-based production and interpretation seems untenable. Clark and Hecht (1983) bites the bullet and assumes two separate grammars, one for interpretation and one for production. That, however, introduces two new problems. The first one is the issue of harmony between the two grammars: why are there no mismatches, that is, productions that would be misunderstood by the interpretation grammar, or interpretations of utterances within the range of the production grammar that are consistently mapped onto a different utterance? The second problem is one of economy: why code the information in the production grammar twice?

The system we propose can also be understood as employing two separate grammars for production and interpretation. The problems of harmony and economy do not arise because the two grammars code complementary information. The production grammar constructs the right word order, assigns the correct morphological variant, and overgenerates (monitoring is not included). The interpretation grammar can be seen as a stochastic version of free categorial grammar, which overgenerates due to abstracting away entirely from morphology and word order, relying solely on what is normally seen as subcategorisation and anaphoric properties associated with the words. These grammars specify independent information that is standardly combined in grammars of the ACG family: word order and morphology in the production grammar, semantics and subcategorisation in the interpretation grammar. Proper models of human production are achieved by restricting the production grammar by self-monitoring implemented through the interpretation grammar. Proper interpretation is achieved by demanding that the result of interpretation best matches the production grammar. This gives consistency and economy and predicts that production (a restriction of the production grammar) is a proper subset of interpretation. Interpretation will continue to deliver interpretations even when production cannot map these interpretations to the utterance anymore.7
7 An important second factor explaining the gap—not important for the proposals in this book—is the fact that words in constructions compete in the production process as ways of rendering features of the message with factors such as recency, individual experience and individual previous use determining their chances for success. This straightforwardly gives the temporary and long-term preferences that explain the absence of perfectly well understood utterances in production.
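The division of labour just described can be caricatured in a few lines of code. Everything below (the two-entry lexicon, the tuple meanings, the priors) is invented purely for illustration; the real formalisms are developed in chapters 2 to 4.

```python
# Toy production grammar: maps a meaning to the form with the right word
# order (the real one also handles morphology and overgenerates).
PRODUCTION = {
    ("see", "anna", "bea"): "anna ziet bea",
    ("see", "bea", "anna"): "bea ziet anna",
}

def interpret_candidates(form):
    # Toy interpretation grammar: ignores word order and morphology, so a
    # form is compatible with every meaning built from the same words.
    words = set(form.split())
    return [m for m, f in PRODUCTION.items() if set(f.split()) == words]

def hear(form, prior):
    # Bayesian interpretation: prior times a check that simulated
    # production maps the candidate meaning back onto this form.
    def score(meaning):
        produced = PRODUCTION[meaning] == form
        return prior.get(meaning, 0.0) * (1.0 if produced else 0.0)
    return max(interpret_candidates(form), key=score)

def speak(meaning, prior):
    # Self-monitoring: only utter the form if the hearer would get it right.
    form = PRODUCTION[meaning]
    return form if hear(form, prior) == meaning else None

prior = {("see", "anna", "bea"): 0.5, ("see", "bea", "anna"): 0.5}
print(hear("anna ziet bea", prior))          # ('see', 'anna', 'bea')
print(speak(("see", "bea", "anna"), prior))  # 'bea ziet anna'
```

The point of the caricature is only the direction of the dependencies: interpretation consults simulated production, and production consults simulated interpretation.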
1.2. Production Grammar As was shown above, the picture given by Aristotelian Competence Grammars of production and interpretation is not cognitively plausible and should be rejected. Does this mean that all ACGs should be rejected as such? Not so, since the problem concerns just the Aristotelian interpretation and since ambiguity is a problem only in interpretation, the various competence grammars are unproblematically interpretable as formulating constraints on the process of utterance production. They deal with important constraints which capture real aspects of language. This can be the official interpretation of the grammar, which would make ACGs answers to the question of how to speak a particular language. This interpretation clearly applies to Panini’s grammar (Katre, 1987), the oldest example of a formal grammar (6th century bc). The Astadhyayi is an oral text that has been uninterruptedly transmitted to this day, since memorising it has always been part of the standard way to learn Sanskrit. Early transformational grammar mapping deep structure to surface structure belongs to this tradition, as does systemic grammar (a formalism specifically developed for natural language generation and still standard in that area), functional grammar (Dik, 1989), or more recently optimality theoretic syntax (Grimshaw, 1997), a mono-directional formalism which maps semantic representations to surface forms. This last system will be the starting point of Chapter 2. The more interpretationally oriented Generalised Phrase Structure Grammar (Gazdar et al., 1985), Head-driven Phrase Structure Grammar (Pollard and Sag, 1994), Lexical Functional Grammar (Kaplan and Bresnan, 1982), and Categorial Grammar (Ajdukiewicz, 1935) can be reinterpreted as constraints on production too, though not quite as directly.8 Under this interpretation, the production-oriented grammar formalisms do exactly what they promise. Classical linguistics works for natural language generation (NLG) without serious problems up to the point that the influential textbook of Reiter and Dale (2000) remarks that syntactic realisation—mapping from abstract to surface form—is essentially solved (standardly with a systemic grammar in NLG). The remaining problems (NP selection, particles) in natural language generation can be related to
8 OT-LFG (Bresnan, 2000) is a proper reformulation of LFG as a production grammar. This seems the way to go and the “deep generation” of Chapter 2 Syntax can be seen as an attempt to reformulate in the same way.
the issue of self-monitoring considered in Chapter 3 Self-Monitoring—and experience in natural language generation is a strong influence for this chapter. Aristotelian interpretation can be seen as an ambitious additional demand on what linguistic competence is supposed to achieve. It has influenced the way grammars are designed, as seen in, e.g., the requirement of having a context-free formulation (in Categorial Grammar, , LFG, and to some extent HPSG) or in speculations about logical form in the generative tradition. It is largely responsible for what makes grammars complicated and syntax a difficult topic. Even so, competence grammars interpreted as formulating a constraint on the mapping from meaning to form can meet further restrictions. The account of this book aims at achieving linear production times and the possibility of evaluating partial interpretations for producibility of initial segments of the input. It also makes sense to interleave production with further queries about the thought to be expressed, as in Appelt (1992) or more generally in systemic grammar, but this aspect of generation will not be taken on board in this book. A subtle point is that as a constraint on production, the interpretation of formal grammars entails that it actually determines the Aristotelian relation. M is a meaning of F if and only if the production grammar maps M to F. The argument about coordination however entails that the recovered relation is insufficient for determining the meaning in a particular context of use: it gives many meanings without providing a criterion of choosing between them. The argument about time complexity entails that a symbolic algorithm will not be able to invert the relation in linear time. And finally, the production-interpretation gap indicates that the relation is not recovered correctly in this way: there are many meanings M for forms F such that the production grammar does not give the form F to a meaning M. An Aristotelian relation can be recovered, but it is not the correct one for interpretation. 1.2.1. The Primacy of Production A production perspective on rule based formal grammar can be supported from a linguistic point of view. Such an argument would establish that symbolic grammars should be characterisations of production and it would undermine the idea of interpretationally oriented grammar formalisms or directionally neutral ones. The arguments for the primacy of production are: (a) that linguistic generalisations run better if the grammar is production
oriented, (b) the existence of blocking and its Paninian explanation in terms of a production grammar and Panini’s elsewhere principle, and (c) the absence of semantic blocking by more specific interpretation rules. a. Linguistic Generalisations There are some linguistic areas for which it makes a remarkable difference whether they are described from the production perspective or from the perspective of interpretation. One of these areas is the use of personal pronouns (and more generally the use of referring expressions like demonstratives, definite and indefinite descriptions, and proper names). In NP generation, there is a long-established consensus that the process is governed by something like the referential hierarchy of Gundel et al. (1993). (4) is an adapted version that seems to work well for Dutch. (4)
FIRST > SECOND > REFLEXIVE > IN FOCUS > ACTIVATED > FAMILIAR > UNIQUELY IDENTIFIABLE > REFERENTIAL > TYPE IDENTIFIABLE
The rule associated with the hierarchy is to use a noun phrase which is correlated with the highest property in the hierarchy that holds of the referent in the input. For example, if the referent is REFLEXIVE (i.e., if it is identical with the referent of a higher argument of the same verb) but the referent is not the speaker or the hearer, it is possible to use a reflexive pronoun. If a definite description can be used, it should be used unless pronouns or reflexives can be used. The rule can be overridden by other concerns but it does give a close approximation of what people do. Moreover, one can directly use it to motivate constraints on pronoun resolution: first and second person entities and reflexive objects should be avoided as antecedents. If one tries to explain the same phenomena from the interpretation perspective, such constraints become mere stipulations.

Papers on pronoun resolution such as Kasper et al. (1992), Asher and Wada (1988), or Strube and Hahn (1999) are useful but rather unreadable and cannot be seen as contributing to a linguistic account of pronouns. The apparent exception, centering theory (starting with Grosz et al. (1995)), has also been shown by Beaver (2004) to be much more understandable from the production perspective. The referential hierarchy and the associated rule mentioned above seem to be the natural account behind generalised pronoun resolution.
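As an aside, the rule associated with (4) can be pictured as a simple selection procedure. The following sketch (in Python) is purely illustrative: the mapping from hierarchy properties to NP forms is invented for the example and merely stands in for whatever form inventory a language makes available.

    # Referential hierarchy of (4), ordered from highest to lowest.
    HIERARCHY = ["FIRST", "SECOND", "REFLEXIVE", "IN FOCUS", "ACTIVATED",
                 "FAMILIAR", "UNIQUELY IDENTIFIABLE", "REFERENTIAL",
                 "TYPE IDENTIFIABLE"]

    # Invented correlation between hierarchy properties and NP forms.
    FORM = {"FIRST": "first person pronoun", "SECOND": "second person pronoun",
            "REFLEXIVE": "reflexive pronoun", "IN FOCUS": "personal pronoun",
            "ACTIVATED": "demonstrative", "FAMILIAR": "definite description",
            "UNIQUELY IDENTIFIABLE": "definite description",
            "REFERENTIAL": "indefinite description",
            "TYPE IDENTIFIABLE": "indefinite description"}

    def np_form(properties):
        """Use the form correlated with the highest property in the
        hierarchy that holds of the referent in the input."""
        for prop in HIERARCHY:
            if prop in properties:
                return FORM[prop]
        return FORM["TYPE IDENTIFIABLE"]

    # A referent identical with a higher argument of the same verb, but
    # neither speaker nor hearer, comes out as a reflexive:
    print(np_form({"REFLEXIVE", "FAMILIAR"}))  # reflexive pronoun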
This account explains why one needs to check that a resolution of a third person personal pronoun is not to an object that should be realised as a reflexive pronoun or a first or second person pronoun, why definite descriptions should not be resolved to this kind of entity and preferentially not to in focus items, etc.

The second phenomenon where assuming the production or interpretation perspective makes a substantial difference is word order. A popular approach to German, Danish, and Dutch word order is field theory. In applying this theory, one divides a sentence into a number of fields and states what goes in those fields, possibly with some constraints on the order of elements which can go into the same field. A simple way of dealing with English word order is based on linear precedence rules, but field theory could also be applied. Such methods work, are easy to understand, and can be used almost directly to get things in the right order in a production system. And as discussed in Chapter 2 Syntax, a deeper explanation is even available based on prominence ordering. The inverse task of making sense of the meaning of word order is, however, much more complicated, if possible at all, because any particular ordering fact may have numerous explanations. And moreover, such undertakings have not led to any interesting generalisations.

b. The Elsewhere Principle

Idiomatic mistakes can be illustrated by (5).

(5)
Which hour is it?
Which hours do it?
How many hour is it?
How late is it?
John is on office.
John is in the office.
When one is asking for the time or reporting that John is at his office, these are correct formulations in Dutch, Italian, French, and German but not in English. The correct English formulations are, of course, equally bizarre when rendered in the other languages mentioned. Nonetheless, compositional semantics predicts that English speakers should be able to understand the utterances listed above correctly since the words mean the same and the syntax is similar. The effect of seeing that it should not have been said in this way will, however, block the standard interpretation or lead to alternative interpretations. From the production perspective, idioms are easy to describe. The general format is attributed to Panini and is thereby as old as linguistics. It is known as the Elsewhere Principle (Kiparsky, 1973) and can be stated as in (6). (6)
More specific rules override more general rules.
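In a production direction, (6) can be implemented by consulting specific entries before general rules, so that a matching specific entry blocks the general ones. The sketch below (in Python) is only an illustration of the principle; the rule format and the entries are invented, echoing the examples in (5) and the goose/geese case discussed next.

    # Specific entries, keyed by the meaning to be expressed.
    SPECIFIC = {
        ("PLURAL", "goose"): "geese",
        ("ASK_TIME",): "What time is it?",
    }

    def general_plural(noun):
        # General rule: N > N+s.
        return noun + "s"

    def produce(meaning):
        """More specific rules override more general rules."""
        if meaning in SPECIFIC:           # a specific rule applies: general rules are blocked
            return SPECIFIC[meaning]
        if meaning[0] == "PLURAL":        # elsewhere: the general rule
            return general_plural(meaning[1])
        raise ValueError("no rule for %r" % (meaning,))

    print(produce(("PLURAL", "goose")))   # geese, not gooses
    print(produce(("PLURAL", "book")))    # books
    print(produce(("ASK_TIME",)))         # What time is it?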
For example, the rule that says that the plural of goose is geese overrides the rule that makes the plural of English nouns N > N+s. In a production grammar, this can be implemented by having a specific rule which describes how to ask for the time or how to state that someone is in the office and preventing the application of more general rules which would lead to the production of deviant expressions such as the ones in (5). For the intended meaning, the deviant ways are simply blocked.

c. Interpretive Blocking

It is an important fact about natural language that there is a lot of blocking going on and that blocking nearly always occurs in the production direction. There exist, however, also some examples of interpretive blocking. One category of such examples is represented by (7), where the reading of example (7b) is blocked in example (7a). It represents a class of pronoun resolutions which seem to illustrate that having a good antecedent nearby prevents further search in the context.

(7)
a. Henk and Katja were surprised that the editors rejected each other's papers.
b. Henk and Katja were surprised that the journal rejected each other's papers.
Other examples of semantic blocking are of the kind illustrated by (8). It is claimed that the killing cannot have been done in a normal way though the lexeme combination cause to die on its own in no way expressly precludes that option. (8)
Black Bart caused the sheriff to die.
Neither of these cases of semantic blocking invokes a more specific interpretation rule. In the first case, it seems to be a property of the search for antecedents that finding a fitting antecedent prevents further search for less fitting antecedents—quite a general rule. In the second case, a proper explanation rather obviously has to do with the alternative of a simpler9 and equivalent expression, kill, which would be preferred if the speaker intended to report a normal killing. The rule to avoid a straightforward interpretation for an unusual way of expression—if indeed it is a semantic interpretation rule10—would also not be a specific rule that overrides a more general rule,
9 More frequent or less unusual: the length does not seem to matter much.
10 The similar example of Grice (1975), "Mrs. T produced a series of sounds closely resembling the score of Home Sweet Home", does not seem to imply that Mrs. T did not sing normally, merely that the singing was somehow unable to impress the writer.
but a general principle that is related to finding an explanation of an unusual way of expression.

Rather revealing is what happens with idiomatic combinations in interpretation. For example, John kicked the bucket can quite clearly also mean that John kicked a water container, and the German im Büro, which takes the place of in dem Büro and is the standard expression for being at one's usual place of work (if it is office-like), can also mean at the office, where the office is some office that is not the usual place of work of the subject. This suggests that while an idiomatic expression cues the relevant meaning, it does not block alternative interpretations, as would be predicted by a system of interpretational rules governed by the Elsewhere Principle. Consequently, interpretation is different from production.

As we saw in the examples of production blocking, there is a case for assuming production rules of different specificity. Panini's account requires such rules, and a view that denies production rules would need an alternative account of blocking. The evidence for corresponding interpretation rules that vary in specificity is just not there, since the putative cases of semantic blocking can be reduced to other mechanisms. The interpretation of idioms represents evidence against rule-based interpretation, which would predict that a more specific rule which interprets the idiom blocks alternative readings.

Many linguists have followed Panini in formulating grammars from the production perspective. Such an approach is found, for example, in Chomsky (transformational grammar, government and binding, the minimalist framework), Halliday (systemic grammar), Ross and McCawley (Generative Semantics), Dik (functional grammar), Prince and Smolensky (production phonology), or Grimshaw and Bresnan (OT syntax). The first formulations of more interpretationally oriented grammars are of a more recent date, with categorial grammar (Frege and Husserl in correspondence according to Casadio (1988), Ajdukiewicz (1935)) being a prime example of such an approach, HPSG attempting to be directionally neutral, and LFG being much like an interpretational counterpart to a version of transformational grammar.
This section reviewed a line of thinking which shows that it makes sense to think of language production as a rule-based process: it leads to an account of blocking in production which follows Panini's Elsewhere Principle. Rule-based production also explains noun phrase selection and word order, since these phenomena can be described much more simply from the perspective of production than from the perspective of interpretation.

Grammar as production explains the evolutionary change which led to verbal communication as the emergence of an organised way of producing signals to make others recognise one's intention. It works because hearers—as speakers—know how the sound should be organised given an intention, because they already have a sophisticated model of their conspecifics (in their own mental life) and of the world at large, and because hearers have learnt how to use words as cues to aspects of speakers' intentions, even before they incorporated the same words in their own speech. No additional change is therefore needed for interpretation, except for cue learning. Importantly, one does not need a rule-based interpretation component when one assumes that human language interpretation is Bayesian. But hard production rules are useful in interpretation, since they have strong effects on the production probabilities for interpretations. And a brief look at language history suggests that language evolution creates such hard production rules as a matter of course. The many English or French word order rules have been created along the way from West-Germanic to English or from Latin to French, but both West-Germanic and Latin had very few constraints on word order.

1.3. Strategies for Coordination

The idea that ambiguity resolution can be defined as finding the reading that is most probable in the context is not controversial. It is just rational. If one has a choice between different possible readings, picking the most probable one is the best strategy for increasing the probability that one is right. Quite rightly therefore, it is the guiding principle in stochastic parsing and signal processing in general. Adding the knowledge of interpretation probabilities to the linguistic knowledge of competent users of a language therefore helps with the coordination problem. It increases the mean probability of coordination to the mean probability of the most probable reading being the right one, i.e., to the mean probability of the most probable reading.
But this does not result yet in a rate of successful coordination that would reflect the confidence human language users show in verbal communication. One trusts that one will be understood, while allowing for the possibility of failure. That kind of confidence requires a success rate that is well above 0.9. The hearer strategy of maximising probability does not reach such a success rate. For example, take a 10-way ambiguous expression and assume that the frequency of the different readings in a large enough corpus is randomly distributed by assigning random numbers below 100 to each reading as the frequency of that reading. The probability that the dominant reading is the correct one is lower than 0.2. The strategy of the hearer, however, cannot be improved upon: it is already the best that one can do in the face of uncertainty. That means that the only way to understand normal rates of coordination is by assuming speaker cooperation in achieving coordination. Using the same probabilities as in interpretation, the speaker can try to select an utterance that has its intended meaning as the most probable interpretation. If the speaker can carry out this strategy with a high rate of success, say 0.9, the combination with the hearer strategy will score 0.9 as well, that is, it will be within the range that seems to match our degree of confidence in verbal communication. The hearer strategy is just inherited from other interpretation, as in vision or in understanding non-verbal communicative acts. The speaker strategy is an adaptation to the hearer strategy driven by the fact that the speaker speaks because she wants to achieve some goal. It involves looking at one’s own utterances before articulating them and seeing if they will have the desired effect on the hearer. The phenomena studied in Chapter 3 Self-Monitoring suggest not just that speakers engage in such self-monitoring, but also that the process is automatic. People are not aware of subtle word order decisions or decisions of inserting a particle or using a pronoun. The speaker strategy of monitoring her own production in order to make the intended reading the most probable one for the production is the most important innovation of this book. Monitoring is part of bidirectional optimisation as in Smolensky (1996) and Blutner (2000): every production is monitored for being interpretable as the input for the production. Unfortunately, that is not good enough for coordination in the face of ambiguity. Bidirectional optimisation fails empirically if its bidirectional constraint systems fail to capture the substantial ambiguity of natural language. That means that checking by reverse optimisation does not increase the probability of coordination: it merely guarantees that it is not impossible that the
hearer finds the intended input. Similarly, the reverse optimisation does not contribute to reaching the most probable interpretation. This and the serious problems of bidirectional optimisation noted by Hale and Reiss (1998) and Beaver and Lee (2003) make this style of bidirectional optimisation not a good alternative for the theory of this book.11

Self-monitoring is central to Levelt (1983)'s account of speaking, but is not systematically connected with the syntactic problems discussed in Chapter 3 Self-Monitoring. Self-monitoring is also important for historical explanation. The hard syntactic rules that one finds in many human languages can be understood as fossilized patterns that were created by monitoring. Monitoring is also a way of describing the expressive pressure behind the recruitment processes assumed in grammaticalisation processes that created functional words and morphemes out of lexical words. It would in this way be responsible for case morphology and tense and aspect morphology, whose formation requires expressive pressure for marking the relation between the event type and its arguments and for locating the event with respect to the moment of speech and other events.

This section gives a general answer to the question of how coordination can be reached in the face of ambiguity. The hearer must choose the most probable interpretation and the speaker must make sure that that interpretation is the intended one. The real question, however, is whether and how speakers and hearers can carry out their strategies, and that is the subject of this book.

1.4. Bayesian Interpretation
Bayesian interpretation is based on Bayes' theorem: p(A|B) = p(B|A)·p(A) / p(B). This immediately follows from the definition of conditional probability as p(A|B) = p(A&B) / p(B), since that gives p(A|B)·p(B) = p(A&B) = p(B&A) = p(B|A)·p(A).
11 While in section 1.1.1 it was argued that the theory of Boersma (2001) does not give a correct account of ambiguity, it is still very close to the account given here. Boersma's context-sensitive semantic constraints would make the Aristotelian relations R_C(F, M) one-to-one and interpretation directed towards probability maximisation. The probability maximisation of Chapter 4 Interpretation integrates production syntax, but not as part of the same constraint system. It also characterises a much more extended relation than the relation obtained by monitored production (Chapter 2 Syntax and Chapter 3 Self-Monitoring).
An interpretation problem can be defined as finding the most probable interpretation I for a signal S: argmax_I p(I|S).12 Since S is given, this can be rewritten as argmax_I p(I)·p(S|I). p(I) is called the prior probability of I, p(S|I) the likelihood of S given I (we shall call p(S|I) the production probability throughout since it measures the probability that S is produced from I). Bayesian interpretation is any process that finds the most probable interpretation by maximising the product of the prior and production probabilities. It is a widely used technique in signal processing that is needed especially in those cases where the posterior probability p(I|S) cannot be reliably estimated in a direct way, but prior and likelihood can. This is the case in automatic speech perception and computer vision. For example, in computer vision, the production probability is treated as a 'mental camera', a physics-based mapping from the hypothesised interpretation to the signal it would yield. The prior probability is given by the probabilities of different events in our natural environment. In speech recognition, prior probabilities are estimated by corpus statistics and the production probability by the degree of similarity between what would be expected from the hypothesis and the observation.

In natural language, the process can be illustrated by Grice's example: "I am out of petrol", said by a man leaning out of the window of his immobilised car. To see that it is not the explanation of a gas station owner who is out of stock, but an appeal for help by a man with a problem, the expectations about the situation (the prior) are needed. Likelihood or production probability is needed for inferring that the man is asking for help rather than explaining his situation (one can ask for help by stating the problem), or for inferring that "I" is used for his car (the possibility of metonymy in English). Production probability for this hypothesis decreases quickly when a different noun (e.g. sugar) is chosen instead of petrol or an ungrammatical reordering of the words is produced. The hypothesised message as such needs to make sense and it must be quite likely that one would have chosen the utterance oneself if the hypothesis were true.

There are various classical ideas that can be seen as approximations to Bayesian interpretation. Bos (2003)'s treatment of presupposition and anaphora resolution by means of theorem proving is a good example.
12 argmax_x f(x) is the partial function that delivers the value for x for which f(x) gives the highest value, if there is such a value.
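Before turning to Bos's proposal, the formulation above can be made concrete with a minimal sketch. The candidate readings echo Grice's petrol example; all probability values are invented for the illustration, and the n-best cutoff merely anticipates the incremental algorithm of Chapter 4 Interpretation.

    # Candidate interpretations cued by "I am out of petrol", with invented numbers.
    # prior:      p(I), how probable the interpretation is in the context
    # production: p(S|I), how probable it is that the speaker would produce
    #             the observed signal from that interpretation
    candidates = [
        {"reading": "appeal for help with an immobilised car",
         "prior": 0.30, "production": 0.60},
        {"reading": "report by a station owner that stock has run out",
         "prior": 0.05, "production": 0.40},
        {"reading": "literal statement about the speaker's body",
         "prior": 0.01, "production": 0.05},
    ]

    def interpret(candidates, n_best=2):
        """argmax_I p(I) * p(S|I); the n_best cutoff stands in for the limit
        on the number of hypotheses carried along at any one time."""
        scored = sorted(candidates,
                        key=lambda c: c["prior"] * c["production"],
                        reverse=True)
        return scored[:n_best][0]["reading"]

    print(interpret(candidates))  # appeal for help with an immobilised car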
Bos is able to check by theorem proving whether increments to the context are consistent and informative and by grammatical processing what are the legal interpretations of a given utterance in an Aristotelian grammar. This simple amendment of interpretation from just legal interpretation to legal, consistent and informative interpretation makes good pragmatic sense, both for the problems considered by Bos and for other phenomena such as disambiguation, conversational implicature and intention recognition. The treatment can be reduced to a Bayesian interpretation model, in which all legal interpretations of an utterance U get uniform production probabilities, illegal or uninformative interpretations get production probability 0, inconsistent increments get prior probability 0, and consistent increments get uniform priors. Bos's implementation is clearly the best one can do with classical logic and Aristotelian grammar and is quite close to Aristotle's view in chapter 16 of De Interpretatione.

Restoring proper production probabilities by using a production grammar could, however, bring in other ideas about disambiguation like tree complexity, comparisons between different anaphora resolutions as in Centering Theory, Beaver's reinterpretation of centering as production grammar (Beaver, 2004), the likelihood of a word or construction being chosen to express a particular concept (blocking), and even the likelihood of the utterance type and content as the speaker's strategic choice to achieve her intention, if the production grammar incorporates these aspects. Restoring proper prior probability refines consistency to include stochastic causal reasoning about the world from what is given in the context.13 And importantly, what the speaker says may be inconsistent with the context, e.g. when the speaker is correcting earlier information, or stating a counterfactual. This means that it is not so much the content that needs to be checked for consistency (the content is merely much more probably consistent with the context than not) but the intention of the speaker. If one restores full probabilities, prior probability maximisation captures the search for maximally probable interpretations (maximally probable in the sense of the probability that the speaker may want to say it in the context).

If Hobbs et al. (1990) is on the right track in reducing pragmatic reasoning to a search for the most probable explanation of the utterance, we may account for the pragmatic aspects of interpretation by incorporating the
13 The assumption that the hearer is just comparing her subjective probabilities in interpretation is not likely. Like other elements of the context, the probability comparisons must be common ground between speaker and hearer. The hearer can use her subjective probabilities only provided that she takes into account her uncertainty with respect to the precise numbers. An interpretation will be the right one only if it is more probable beyond that uncertainty.
proper assumptions about what people would want to say in the context in the prior probability of the interpretations. Maximisation of the production probability makes the interpretation sensitive to preferences in the choice of expression. Such formulation preferences are important for the Black Bart example and other pragmatic inferences. The computation of interpretation hypotheses is more or less problematic in various applications. It is trivial in cases of speech recognition where a small set of phonemes constitutes a small set of hypotheses. It is trivial for production OT phonology where, as shown by Frank and Satta (1998) and Karttunen (1998), the production system can be approximated by a finite state transducer. Such transducers can be inverted to give the correct mapping from the surface structures to the underlying structures. But in computer vision, finding hypotheses is a major obstacle which is approached by developing learning algorithms that learn a cue system and prior probabilities at the same time, with the weighted cues used for producing the hypotheses. Natural language is in the same position: the set of possible interpretations is infinite. But fortunately, in natural language, identifying the cues seems quite straightforward. Words, morphemes, combinations of words and morphemes, combinations of words, and words and utterance starts and ends seem natural candidates. In Chapter 4 Interpretation, these are combined in a stochastically constrained way to yield ever larger hypotheses. The process simultaneously evaluates cues and their combinations and maximises production and prior probabilities. The above shows that one may develop Bayesian interpretation schemes to find maximally probable interpretations for natural language utterances. But there is also other evidence that humans use a form of Bayesian interpretation. First of all, there is negative evidence for the obvious alternative: stochastic parsing does not offer a cognitively plausible account of probabilities needed for the speaker and hearer strategies. The relatively high success rates in stochastic parsing14 and speech recognition depend on corpora of a size which exceeds human language experience and require the use of mathematically sophisticated estimation techniques as well as serious numerical processing. While one can imagine that brains can approximate the estimation techniques and the number crunching, the size of the corpora needed is an insurmountable problem. It does not follow that stochastic
14 In terms of F-scores on labelled bracketing (the relevant technical measurement of success), exact match is not nearly as good.
parsing and speech technology do not in some sense model what humans are doing. They do so if human communicators follow the hearer strategy, since stochastic parsing does model hearers trying to go for the most probable interpretation. But humans cannot obtain their probabilistic models in the way in which stochastic parsing technology obtains them simply because they do not have sufficient experience of language use. If hearers do not have access to the right numbers, they cannot follow algorithms analogous to stochastic parsers with the same success rates.

The second problem with the plausibility of stochastic parsing as a cognitive model is due to its incompleteness: parsing just goes from lists of words to syntactic trees or dependency graphs, while interpretation should reach full speaker intentions. More complete systems could arise from new algorithms for lexical disambiguation, resolution of pronouns, semantic ambiguities and context integration. Arguably, these tasks require their own probability models to be learnt from data sets because combinations would quickly run into the sparse data problem. The new models should then run in a pipeline architecture in which one process treats the output of another process to produce proper interpretations for a list of words. If it is true that the probabilistic models cannot be combined, this would mean that the errors in each step have to be multiplied. In this way, 0.8 correctness on 5 separate modules would entail an overall correctness of 0.8^5 ≈ 0.33—and that is not good enough to explain normal coordination.

Learning a rule system for production and using the model of prior probability needed for vision—which includes the task of inferring intentions of humans such as wanting to catch a bus and grabbing a cookie—requires considerably less data, especially if one takes on board OT learning (Tesar and Smolensky, 2000). Bayesian interpretation for natural language is therefore much more plausible from a cognitive perspective, since it is well-established that rules can be learnt with relatively little data and the model of the requisite prior probabilities already needs to be assumed for vision. The additional resources needed for natural language interpretation over and above what is needed for vision are a system of verbal cues and the production probability. The main cues in language interpretation are words and morphemes, with an additional role for constructions15 and non-verbal signals. They cue what one always assumed for words, morphemes, and constructions: their meanings. Cue integration is the process of combining
15 Combinations of words can be cues in their own right, as can be seen in idioms and collocations, and it may be useful to have cues for other frequent combinations as well.
these meanings into complex meanings. And production probability is accessible in the form of simulated utterance production from a hypothesised meaning. It is clear that humans can formulate utterances given a content and that all that is needed is that this ability be integrated in utterance interpretation. So the conclusion should be that a cognitive theory that gives a Bayesian account of vision would also explain natural language interpretation. It would share the overall architecture and the estimation of prior probabilities. The integration of utterance production in interpretation (simulated production) would have co-evolved with the development of utterance production itself. This in turn can be seen as the exploitation of an ever-expanding system of conventional cues for meanings to achieve communication. If one assumes that the interpretation process is limited to a maximal number of cued hypotheses at any one time, the interpretation process is also linear, as a variant of the n-best algorithm in stochastic parsing. Such an n-best algorithm could also explain the immediacy of vision. The case for Bayesian interpretation of natural language is thereby at least as good as the case for a Bayesian account of vision. If vision is Bayesian, it becomes very unlikely that the advent of language required the development of new architecture for NL interpretation.

This gives three arguments for assuming that humans approximate Bayesian interpretation. It seems to be the only option which accounts for the computation of the most probable interpretation for both the hearer and speaker strategies that lead to standard coordination on the meaning, given that direct estimation of the probabilities is not possible without sufficient data. Secondly, Bayesian interpretation follows the most plausible architecture for vision and can draw on the same resources. This makes it evolutionarily unlikely that brains would have invented a new mechanism for language. And finally, Bayesian interpretation automatically integrates pragmatic interpretation. There is, however, also considerable empirical evidence for simulated production in natural language interpretation, and Bayesian interpretation for natural language fits in with a functional interpretation of mirror neurons as a general mechanism that boosts the quality of Bayesian interpretation. These two aspects are briefly discussed in the next two subsections.

1.4.1. Simulated Production in Interpretation

Simulated production in interpretation is one of the two pillars of Bayesian interpretation. A survey of empirical psychological research which points
in the direction of simulated production in interpretation is given in Pickering and Garrod (2007). A classic case is the phenomenon of mirroring: a subject has to repeat the verbal input presented to him by headphones. Reproduction based on an understanding of the input followed by reformulation should take more time than understanding. The speeds observed in the mirroring tasks are however below the time it would take to understand the message properly. One can add other phenomena to the impressive list of phenomena in Pickering and Garrod (2007). Simulated production makes sense of the joint completion of an utterance, the common phenomenon whereby the hearer finishes the utterance in parallel with the speaker (see (Clark, 1996) for an extensive discussion). It is a strong signal that the hearer understands the speaker. What is required for this ability is a full activation of production in understanding, incremental interpretation, and a strong top-down prediction. It can be described as a state of the hearer’s production system where incremental interpretation and top-down prediction jointly come up with a complete reconstruction of the speaker’s plan, so that the production system can now produce the rest of the utterance simultaneously with the speaker. Another phenomenon that may be reduced to simulated production is the difficulty people experience in trying to switch off awareness of linguistic errors when listening to people with an imperfect command of one’s native language.16 This phenomenon makes perfect sense if it is the case that in understanding one reproduces the full utterance in simulation. Simulated production17 may also explain the ability to give grammaticality judgements—which form, according to some, the empirical basis of linguistics. It seems to be an ability that has no functional interpretation beyond the obviously functional ability to notice that one did not understand an utterance. If simulated production is part of understanding, this provides the missing explanation: the utterance is incorrect because it cannot be successfully simulated, if it can be understood at all. If this explanation is on the right track, acceptability judgements are reducible to simulated production and do not give direct access to competence. The empirical basis of grammar would then unequivocally be formed by psychological
16 The author is rather bad at this; others report occasional success. It can become quite irritating to the speaker if the native speaker cannot refrain from stating the corrections which seem to come up with the interpretation of the faulty input.
17 The observation is Remko Scha's (p.c.).
mechanisms in language production and by the production behaviour which can be observed in human conversations and in texts produced by humans.

1.4.2. Mirror Neurons

Mirror neurons are neurons in the pre-motor cortex which are activated both during motor activity and in the perception of motor movement. Since in the latter case there is no corresponding movement on the part of the subject, the function of mirror neurons cannot be to drive motor movement, though it appears likely that they do play some role in organising muscle action. Interestingly enough, research on mirror neurons supports the hypothesis of simulated production (Rizzolatti and Craighero, 2004; Lotto et al., 2009; Kilner et al., 2007). There is evidence that speech perception involves activation of the part of the pre-motor cortex that is dedicated to articulation. It is as yet unclear whether other aspects of utterance production involve such motor activation.

Bayesian interpretation, however, comes up with its own hypothesis about the function of mirror neuron systems. From the Bayesian perspective, the most important role of mirror neuron systems would be the support they provide to distal perception of the activities of our conspecifics. What they do is a simulation of observed motor behaviour given a hypothesis about what the conspecific is doing, a simulation that will make a prediction about the observed signal which the signal will match to some positive or negative degree. The mirror neuron would be evaluating the production probability of the hypothesis and be part of the emulation of Bayesian interpretation, as a constraint on a hierarchical cue system producing hypotheses weighted by prior probabilities. The evaluation of perceptual cues in a context produces maximally probable interpretations, and simulated production then offers a prediction about what should be perceived if the interpretation holds, thus inhibiting or reinforcing the hypothesis. In this view, mirror neurons bring about a leap in the quality and thereby also the scope of distal perception of the behaviour of conspecifics, a leap that offers a considerable evolutionary advantage in the form of much improved predictions of the behaviour of the conspecific.

It follows—if one assumes with Barsalou (1992) that concepts are not amodal but rather bridges between mode-specific representations—that at least part of the application criteria of concepts, the question in what way a concept can be perceived to hold, must be connected to simulation abilities
of this kind. A concept will often determine a map from its perceptual cue to motor programs that execute it. Where a program can be simulated rather than executed, the simulation may provide additional visual information that can be checked against the signal.

It is not just biological evolution which selected the motor skills of our part of the animal world as a support of their observation and understanding of conspecifics and near-conspecifics but—as I understand from some of the participants in these events—also many of the programmers who try to let their systems win robot soccer world cups. This is a good engineering idea, which reuses the considerable amount of work that was devoted to getting a robot moving in intended ways to accomplish another difficult but necessary task, namely, to figure out what other robots are doing and why. Mirror neurons could be likened to the neurological world buying into Bayesian perception with exactly the same motive: to reuse the motor mechanisms in order to achieve better perception.

1.5. Conclusion

This chapter argued against Aristotelian Competence Grammars as accounts of production and interpretation and proposed and motivated an alternative interpretation of formal models of grammar as formulating a constraint on production. There is, however, a serious case for interpreting such grammars as giving constraints on a computational production algorithm which maps speaker intentions onto utterances. Under such an interpretation, competence grammars give a partial account of production and already thereby—if interpretation is Bayesian—a partial account of interpretation.

It was argued that the observed high success rates of coordination on meaning in communication can only be achieved by a hearer strategy of going for the most probable interpretation that is combined with a speaker strategy of selecting a form whose most probable interpretation is the intended interpretation. Bayesian interpretation based on a grammatical model of production seems to be the only way in which human hearers can execute their strategy of maximising the probability of interpretation. Moreover, there exists sound psychological evidence supporting simulated production in interpretation. Mirror neurons seem to offer a neural mechanism for such simulations, though the same mechanism could presumably also be realised by
networks of neurons, rather than single neurons. The model of prior probability needed for Bayesian natural language interpretation is the same as the one needed for vision and other perception.

The Bayesian interpretation model gives a bidirectional account of interpretation which combines interpretational mechanisms, such as cue evaluation and cue combination, with syntax-based production. The speaker strategy of producing utterances whose most probable interpretation is the intended one also gives a bidirectional account of production, which combines syntax-based production with the evaluation of the result through cue evaluation and prior probability. There is no recursion in this kind of bidirectionality as in game theory (Franke, 2009) or in Blutner's weak bidirectionality (Blutner, 2000) and no assumption of perfection. Interpretation will continue to work outside the domain of production. The use of production rules just enhances the quality of interpretation and continues to work if the correct production rules are not available for the utterance, as will happen for interpreters who speak the language in question imperfectly or for interpreters who do not have the utterance in their production repertoire. And moreover, the speaker's self-monitoring for correct interpretation can be equally imperfect, as long as it is good enough to guarantee a high success rate for coordination.

Bidirectional accounts of production and interpretation solve the ambiguity problem for coordination and explain the gap between production and interpretation. Chapter 2 Syntax will show that rule-based production is compatible with linear time complexity and Chapter 4 Interpretation will show that a variant of the linear n-best algorithm in stochastic parsing can be used to give an interpretation algorithm for stochastic cue integration that emulates Bayesian interpretation. Using the n-best algorithm also naturally leads to incremental interpretation.

The theory of this book will be a linguistic model of production and interpretation that should deal with syntax, semantics, and pragmatics in the sense that all well-established insights in these three fields can be integrated, and at the same time a model which also deals with the following cognitive aspects of production and interpretation that are difficult for existing accounts based on ACG.

Coordination in Communication
The fact that verbal communication is normally successful in the sense that the utterance by the speaker normally leads to the hearer grasping what the speaker wanted to say.
Linear Production
The fact that formulation of an utterance takes an amount of time that does not explode with the size of the message (as measured, e.g., by the length of the utterance).

Linear Interpretation
The fact that the interpretation of an utterance takes an amount of time that does not explode with the length of the utterance.

Bidirection in Production
The fact that one cannot express a meaning by an utterance that one could not interpret as expressing that meaning.

Bidirection in Interpretation
The fact that one cannot interpret an utterance that one could utter oneself by an interpretation that one could not express by the same utterance (if one were the speaker in the context).

Explaining Simulated Production in Interpretation
The (by now) well-established psychological hypothesis that interpreting involves simulated production.

Incrementality of Interpretation
A well-established psychological hypothesis that human interpreters interpret every initial segment of the utterance on all levels of interpretation.

The Gap between Production and Interpretation
The observation that what people can say is a proper subset of what they can understand, both for language learners and adult speakers.

It is with respect to this list of cognitive properties that the model claims success. The list should also be a benchmark for future improvements of the model.

Precursors and Parity

The theory introduced in this chapter can be read into the account that Grice (1957) gives of non-natural meaning. According to Grice, a speaker
non-naturally means to achieve an effect M on the hearer with a signal S if and only if the speaker intends to bring about M partly by the hearer's recognition of the speaker's intention in producing S. Communicative success—coordination—can then be defined as occurring if the hearer in fact recognises the speaker's intention in producing the signal.

It is but a small step to add a definition of what constitutes a recognition of a speaker intention. To recognise an intention is to know that a particular intention caused the speaker to produce the given signal. This can be known only by knowing that the intention can be a cause of the signal and by knowing that other potential causes of the signal can be ruled out as less likely. And that cannot be done unless the hearer assumes the speaker follows the speaker strategy of avoiding signals that have more or equally probable alternative causes. That assumption is rational. If the speaker intends to achieve the effect M, following the speaker strategy is just the rational thing to do, since it increases the probability that M will be achieved. Bayesian interpretation accounts for how the hearer can maximise the odds that he has got it right. The speaker's strategy accounts for how he can know that he is right.

Another precursor of the theory introduced in this chapter, but also of simulation theories of understanding and of mirror neuronal accounts of understanding, is Alvin Liberman's motor theory of speech perception, which states that speech recognition is distal perception of the motor gestures of the articulatory system which produced the articulation. The problem that Liberman deals with in his theory is the obvious asymmetry between speaking and hearing: execution of the articulation of a word (or even a nonsense word) is a very different process from performing its acoustic analysis and arriving at a (nonsense) word. Liberman's theory is either a platitude (one cannot recognise a word without knowing how to articulate it) or a complicated psychological hypothesis. In the latter form, it has not yet found strong confirmation and has encountered a considerable amount of criticism. Liberman later reformulated the issue to which the motor theory gives an answer (Liberman et al., 1967) as the question at what point the articulation process and the acoustic process reach identity of representation, in other words, as the parity problem. The motor theory would then say that parity happens at the level of articulatory gestures which make up the articulation.

From the perspective of this book, one would like to add the speaker strategy to the motor theory and let the speaker monitor for correct understanding
to obtain a signal that is optimally suited18 for the hearer's recognition of the acoustic signal. This in fact deals with one line of criticism of the motor theory which points out that speakers improve on a neutral execution of their articulatory gestures in order to increase understandability.19 The motor theory can then be directly interpreted as putting parity at the level of the phoneme representation as the point where—even for nonsense words—alternative phoneme hypotheses are discarded by the strong priors on phoneme sequences determined by the phonological system of natural languages.

The same parity question can be asked about language production and interpretation. The process that leads from goal to articulation is essentially planning within the space provided by the language, and the interpretation process is a question of evaluating and integrating stochastic cues within the same space. Grice can be seen as having supplied the answer to the parity question in the face of ambiguity: at the level of the speaker intentions. Bayesian interpretation gives the reason why that should be so: it is about the intentions and their informational content that the strongest prior probabilities are available. Assuming that prior probabilities are what we know about the causal order in the world, they can be seen as a prediction about what will be seen at time t based on what was seen up to t. The causal order has only an indirect effect on directly observable properties or on what word is to be expected after another word, that is, on the prior probabilities for lower-level recognition. One sees a box but only derivatively the coloured region that makes it up. This is matched by the fact that we hear that Marie is happy and only derivatively hear the word Marie. Below intention, prior expectations may and do exist but they should always give way to the stronger prior probabilities at the conceptual and intentional level.

This provides a second argument against the pipeline architecture in interpretation. There is no good reason to assume that reaching parity on the level of words, syntax, or word meaning is a prerequisite for parity on the higher level as assumed in e.g. Pickering and Garrod (2004). The inherent ambiguity at the lower level makes it strategically better to maintain the ambiguity until the highest level is reached and the strongest prior probabilities can be applied.
18 A compromise between minimising effort and maximising understandability that still gives high recognition probability.
19 It is not interesting to object that one can also understand synthetic speech. That is like seeing that a person is trying to catch a rabbit in a movie or in a cartoon.
And it gives a second cognitive reason for the emergence of an incremental interpretation strategy: such a strategy allows early elimination of hypotheses by the strongest prior probabilities.

1.6. The Other Chapters

The theory given is such that any single component depends on all of the other components. Production depends on interpretation and interpretation on production. The mental representations—delayed until Chapter 5 Mental Representation—are the inputs to higher-level production discussed in Chapter 2 Syntax and are the output of the interpretation process discussed in Chapter 4 Interpretation, which means that these earlier chapters are in a sense not fully understandable without chapter 5. But production syntax is also used in interpretation and needs to be supplemented by the monitoring of Chapter 3 Self-Monitoring that calls on interpretation. The strategy of exposition is, however, to pick one linearisation of the components and to provide all the information about the component at the point where the component is explained, even when the motivation cannot be fully understood at that point. This is not a brilliant strategy, but I have not been able to come up with anything better. The following brief description of the remaining chapters should also help with this problem.

Chapter 2 Syntax has the general goal of providing an account of rule-based syntax that is learnable, can emerge in language evolution, allows for typological interpretation and can be seen to give rise to a linear production algorithm that also works on partial interpretations. The account of syntax in production optimality theory (OT) seems to fit these demands rather well. In this framework, one defines the set of optimal productions (surface forms in syntax or phonology) for an abstract input by a linearly ordered set of constraints, which form an OT model of the grammar or phonology of a particular language. Production OT offers an account of learning with relatively few data (as shown in Tesar and Smolensky (2000)). By assuming a small set of universal constraints, the learning task is reduced to learning the way in which these constraints are ordered, and this can be done with much less data than, e.g., learning Government and Binding-style parameter settings or a context-free phrase structure grammar. Production OT therefore provides an interesting approach to the problem of learning language production. If production can be covered by OT grammars and OT grammars can be efficiently learnt, it is cognitively plausible that human language users have acquired a grammar comparable to an OT grammar that maps semantic input to utterances.
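As an illustration of how a linearly ordered constraint set selects optimal productions, here is a minimal sketch of OT evaluation in Python. The two constraints and the candidate forms are invented placeholders for an invented mini-language, not the constraint system of Chapter 2 Syntax.

    def evaluate(meaning, candidates, ranked_constraints):
        """Keep, constraint by constraint in ranked order, only the candidates
        with the fewest violations; the survivors are the optimal productions."""
        survivors = list(candidates)
        for constraint in ranked_constraints:
            best = min(constraint(meaning, c) for c in survivors)
            survivors = [c for c in survivors if constraint(meaning, c) == best]
            if len(survivors) == 1:
                break
        return survivors

    # Invented constraints: express subjecthood by -nom, and avoid marking.
    def mark_subject(meaning, cand):
        return 0 if "-nom" in cand else 1

    def avoid_marking(meaning, cand):
        return cand.count("-")

    forms = ["ivan-nom maria-acc loves", "ivan maria loves", "ivan-nom maria loves"]
    print(evaluate("LOVE(ivan, maria)", forms, [mark_subject, avoid_marking]))
    # ['ivan-nom maria loves']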
Speaker self-monitoring offers an account of a way in which expressive max-constraints can arise in language evolution, and OT grammars come equipped with a prediction about the possible alternative grammars, as alternative rankings of the constraint system. As shown in Chapter 2 Syntax, OT syntax can be used as a specification formalism for a fast association-based generator and it provides a principled separation between 'pure' syntax and syntax that requires monitoring (an issue treated in Chapter 3 Self-Monitoring). Such a generator is essential for the incremental interpretation algorithm of Chapter 4 Interpretation. That algorithm could not be linear in time complexity without a linear-time generator. Moreover, production OT also works on partial semantic inputs, a property that is crucial for the interpretation model of Chapter 4 Interpretation. Other production models may turn out to have the same possibilities but that needs to be shown before it can be assumed that they can play the same role.

The OT model is not by itself a complete implementation of the speaker strategy. There are a number of central aspects of utterances which cannot be described as hard, automatic rules that are keyed by features of the semantic input only and which seem to be regulated by the possibility of misunderstanding of the utterance. These include word order in many languages (word-order freezing), the NP selection problem, the distribution of particles and optional morphology. Jacobson (1958/1984) observed that the order NP-V-NP in Russian can in principle express both SVO (subject verb object) and OVS (object verb subject), but that the option OVS disappears when the subject and the object are syncretic in nominative and accusative as in doc' ljubit mat' (the daughter loves the mother). The examples in (9) seem to refute a theory which makes it a hard rule that a pronoun should be used when the antecedent is in focus.

(9)
(??) John and Bill met. He was in trouble.
John met Bill. He was in trouble. (? if 'he' refers to Bill)
John met Bill. He talked to him about his problems.
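What freezing, pronoun choice and optional marking have in common can be pictured as a monitoring loop around the production grammar: produce a candidate, simulate its most probable interpretation, and move to a more explicit form when that interpretation is not the intended one. The sketch below is purely illustrative; the reading function stands in for the Bayesian interpreter of Chapter 4 Interpretation, and the toy context is invented.

    def monitored_production(intended, candidate_forms, most_probable_reading):
        """Speaker strategy: use the first (least marked) candidate whose most
        probable interpretation is the intended one; otherwise fall back on
        the most explicit candidate."""
        for form in candidate_forms:            # ordered from least to most marked
            if most_probable_reading(form) == intended:
                return form
        return candidate_forms[-1]

    # Toy reading function for the context 'John met Bill': a bare pronoun
    # is resolved to the referent in focus, here John.
    def reading(form):
        return form.replace("he", "John")

    # Intending to say something about Bill, the pronoun is rejected and the
    # name is used instead:
    print(monitored_production("Bill was in trouble.",
                               ["he was in trouble.", "Bill was in trouble."],
                               reading))
    # Bill was in trouble.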
Particles like too and again have a graded optionality in the contexts where they occur, ranging from fully obligatory to fully disallowed. Aissen (1999, 2003b) is a study of differential case marking where the case marking is mostly by optional nominal or verbal morphology.

Chapter 3 Self-Monitoring tries to make the case that these phenomena can all be related to the speaker strategy of making sure that she will be understood as intended. Obligatoriness of word order or extra lexical material or morphology is directly related to the probability that a particular semantic feature (θ-role
for word-order freezing and differential case marking, the identity of the referent of the pronoun, additivity for the two example particles) is understood as intended, and the obligatory feature marks that the first NP is the subject, that x is the referent, that the host is additive to an earlier answer to the question that the host clause addresses, or that the NP has the θ-role reflected in the case marking. Chapter 3 Self-Monitoring also contains an inconclusive discussion of a closely related case that may be interpreted as self-monitoring in phonology, the effects of the silent h on the surface forms of French words in which it occurs. The main result is the definition of what sets apart self-monitoring effects from hard syntax. The case is also made that the trigger for a monitoring effect, the expectation of being misunderstood, is directly related to the selection function in a functional account of linguistic evolution in which ways of expressing oneself that will be misunderstood are deselected in favour of ways of expressing oneself that carry a higher probability of leading to understanding. Monitoring can thus be seen as linguistic evolution in action where the speakers themselves select the better ways of expression and increase their frequency in the data sets from which others learn their use of language.

Chapter 4 Interpretation gives an algorithm for cue selection and integration that is directed to increasing the product of prior probability with production probability and runs as a mechanism to increase the activations of cues and integrated cues. It is a heuristic for finding the right interpretation, but it can nonetheless be compared with a free stochastic categorial grammar with restrictions on functional application that are farmed out to optimality theory. It is not a categorial grammar, because it combines concepts rather than expressions, because it also performs anaphora and presupposition resolution, because applications create a graph structure of concepts rather than an integrated logical expression, and because it treats the given pre-utterance context as involved in the reductions. All of these are, however, rather superficial divergences: Frege and Husserl's concept of a categorial grammar was also semantic, doing anaphora resolution with categorial grammar is close to standard for reflexive constructions, and modern computational versions of categorial grammar also compute dependency structures (so that calling these things semantic structures rather than dependency structures is just a terminological innovation). In contrast, farming out application restrictions to a separate account of word-order and morphology is just a good idea. One can then stick to free non-directional Ajdukiewicz-style application and use incremental Bayesian interpretation to justify the use of production OT syntax for obtaining the desired restrictions. This opens the perspective of an extension of categorial grammar
to visual languages, sign languages, mixed verbal and sign languages, and the integration of visual perception with verbal utterances in communication, by developing (integrated) incremental production probabilities for these other media. Moreover, the relation of categorial grammar with linguistic typology becomes optimal in one single step. Some examples of the relevant extensions are given in the chapter, including an attempt to deal with discourse structure on the basis of sentence and turn beginnings and endings, treated as cues to speech acts.

Thinking of an annotated20 graph structure as the semantic and pragmatic interpretation of the sentence needs its own defence. Part of that defence is adherence to the account of anaphora and presupposition in Discourse Representation Theory (Kamp and Reyle, 1993; Reyle et al., 2007). This account requires that the context constructed by earlier communication and through earlier and ongoing perception can be searched for its components, and this forcibly leads to the assumption of a representationalist account in which meanings need to continue to have enough structure.

Chapter 5 Mental Representation takes the graph structures of Chapter 4 Interpretation and argues that these give a viable alternative to Discourse Representation Theory that can be interpreted as connected sets of mental representations in the philosophical tradition from Aristotle to Husserl. It is in fact not far removed from the view of Aristotle in chapter 16 of De Interpretatione. Chapter 5 Mental Representation makes the case that as a theory of mental representation, the graph formalism has two important advantages over Discourse Representation Theory. The first is the treatment of when two minds agree about a representation, a treatment that within Discourse Representation Theory unavoidably leads to theories of anchoring. While accounts involving anchoring are not incorrect in the sense of failing to capture the data, they are complicated in postulating a level of representation that cannot be reduced to the basic level of representation. In the more classical theory developed in Chapter 5 Mental Representation, the part of the structure representing an object also represents the way in which that object is given, and this gives the simultaneous conceptual grip on the object that anchoring accounts have to reconstruct. A second advantage of the same kind is the trivial account one obtains of the semantic feature of definiteness. The structure representing the object and the way in which it is given either determines it uniquely, given a context and a model, or it does not. In
20 Nodes in the graph structure are annotated with context labels for semantic reasons, a step which makes them unambiguous in a way in which e.g. f-structures in LFG are not.
the modern discussion where objects and the ways in which they are given are standardly separated in the semantic formalisms employed, there are a number of different positions (Russellian (Russell, 1905; Heim, 2012), the familiarity theory (Hawkins, 1978; Heim, 1982), the functional theory Löbner (1985); Löbner (2011)) which all have cases where they obviously explain what is going on and cases where unintuitive stipulations are needed. The theory of definites is trivial when one is assuming traditional mental representations and it is only the attempt to reformulate it in predicate logic or discourse representation that leads to the conceptual mismatches between linguistic definiteness and notions like definition, given object and functionally determined object in the modern literature. Reanalysing definition, given object and functionally determined object in the light of the representational theory of definiteness is easy and all three notions give the same result. Assuming mental representations, defined objects, familiar objects and objects given by a function are the same objects. Chapter 6 Final Remarks takes up the thread from this Introduction and discusses the consequences for syntax, semantics, pragmatics and computational linguistics and tries to delineate the new goals one should pursue in the light of the results of the book.
chapter two
SYNTAX
The aim of this chapter is to deal with “hard” syntax, that is, the aspects of the forms of the utterances in a given language that can be described by hard rules. Within the structure of the proposed theory of production and interpretation, hard syntax plays two roles. Its primary role is to be part of the account of production, that is, of the process speakers go through when they produce an utterance on the basis of some communicative intention. While any account of production that would not take into account the speaker strategy of producing only utterances that have the intended interpretation as the most probable one in the context would be seriously incomplete, utterances must also satisfy certain hard syntactic rules. The focus of the chapter is on the part of the production process where utterances receive their morphology and word order. In this area, which one could, following a similar usage in the field of natural language generation, call syntactic realisation, there are important differences between languages. For a proper model of production, in addition to syntactic realisation one also needs higher-level generation, the area where the inputs for syntactic realisation are constructed and lexicalised. In addition, the further constraints on utterances developed in Chapter 3 Self-Monitoring need to be incorporated. The second aim of this chapter is to deal with hard syntax as part of interpretation. Bayesian interpretation incorporates the production probability of an utterance given a hypothesis. It is assumed to be estimated by a simulation of the utterance under the hypothesis. It is further assumed that this Bayesian interpretation is incremental, which requires the partial hypotheses arising from initial fragments of the utterance to be evaluated with respect to their ability to cause the observed initial fragment. Higher-level generation is much less involved in these evaluations and the considerations of Chapter 3 Self-Monitoring not at all: monitoring defines the adaptation to the bias for the most probable interpretation that is characteristic of the hearer strategy. It does not need to be checked again in interpretation. Syntactic realisation—possibly with a fragment of higher-level generation—is then what is shared by production and interpretation. The main result of this chapter is that production can be done with a quasi-linear
algorithm based on a specification of a plan to achieve a communicative task and that this same algorithm can also check whether the initial segments of the sentence are syntactically correct given a partial hypothesis about what they want to achieve. The first goal, the algorithm, needs to be reached in order to show that production can be effortless and instantaneous; the second goal, checking initial fragments of interpretation, again, is important for using production as an efficient filter in an incremental interpretation algorithm, especially such as the one defined in Chapter 4 Interpretation. If production were not linear, the interpretation algorithm also would not be linear. Human interpretation is incremental (as originally shown by Crain and Steedman (1985)). The main aim of this chapter is to show that there exists a production algorithm that has this combination of properties. Many—perhaps even all—proposals for syntactic description1 may turn out to allow for a generation algorithm that has both of these properties. The primary reason for using a restricted version of optimality-theoretic syntax as below is that the generation algorithm is simple and the demonstration of both properties is straightforward, in the second case even trivial. For other frameworks, it would be necessary to develop a concept of production to the extent that these properties could be proven and, to the best of my knowledge, such proofs are not available. This chapter is organised as follows. Section 2.1 gives a minimal introduction to optimality theory. In subsection 2.2.1, a simplified optimality-theoretic grammar is given for word order in a Dutch clause. This is then generalised in subsection 2.2.2 to a treatment of German and English, demonstrating that the constraint set for Dutch has some degree of typological validity. In section 2.3, it is shown that the particular set of constraints specifies a linear production algorithm. Section 2.4 demonstrates how the grammar can be extended to morphology and higher-level generation, and thereby lead from a task specification to the string. Section 2.5.3 discusses the checking of initial fragments. The optimality-theoretic grammars are important for Chapter 3 Self-Monitoring because they also form the basis for the requisite proofs which
1 Joan Bresnan’s optimality theoretic LFG (Bresnan, 2000; 2001) is a successful reformulation of LFG in optimality theory. The account of Dutch, German, and English below is strongly influenced by Gazdar et al. (1985) and, with some charity, could be seen as a translation of Generalised Phrase Structure Grammar into optimality theory. The treatment of Grimshaw (1997) is optimality-theoretic syntax with much influence from Government and Binding. There is no principled reason why such translations into optimality theory could not be achieved also for other formalisms.
show that the phenomena treated in that chapter cannot be dealt with in production grammar alone and they also serve as the basis for an extension of the algorithm that would incorporate automatic self-monitoring.
2.1. Optimality Theory
One may—and perhaps must—relativise the importance of optimality theory for the production grammar, but there are other reasons to be interested in an optimality-theoretic account of syntactic production. Optimality theory comes with a typological interpretation as well as extensive and promising results about learnability. Moreover, it is strongly influenced by what is understood about neural processing within the tradition of connectionism. All of these are central issues for an account of human utterance production. A treatment of language production that requires data which cannot be learnt is useless. An account of human language that does not make sound predictions about typological variation is not good linguistics. And a sound basis in what is understood about the brain is essential if we understand, as in Chapter 1 Introduction, competence grammars as abstract accounts of the process of production. While it is unlikely that the final word about formalisms for the characterisation of the syntax and semantics of human languages has been said, it would be surprising if new proposals would tend towards less learnable formalisms, had less ambition in typological explanation, or a weaker relationship with what is understood about how the brain works. Any new proposal should do better rather than worse in these respects. Optimality theory arose from an enterprise which aimed at developing phonology within the framework of connectionism. Certain neural nets can be understood as measuring how well-formed a surface form is (*markedness) or how well the surface form matches an underlying form (faith including the realisation of underlying features—the maximisation of features) and adding features to the output that are not given in the input (do not epenthesise) or its complexity (economy): they implement a constraint on the surface form given the underlying form. A combination of such nets could then define which surface form best meets all of the constraints for a given underlying form by defining weights for the different nets to give an overall evaluation. One can then define the best output as the candidate for which the sum of all the weighted errors is minimal: the optimal candidate. The simplest definition is given by assuming a set of constraints Constraint with associated
weights W_i and a set of candidate outputs Candidates. The best output for a given input can be defined by (10), where the weights are learnt from experience. E_i(c, input) stands for the error of candidate c for the input input on the i-th constraint. The function argmin_x f(x) finds the value of x for which f(x) assumes the lowest value, if there is such a value.
(10) argmin_{c ∈ Candidates} Σ_{i ∈ Constraint} E_i(c, input) · W_i
The question addressed in the enterprise was whether one could obtain a proper phonological description with the above-mentioned formalism using constraints of this kind. (11) gives an example of two possible constraints. (11)
faith(consonant): underlying consonants correspond to surface consonants and surface consonants to underlying consonants (errors for non-corresponding consonants on both levels)
*coda: no final consonants in syllables (errors for each final consonant)
In this formalism, the error weights can have many different outcomes. If the constraints are weighted differently (as in (12)), different compound weights result. The following diagram shows evaluations for different pairs of underlying and surface forms. The weights are given in the order faith(consonant) and *coda. (12)
Underlying   surface     0.6, 0.6   0.6, 0.4   0.6, 0.2   0.4, 0.6   0.2, 0.6
/matmat/     [mama]      1.2        1.2        1.2        0.8        0.4
/matmat/     [matma]     1.2        1          0.8        1          0.8
/matmat/     [mamat]     1.2        1          0.8        1          0.8
/matmat/     [matmat]    1.2        0.8        0.4        1.2        1.2
winner                   all        [matmat]   [matmat]   [mama]     [mama]
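The weighted evaluation in (10) can be made concrete with a small illustration. The following sketch (an illustrative Python rendering, not part of the original proposal) recomputes the compound weights of (12) from the error counts implied by the constraints in (11); the error counts themselves are entered by hand.

```python
# Sketch: the weighted evaluation of (10) applied to the /matmat/ example.
# Error counts follow (11): faith(consonant) errors = consonants missing on one
# level, *coda errors = syllable-final consonants in the surface form.

ERRORS = {                     # (faith(consonant), *coda) errors per candidate
    'mama':   (2, 0),
    'matma':  (1, 1),
    'mamat':  (1, 1),
    'matmat': (0, 2),
}

def weighted_winners(weights):
    """Definition (10): candidates with the minimal weighted error sum win."""
    score = {c: sum(w * e for w, e in zip(weights, errs))
             for c, errs in ERRORS.items()}
    best = min(score.values())
    return score, [c for c, s in score.items() if s == best]

for w in [(0.6, 0.6), (0.6, 0.4), (0.6, 0.2), (0.4, 0.6), (0.2, 0.6)]:
    print(w, *weighted_winners(w))
```

Running the loop reproduces the columns of the table above and shows that the winners are determined entirely by the relative size of the two weights.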
The discovery on which optimality theory is based is that sophisticated description is possible even without assuming weights. It is sufficient to order the constraints while assuming that a lower constraint can never override the effect of a stronger constraint regardless of how often the lower constraint is violated. This can be seen as a special case of definition (10), but is more perspicuously defined as in (13). This much simpler conception of the relative importance of the constraints strengthens the typological prediction. Now only the much smaller number of linear orderings of a finite set of universal constraints will give a possible pattern of a human language. And this improved typological prediction has a direct application in learning. All that needs to be learnt is the linear order of the constraints in a particular language, which requires far fewer data than would be needed for learning the weights of each
constraint (Tesar and Smolensky, 2000). The deterministic algorithm for production proposed below also essentially depends on optimality theory. It would not work with a set of weighted constraints evaluated by rule (10). If constraints are taken to be functions C_I from candidates to the number of errors under an input I, candidates can be compared and optimal candidates defined as in definition (13). This seems to depend strongly on the set of candidates, but normally different hypotheses about the set of candidates will deliver the same results.
(13) a. Candidate a is better than candidate b for input I in an ordered constraint system C_1, …, C_n iff ∃j ≤ n such that ∀i < j C_{i,I}(a) = C_{i,I}(b) and C_{j,I}(a) < C_{j,I}(b)
b. Candidate a is optimal for input I under an ordered constraint system C_1, …, C_n iff ¬∃b such that b is a better candidate than a for input I under C_1, …, C_n
To continue the example, there are now only two possibilities, faith(consonant) > *coda and *coda > faith(consonant). These give the tableaus in (14). A tableau is the standard way of presenting an optimality-theoretic competition between candidates under a constraint system. In a tableau, the input is written in the top left box, followed by the constraints in their order of ranking. On the vertical line, there are the different candidates for the surface form followed by the number of errors they incur on each constraint. The winner is indicated by ⇒. In the examples (14), irrelevant candidates are omitted.
(14) /matmat/      FAITH   *CODA
     [mama]        **
     [matma]       *       *
     [mamat]       *       *
     ⇒ [matmat]            **

     /matmat/      *CODA   FAITH
     ⇒ [mama]              **
     [matma]       *       *
     [mamat]       *       *
     [matmat]      **
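The ranked evaluation of (13) can be rendered in the same illustrative style: a candidate wins iff its error vector, read off in constraint order, is lexicographically minimal. The error counts below are those of (14); the function names are mine.

```python
# Sketch: ranked evaluation as in definition (13), applied to tableau (14).
ERRORS = {'mama': (2, 0), 'matma': (1, 1), 'mamat': (1, 1), 'matmat': (0, 2)}

def optimal(ranking, errors):
    """ranking: constraint indices, strongest first; a candidate is optimal iff
    no other candidate has a lexicographically smaller error vector."""
    vector = lambda cand: tuple(errors[cand][i] for i in ranking)
    best = min(vector(c) for c in errors)
    return [c for c in errors if vector(c) == best]

print(optimal([0, 1], ERRORS))   # FAITH >> *CODA  ->  ['matmat']
print(optimal([1, 0], ERRORS))   # *CODA >> FAITH  ->  ['mama']
```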
An optimality-theoretic production phonology for a language is the ordered constraint system that correctly predicts the set of surface forms as the optimal forms for a set of inputs under that constraint system. Any optimality-theoretic production phonology for a language predicts that any permutation of the constraint system is a pattern found in a possible human language
and that any other language can be correctly described by a permutation of the constraint system. An optimality-theoretic syntax for a language is a constraint system that would produce a correct description of the well-formed sequences of words in the language as the optimal candidates for the set of semantic inputs. It always comes with the claim that it gives a syntactic typology by predicting that other permutations specify possible human languages and that any other language can be characterised by a system drawn from the same constraint set. The programme of optimality-theoretic syntax can therefore be seen as amounting to the claim that there is a set of constraints such that for each natural language, there is a way of ordering the constraints in the set so as to give the set of well-formed expressions and meanings of those expressions. The set of constraints is the same for all languages but languages can differ from each other in how they order the constraints. Given a language-specific ordering, finding the optimal surface realisation decides about what, for the language in question, are well-formed surface structures and what possible meanings these surface structures may have. In particular, one can define the Aristotelian relation as (15), but, as noted in chapter 1, the relation does not give an account of interpretation. (15) R = {⟨F, M⟩: F is a winning form for a meaning M on the constraint ordering}.
2.1.1. Reversing Production
The claim that optimality theoretic syntax can capture all human languages is comparable to the claim that all human languages can be described by a context-free grammar and as such, it can be wrong. Indeed, there are some reasons to doubt the claim. The first of these is that in formulating the constraints, it may be necessary to refer to language-specific features. That would threaten the universality of the constraints—and thereby also the typological interpretation and learning—but would not be important for the generation algorithm. Second, it has been found that certain constraints come with a fixed order.2 Such fixed orderings merely strengthen the typological interpretation and make learning easier. And finally, there are phenomena that cannot be treated within the boundaries of optimality theoretic syntax. Chapter 3 Self-Monitoring treats a number of these cases
2 E.g. Aissen (2003b) has this for constraints arising from harmonic alignment.
and concludes that optimality-theoretic syntax as defined in this chapter is essentially incomplete and needs to be strengthened by another component, namely one that makes use of interpretation. Optimality-theoretic syntax is important for defining the production algorithm proposed in this chapter only in its most abstract form, that is, as a mapping of semantic inputs to forms using a ranked set of defeasible constraints. Non-defeasible constraints can only be satisfied at each other’s expense: a structure adjusted so as to meet some constraint can thereby acquire a property that makes it violate another constraint, so that the process of successively calling on the constraints is forced into backtracking with bad worst-case scenarios. And this happens not just in theory but also in practice, because generalised constraint solvers make for remarkably inefficient parsers and generators.3 Optimality theory resolves all conflicts between constraints by always letting the higher ranked constraints win over the lower ranked ones. A system of ranked constraints can therefore be implemented as a generator by calling, in their order of ranking, all of the procedures that try to satisfy the constraints, provided that the representation of the underspecified structure can take in the information coming from the constraints at the point where the constraint is called. This makes backtracking unnecessary and thus results in a deterministic process. As long as the individual constraints correspond to procedures that also have a linear time complexity, the resulting generation process will be linear. The possibility of an efficient computational interpretation in the direction of production has already been demonstrated for optimality theoretic phonology (Frank and Satta, 1998; Karttunen, 1998). Optimality-theoretic phonology can be implemented by finite state transducers.4 This implementation guarantees that the inverse of production can be computed equally efficiently: it is the transducer running in reverse. But finite state transducers
3 This is a generalisation of the experience of the author and others with early systems like CUF (Dörre and Dorna, 1993) and ALE (Carpenter and Penn, 1996). The problem is inherent in constraint solvers of this type. 4 As proved by Smolensky, standard optimality theory is not finite state (Frank and Satta, 1998). A version of the syntactic algorithm described here can, however, be used to give a full implementation in the production direction. A finite state implementation has to impose a limit on the number of errors that can be taken into consideration, because the size of the transducer increases rapidly when the bound on the number of errors is increased. The finite state implementation results in a reversible finite-state transducer, unlike an implementation along the lines of this chapter that is not automatically reversible.
can work only if the problem they deal with yields to finite state methods. For phonology, this is a plausible assumption but the relation between interpretations and surface forms has been seen as being of a higher complexity ever since Chomsky (1957). The algorithm provided for production is deterministic and therefore efficient, but not directly invertible. In this respect, optimality-theoretic syntax is similar to other formal systems that map meanings onto forms, such as early transformational grammar, generative semantics, systemic grammar, and functional grammar. What is special about optimality theory is that each step can be seen as a possibly unsuccessful attempt to let the partially specified representation meet a constraint, while the older systems employ structural transformations for the mapping. But formally, they are related. If it is clear which transformations are to be used, the computation of the surface form is efficient since backtracking can be avoided. The same holds for operations which build surface forms in systemic grammar from semantic specifications and for operations which map an underlying semantic form in various steps to a surface form in functional grammar. In the reverse direction, however, one can never be sure which operation caused what formal feature of the structure one is dealing with (the recognition problem for transformational grammar, see Peters and Ritchie (1973)). And in yet another instance of underdetermination of form by meaning, there will often be information lost along the way from semantic representations to surface forms. The way back from surface forms to semantic representations essentially involves traversing a large search space and if the search follows the specification of the production, the time complexity will be much higher than for production. This is part of the motivation for this book’s claim that natural language interpretation is much more like perception than like syntactic parsing. If a good grammar for production is like the optimality-theoretic syntax presented in this chapter, inverting such grammars involves an inefficient search process. If natural language interpretation is guided by probability weighted cues (words, morphology), it can quickly converge on probable interpretations which can then be inhibited by syntactic incorrectness or reinforced by syntactic correctness, as proved by simulated production. Syntax is surely a constraint on interpretation but it is not the central constraint defining the search process, the one that constructs the hypotheses to be considered. Production constructs a syntactic and morphological structure which may well be describable by a syntactic rule system—which is what optimality-theoretic syntax claims. The role of syntax in interpretation is best described as a production filter on interpretation: interpretations
must be such that they could serve as the input for the production of the perceived sentence.
2.2. Optimality-Theoretic Syntax
The aim of this section is to demonstrate the plausibility of the approach used by optimality-theoretic syntax by presenting a treatment of word order in a Dutch clause that uses OT-style constraints and by showing how this treatment can be adapted to English (a highly configurational language) and German (which has a freer word order). Later, it will also be made plausible that the treatment can be extended from word order to morphology, lexicalisation, and message selection. This section contains some results which are not directly relevant to the aims of this book. For example, it demonstrates that Dutch cross-serial dependencies can be treated in optimality-theoretic syntax, a treatment which also gives an outline of how could be handled in optimality theory (the problem of how to handle HPSG in optimality theory is quite similar). This section also presents what seems to be a historically and typologically plausible way of connecting the grammars of Dutch, English, and German. While the demonstration makes it plausible that the dialect of optimality theory employed here has some typological validity and can be used to capture mechanisms from other grammar formalisms, a full discussion of these issues must be postponed to another occasion.
2.2.1. Optimality-Theoretic Syntax for Word Order in Dutch
Optimality-theoretic syntax has been developed along the lines of optimality-theoretic phonology as a system that maps an underlying representation to surface representations (Grimshaw, 1997; Bresnan, 2000). Regarding the nature of the underlying forms and of the surface representations, various hypotheses have been considered. From the perspective of optimality theory as such, the choice of inputs is not of much importance as long as the structures which are chosen provide the information needed to decide to what extent the constraints are satisfied. In this book, a choice is made for the semantic representations of Chapter 5 Mental Representation. This is necessary because the interpretation process needs to produce structures from which the production process can be defined. The actual inputs for the system below are one step more specific and are derived from the semantic representations by an earlier optimisation round, discussed in 2.4.
An optimality theoretic system consists of a set of inputs, a set of outputs, and a linearly ordered set of constraints.5 What follows is a proposal for a constraint system for Dutch word order in the clause.6 Constraints are given in order of strength. Inputs are bags7 of bags of words with features on the bags coding the syntactic (sub)categorisation of the words and semantic constituents. They can be defined as in (16).
(16) 1. A lexical item with a feature annotation is an input structure.
2. A bag of input structures with a feature annotation is an input structure.
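As an illustration only, (16) corresponds to a simple recursive data type; the following sketch (with invented class and feature names) is one possible rendering.

```python
# Sketch: input structures as defined in (16). A structure is either a lexical
# item or a bag (multiset) of structures; both carry a feature annotation.
from dataclasses import dataclass, field

@dataclass
class InputStructure:
    features: frozenset                 # e.g. frozenset({'subj', 'np'})

@dataclass
class Lexical(InputStructure):
    item: str = ""                      # the word, e.g. 'hij'

@dataclass
class Bag(InputStructure):
    members: list = field(default_factory=list)   # order-free collection

# an example clause input of the kind used in the Dutch examples below
clause = Bag(frozenset(), [
    Lexical(frozenset({'su'}), 'hij'),
    Lexical(frozenset({'v', 'fin', 'main'}), 'leert'),
    Lexical(frozenset({'obj'}), 'Jan'),
    Lexical(frozenset({'vc'}), 'zwemmen'),
])
```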
The following is the list of constraints, in their order of ranking. There are three types. cohere(f)8 constraints try to keep the elements of an F-labeled input structure together (but will lose from stronger constraints). one = f will place an F-labeled input structure in the position one, the first position of the sentence. It can be frustrated by stronger cohere(g) constraints. f x Complement sentences and relative clauses come last. v[head] < v pending with v[comp] < x. 9 There is good evidence for allowing in this position also certain foci. See, e.g., Frey (2004) for German. There is no technical problem in incorporating such other elements but at this point, it is preferable to keep things simple.
v[head] < v: the verbal head comes before the verbal arguments.
v[comp] < x: verbal complements come first.
These last constraints are meant to deal with the order of verbs and arguments in Dutch clauses. They predict nested constructions in complex verbs, as in (18).
(18) (main) hij leert Jan zwemmen
(subordinate) hij Jan zwemmen leert
(subordinate) hij Jan leert zwemmen
He teaches John swimming.
These can be treated in the system by assuming the input (19). (19) input: {hijsu , leertv,fin,main , Janobj , zwemmenvc }
An optimality theoretic tableau for this input is given in (20). * stands for errors and ⇒ for the winning candidate. In this tableau, v[head] < x comes before v[comp] < x, though in main clauses with a single complement verb, this has no consequence. The candidates are the 24 permutations of the words in the input. The error scoring of the ordering constraints directly corresponds to the position of the offending constituent: the number of items that precede the constituent in question. The constraint names are abbreviated: 11 stands for one = subj, v1 for v[fin,main] < x, su1 for subj < x, ob1 for obj < x, hv1 for v[head] < v and vc1 for v[comp] < x.
(20) input: {hijsubj,np , leertverb,fin , Janobj,np , zwemmencomp,inf }
[Tableau (20): the 24 permutations of hij, leert, Jan and zwemmen scored on 11, v1, su1, ob1, hv1 and vc1 as described above; the winning candidate is hij leert Jan zwemmen.]
If one changes the input to make the sentence subordinate, the input is (21). (21) input: {hijsu , leertv,fin,−main , Janobj , zwemmenvc }
The following two diagrams show the effect of the two pending constraints, v[head] < x and v[comp] < x. In the following tableau, the latter constraint is ranked first.
[Tableau: the 24 permutations scored on 11, v1, su1, ob1 and vc1; the winning candidate is hij Jan zwemmen leert.]
Reversing the ranking of the two pending constraints, one obtains the following tableau.
input: {hijsu,np , leertv,fin,−main , Janobj , zwemmenverb,comp }
[Tableau: the 24 permutations scored on 11, vfm1, su1, ob1 and hv1; the winning candidate is hij Jan leert zwemmen.]
The main clause question order (leert hij Jan zwemmen) can be derived by assigning leert the feature wh. This forces leert to occupy the first position. In the subordinate question (of hij Jan zwemmen leert), wh is carried by the complementiser of. The interpretation of pending constraints C1 and C2 is disjunctive. An optimal candidate must be optimal either with respect to one linearisation or with respect to the other. The treatment of Dutch will be extended after introducing German and English and the procedural interpretation.
2.2.2. Provisional German
An approximation of the situation in German is presented in the following table. It is a simplification of the Dutch grammar where the lower constraints are ordered and the word order constraints based on functional roles disappear in favour of one single constraint, namely prom < x: prominent elements come first.
cohere(rel,sq)
one = wh
one = ct
cohere(dp)
cohere(scomp)
one = subj
v[fin,main] < x
prom < x
vcomp < x
hv < v
scomp < x
prom < x captures a word order constraint typical for languages with a relatively free word order. It was first described for German by Uszkoreit (1987) and is used for Korean and German by Choi (1999), but seems equally applicable to Latin or Polish. The following formulation is based on Aissen’s (2003b) development of the prominence concept for differential case marking. An element can be more prominent in a list of elements by having a more agentive thematic role, by being more animate, by being more activated (pronouns with preference for local pronouns), by being topical, and by being contrastive. Verbs and their projections are never prominent. An element is not prominent with respect to another element on the same list if the other element improves on it in one dimension without losing in another dimension (this is essentially the theory of Uszkoreit (1987)). The constraint prom < x gives an error for each dimension on which X is less prominent than Y if Y comes before X. The system as it stands captures the reduced word-order freedom in German with respect to Dutch for the finite verb and the increased word-order freedom in the middle field, illustrated in (22) (‘beibringen’ is more idiomatic).
(22) dat Jan Peter leert zwemmen.
*dass Jan Peter lehrt schwimmen.
*dat Peterobj Jan zwemmen leert.
dass den Peter Jan schwimmen lehrt.
*dat zwemmen Peterobj Jan leert.
dass schwimmen den Peter Jan lehrt.
*dat zwemmen Jan Peter leert.
dass schwimmen Jan Peter lehrt.
2.2.3. Provisional English
English, Dutch, and German should come out as variants of each other, which is what their classification as Germanic languages would predict. In this respect, English is the most problematic case.10 Certain semantic features, such as negation, polar questions, non-subject wh-questions, and polarity focus, require auxiliaries. This is part of higher level generation and leads to epenthetic occurrences of the auxiliary do in lexicalisation with these verbal features in cases where other auxiliaries are not present to bear the features. The following constraint system treats auxiliary inversion in questions, presentation construction inversion, and topicalisation. It introduces an additional position extra for extraposed topics. Complementisers include relative wh-phrases and at most one wh-question phrase which gets the feature c (complementiser) in higher level generation. It is further assumed that both presentational verbs and their locatives are simultaneously marked by a feature pres in higher level generation. With these features in place, English comes out very much like Dutch. The possibility to promote (contrastive topic) elements to the first position has disappeared with respect to German and Dutch in favour of a possibility to extrapose them. This makes the middle field ordering completely fixed, apart from the possibility of ordering multiple PPs freely.
cohere(rel, sq)
extra = [ct, -pres] This regulates the extraposition of topicalised elements: they must have the feature ct but cannot have the feature pres; pres elements are promoted to the first proper position even if they are contrastive topics.
10 The approach taken by Grimshaw (1997) and Bresnan (2000) to the insertion of auxiliaries in English does not lead to a treatment of German and Dutch word order. Moreover, it is tied to strong assumptions about the underlying trees.
one = [wh, q, -c, -v] These are the question non-verbal wh-elements, not the relative wh-elements that are +c.
cohere(np, pp, scomp, vcomp)
one = [c] This constraint puts the complementisers in first position if that is possible.
one = [pres, loc] The same for the locative in a presentational construction. If the first position is filled, the locative is treated as other PPs: who/that appeared across the hill.
v[wh or pres] < x The only verbs that can precede the subject. The wh feature can only be borne by auxiliaries.
subj < x
v < x
vcomp > x
An example of this is given in (23).
(23) There appeared someone.
Input: {appearedpres,fin,v , someonesubj , thereloc,pres }
                           lp1   v    su
appeared there someone     *          **
appeared someone there     **         *
⇒ there appeared someone         *    **
there someone appeared           **   *
someone appeared there     **    *
someone there appeared     *     **
English inversion is a remnant of the unrestricted Germanic inversion and applies only to auxiliaries (with feature wh) and to presentational verbs (with feature pres). This limitation puts the subject nearly—but not quite—always before the verb, thus making the preverbal position a powerful marker of subjecthood.
(24) Who did you see?
{whowh,obj seev,wh,past yousubj }
Since the specification of the verb requires v, fin, past, and wh, it can only be realised by a combination of didpast,fin,aux,wh and seev,−fin . The candidate set is therefore limited to the 4!=24 permutations of the bag {who, did, you, see}. one = wh restricts this to the six permutations where who occupies the first position, v[wh or pres] < x to the two permutations starting with ⟨who, did⟩, and subj < x makes ⟨who, did, you, see⟩ into the single optimal candidate. English, Dutch, and German share a common Germanic precursor which had a strong case system. This case system is almost completely lost in English and Dutch, while in German, it has merely become defective. It is plausible to assume that the precursor syntax was close to the system here attributed to German. The Dutch and English word order systems, on the other hand, can be seen as grammaticalisations where word order is recruited for θ-marking, a process that will not happen if θ-marking can make use of the case morphology. In English, grammaticalisation restricts the various orderings that are allowed by prominence to a linear ordering of functional roles. One could attribute the same restrictions to Dutch and that is, in fact, the line taken by the Dutch grammar specified above. But it is also plausible to let Dutch be the same as German in the sense that word order restrictions result not from grammaticalisation but rather from word order freezing, which is an effect of self-monitoring postulated in Chapter 3 Self-Monitoring. This can be argued from the exceptional word order in Dutch source-experiencer verbs such as to please or to frighten. In these cases, if the source subject is inanimate, the order object-subject is allowed, as in (25). This is the same word order one finds in German but is strictly ungrammatical in English.
(25) dat de soldaten het geluid verontrustte.
dat het geluid de soldaten verontrustte.
that the noise alarmed the soldiers.
The object-subject order is so rare in Dutch that it is hard to see how this exception could have survived against the massive predominance of the subject-object order if it were just a matter of grammar. The reduction of Dutch word order freedom in comparison with German may well be the result of Dutch speakers’ systematic avoidance of competing interpretations with unintended assignment of θ-roles. The same restriction of word order freedom happens in German and in Russian when case
marking or other factors do not properly distinguish the subject and the objects. What makes Dutch different is then merely the fact that in Dutch, case is nearly always unavailable for θ-marking. A full assimilation of Dutch and German, however, would predict that an object-subject order is also possible when head marking is sufficient for θ-marking and that, as shown in (26), is not the case.
(26) ?*dat Piet de kinderen sloegen
that the children beat Piet
The best solution seems to treat the Dutch word order facts as a grammaticalisation that would more accurately be captured by a version of the proposed constraint system based on θ-roles. The grammaticalisation is the emergence of the word-order constraints that enforce fixed order between subjects and objects from the situation characterised by prom under monitoring pressure. The exception then follows from the claim of Lenerz (1977) that there is no canonical order for source-experiencer verbs in German, which—in our setup—means that there is no word-order that can be frozen by monitoring, and consequently no word order that could grammaticalise.
2.3. The Production Algorithm
Optimisation by a set of constraints C1 , …, Cn can be taken as a recursive definition of a reduced candidate set A0 to An . A0 is the set of candidates. A constraint Ci can be seen as a function that maps candidates a and input x to a natural number, the number of constraint errors for the candidate as a form for the input. Let ei = min{Ci (a, x) : a ∈ Ai−1 }. We can now define Ai as follows. Ai = {a ∈ Ai−1 : Ci (a, x) = ei }
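Read procedurally, this recursion is a cascade of filters over the candidate set. A minimal sketch, assuming that constraints are given as error-counting functions of a candidate and an input, is the following.

```python
# Sketch: the recursion A_0 ... A_n as successive filtering of the candidate set.
def optimal_candidates(constraints, candidates, x):
    """constraints: error-counting functions C_i(a, x), strongest first."""
    A = list(candidates)                     # A_0
    for C in constraints:
        e = min(C(a, x) for a in A)          # e_i
        A = [a for a in A if C(a, x) == e]   # A_i
    return A                                 # A_n: the optimal candidates
```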
Optimality theory guarantees that Ai is inhabited for any i ≤ n. The optimal candidates are given by An and An will nearly always be a singleton set for the constraint systems considered. As an example take the constraint subj = one. This would map a set of candidates Ai to the subset of Ai where the subjects stand in the first position. If Ai has no such candidates, it would just return Ai itself. The algorithm represents the sets Ai by a single representation Si such that Ai = {X : X < Si }, where Si is an underspecified structure and X is any fully specified structure that is underspecified ( x: take elements f and right-adjoin them to the highest closed bag to which they belong or to the outermost bag if they do not belong to a closed bag. {x1 , …, xfi , …, xn }F becomes
⟨xfi , {x1 , …, ci−1 , xi+1 , …, xn }F ⟩F or ⟨{x1 , …, xi−1 , xi+1 …, xn }F , xfi ⟩F .
cohere(f) assigns the feature closed to structures with the feature f.
f = one concatenates a constituent with the feature f that does not occur in a closed constituent to the left of the whole input structure if there is no left adjoint already in place. In a closed constituent, it left-adjoins it to the closed structure. f = extra does exactly the same but marks the left adjoint as extra, so that g = one can see that the first position is not yet occupied.
The constraints are called in order of strength and directly map an abstract structure to a structure determining the surface form.
13 See Zeevat (2008) for an attempt.
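A much simplified sketch of this procedural reading may be helpful before the worked example below. It operates on flat bags of words only, ignores the cohere/closed bookkeeping, and the procedure names (one, before_x, realise) are mine; it merely shows how calling the constraint procedures once, in order of strength, linearises the main-clause example without backtracking.

```python
# Sketch: constraints as procedures called once, in order of strength.
def one(feature):
    """'f = one': put an F-marked element in the first position if it is free."""
    def proc(placed, rest):
        if not placed:
            for w in list(rest):
                if feature in w[1]:
                    placed.append(w); rest.remove(w); break
    return proc

def before_x(feature):
    """'f < x': F-marked elements leave the unplaced material and are placed
    next, i.e. after everything placed by stronger constraints."""
    def proc(placed, rest):
        for w in [w for w in rest if feature in w[1]]:
            placed.append(w); rest.remove(w)
    return proc

def realise(constraints, words):
    placed, rest = [], list(words)
    for proc in constraints:                 # one pass, no backtracking
        proc(placed, rest)
    return [w[0] for w in placed + rest]

dutch_main = [one('subj'), before_x('main'), before_x('obj'), before_x('vhead')]
words = [('hij', {'subj'}), ('leert', {'v', 'fin', 'main', 'vhead'}),
         ('Jan', {'obj'}), ('zwemmen', {'vcomp'})]
print(' '.join(realise(dutch_main, words)))  # hij leert Jan zwemmen
```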
Example:
(27) Input: {hijsubj , heeftmain , Mariaobj , {laten, Janobj , {leren, zwemmenvcomp }vcomp }vcomp }
one=subj ⟨hijsubj,one , {heeftmain , Mariaobj , {laten, Janobj , {leren, zwemmenvcomp }vcomp }vcomp }⟩
main < x ⟨hijsubj,one , heeftmain , {{Mariaobj , {laten, Janobj , {leren, zwemmenvcomp }vcomp }vcomp ⟩}⟩ obj < x ⟨hijsubj,one , heeftmain , Mariaobj , {{Janobj , {laten, {leren, zwemmenvcomp }vcomp }vcomp }⟩
obj < x ⟨hijsubj,one , heeftmain , Mariaobj , Janobj , {{{laten, {leren, zwemmenvcomp }vcomp }vcomp }⟩ vhead < x ⟨hijsubj,one heeftmain Mariaobj , Janobj , laten, {{{lerenzwemmenvcomp }vcomp }vcomp
vhead < x ⟨hijsubj,one heeftmain Mariaobj , Janobj , laten, leren, {{{zwemmenvcomp }vcomp }vcomp ⟩
This gives one linearisation: hij heeft Maria Jan laten leren zwemmen. The other is produced by the pending constraint. vcomp < x ⟨hijsubj,one , heeftmain , Mariaobj , Janobj , {{laten, ⟨zwemmen, {leren}vcomp }vcomp ⟩
vcomp < x ⟨hijsubj,one heeftmain Mariaobj , Janobj , {⟨{⟨zwemmen{leren}vcomp ⟩, laten⟩}vcomp
This gives the second linearisation: hij heeft Maria Jan zwemmen leren laten. (28) is an example for German (from Haider (1991)). (28) Buecher gelesen habe ich schon viele schoene. books read have I already many beautiful.
An input structure for this sentence is given in (29). This chapter has nothing specific to say about how to obtain this structure from a semantic
representation in higher level production: apparently, German can break up semantic constituents into parts that are contrastive topic and parts that are not. Once the structure is available, it can be used for surface production.
(29) Input: {habemain , ichsubj , schonmod , {viele schoenedp }, {{Buechern }}gelesen}ct }vcomp }
one = ct ⟨{Buecher gelesen}ct,vcomp,one , {{habemain ichsubj schonmod {viele schoene}dp ⟩} v[fin,main] < x ⟨{Buecher gelesen}ct,vcomp,one , ⟨habemain , {{ichsubj schonmod {viele schoene}dp }⟩
prom < x ⟨{Buecher gelesen}ct,vcomp,one ⟩, ⟨habemain , ⟨ichsubj {schonmod {viele schoene}dp }⟩⟩⟩
prom < x ⟨⟨Buecher, {gelesen}ct,vcomp,one ⟩, ⟨habemain , ⟨ichsubj , ⟨schonmod , {⟨viele{schoene}dp ⟩}⟩⟩⟩⟩ prom < x ⟨⟨Buecher{gelesen}ct,vcomp,one ⟩, ⟨habemain , ⟨ichsubj , ⟨schonmod , {⟨viele, {schoene}dp ⟩}⟩⟩⟩⟩
An English example:
(30) Doesn’t he like swimming?
one = wh ⟨doesn’tneg,wh,one , {like, swimmingvcomp }, hesubj }⟩
subj < x ⟨doesn’tneg,wh,one , hesubj , {{like, swimmingvcomp }}⟩
vhead < x ⟨doesn’tneg,wh,one , hesubj , like, {{swimmingvcomp }}⟩
The word-order rules also enable an interpretation in terms of ‘prominence weights’. In such an interpretation, word-order constraints do not build structure but assign weights to words and constituents of input structures, and constituent weights influence the weights of their constituents. There are various ways of implementing this, including the following one:
Start the process with a variable r set to 1.
f < x: assign r/2 as the value of the weight attribute to all constituents with the feature F and set r to r/2.
f = one: if the input already has a constituent with weight 1, skip. Else find a constituent with feature F and give it weight 1.
f = extra: if the input has a constituent with weight 2, skip; else assign weight 2 to a constituent with feature F.
f > x: assign weight r/100 to F and set r to r/2.
Words now stand in a linear order given by their derived weights, which are recursively defined as in (31).
(31) Derived Weights
If a constituent D with weight z is an immediate constituent or left adjunct of C with derived weight w, it has derived weight w + wz.
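The recursion in (31) is easy to trace with toy numbers. The sketch below uses invented weights and the assumption, not stated in the text, that the outermost constituent's derived weight is simply its own weight; it only illustrates how derived weights propagate through a constituent structure.

```python
# Sketch of definition (31): propagating derived weights through a structure.
def derived_weights(node, parent_derived=None, out=None):
    """node = (label, weight, children); returns {label: derived weight}."""
    label, z, children = node
    if out is None:
        out = {}
    w = z if parent_derived is None else parent_derived + parent_derived * z
    out[label] = w
    for child in children:
        derived_weights(child, w, out)
    return out

# a toy clause with invented inherent weights
clause = ('clause', 1, [('NP', 0.25, []),
                        ('VP', 0.125, [('V', 0.25, []), ('NP2', 0.125, [])])])
print(derived_weights(clause))
# {'clause': 1, 'NP': 1.25, 'VP': 1.125, 'V': 1.40625, 'NP2': 1.265625}
```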
Derived weights respect the borders of constituents. The total additional weight of the other elements of a constituent is bounded by half of the constituent weight. prom can be implemented by assigning an inherent weight as a vector of weights coming from prominence dimensions. This delivers only a partial order. The vector can be entered into a special position regime with one = f assigning a bonus value, and v[fin,main] prom
For this example, monitoring can be limited to features θ and top, where θ is more important than top: θ > top
Welches Maedchen liebt Peter? is optimal for both of the inputs ?x(girl(x) ∧ love(p, x)) and ?x(girl(x)∧love(x, p)). Monitoring is then trivial since there is no other expressive option (due to the verb-second rule in German, which
is not described by these constraints, there is no other ordering option). The prediction is therefore that the form is indeed ambiguous. As we saw, if we use only faith > prom, then in (66b) both Maria liebt Peter and Peter liebt Maria are optimal for love(m, ptop ) even if Peter is topic. Only one of the forms, however, meets the requirements of θ-monitoring: Peter liebt Maria is equally good for love(ptop , m) for which Maria liebt Peter is not possible. (66c) is decided by the case marker checked by faith. In (66d) the reading eat(h, gtop ) is the only one that comes up, due to the low semantic probability of the competing reading that grass is eating the horse. (66d) may well point to the general solution. NP1 V NP2 supports SVO-readings more than it supports OVS-readings, since being the first NP is a strong cue for being the subject. The cue can be explained by the frequent coincidence of subject with topic and by the fact that subjects tend to be more agentive than objects, another way to acquire the prominence that puts them in the beginning of the sentence. The SVO-reading therefore normally wins unless something else (e.g., informative case marking, agreement, or the semantic improbability of SVO) prevents it. If this is the case, the interpretation OVS is ruled out in mat’ ljubit doc’ since nothing indicates that it should be interpreted as OVS; NP1 V NP2 prefers an SVO-interpretation, which prevents the self-monitoring speaker from producing it for an OVS input. When we adopt this general solution, automatic self-monitoring for word-order freezing becomes identical to automatic self-monitoring for optional discourse markers: the most probable interpretation is seen as leading to an unintended interpretation for the form and, as a result, the form is blocked by monitoring for the intended meaning. The freezing effect is then just that SVO is the remaining interpretation. A suspension of the freezing effect under parallelism also fits well with this approach to word-order freezing. Parallelism provides a strong cue against the standard ordering, allowing the non-standard order to win. This fits with a probabilistic interpretation of self-monitoring. The following would be an example (but I am uncertain about the intuition, preferring an SVO reading for the b-example). (67)
Who do the boys love?
a. Maria liebt der jüngere.
Mary, the younger one loves.
b. Ina liebt Johann.
Ina, Johann loves.
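The monitoring step itself can be illustrated with a small sketch: a form passes θ-monitoring for an intended meaning only if that meaning is the most probable interpretation of the form. The probabilities below are invented for illustration; they merely encode the SVO preference for NP1 V NP2 discussed above.

```python
# Sketch (invented probabilities): self-monitoring as a filter on candidate forms.
P_INTERPRETATION = {
    'Maria liebt Peter': {'love(m,p)': 0.9, 'love(p,m)': 0.1},
    'Peter liebt Maria': {'love(p,m)': 0.9, 'love(m,p)': 0.1},
}

def passes_monitoring(form, intended):
    """theta-monitoring: the intended reading must be the most probable one."""
    dist = P_INTERPRETATION[form]
    return max(dist, key=dist.get) == intended

def monitored_forms(candidate_forms, intended):
    ok = [f for f in candidate_forms if passes_monitoring(f, intended)]
    return ok or candidate_forms     # if nothing survives, monitoring cannot help

# both orders are grammatical for love(m,p), but only one survives monitoring:
print(monitored_forms(['Maria liebt Peter', 'Peter liebt Maria'], 'love(m,p)'))
# -> ['Maria liebt Peter']
```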
3.3. Pronouns and Ellipsis
An intuitively appealing contribution to the theory of pronouns (or the NP selection problem, as it is called in natural language generation) is the insight that there is a sequence of defaults. This idea has been present in natural language generation in various forms for a long time but became best known as the referential hierarchy (Gundel et al., 1993). In this approach, a hierarchy of psychological categories is defined and an alignment of the categories with NP forms is predicted and tested. For the needs of natural language generation the hierarchy should be extended with, e.g., first and second person and reflexivity and reciprocality, because those categories can determine forms which are not part of the hierarchy and potentially could destroy its elegance. (68) seems a reasonable version for Dutch; in other languages, reflexive should be equally high as first and second, or even higher. (68)
This hierarchy supports the inference that if a form is used that aligns with one category, the higher conditions do not apply to the referent. For example, Grice’s example (69) implicates that she is not the speaker, the hearer, Bill, in focus, activated, familiar, or uniquely identifiable because indefinite descriptions align prototypically with referential and type identifiable. (69) I saw Bill in town with a woman.
At the same time, the use of the best aligned form is far from obligatory. All linguists know of counterexamples to the obligatory use of reflexives for the subject of the same clause. Examples of not using first or second person (for expressive reasons and for politeness) are easy to find and construct. There are many cases where the use of third person pronouns for in focus referents is best avoided, and it may be pointless to let the hearer identify familiar, activated, or uniquely identifiable entities. In all such cases, forms that align with lower categories are then used instead. Some relevant examples are given in (70). (70) My guru and his disciple. Your humble servant. Everybody voted for John. Even John voted for John. (In the mirror). I like you. John and Bill came to visit. John/*He … A waiter/The gray haired waiter/The guy who you met last year at the kindergarten explained the menu.
self-monitoring
91
What these examples show is that it would be unrealistic to hope for strict rules that would describe the choice of the form of the NP. This point is also supported by the empirical data given in Gundel et al. (1993): no item in any of the languages they study is strictly aligned with any category in the referential hierarchy. That makes it plausible to think of the referential hierarchy as part of a partial order of semantic features for monitoring. Treating the referential hierarchy as part of a monitoring hierarchy helps with the issue of non-strict alignment but more generally, it is the case that the referential hierarchy is a clearly important generalisation that does not seem to fit anywhere in most grammar designs. It is, however, helpful to extend the hierarchy with two features: id would be a feature that has the referent of the NP as its value (so that monitoring for it amounts to a requirement that the hearer can identify the referent), while polite takes into account the social relation between the speaker and the adressee and of the speaker and the hearer to the referent, and tests whether, given that relation, the NP is acceptable. This rules out familiar ways of expression if the relation is not one of intimately acquainted equals.10 This results in (71). (71) ID > POLITE > FIRST > SECOND > REFLEXIVE >RECIPROCAL > IN FOCUS > ACTIVATED > FAMILIAR > UNIQUELY IDENTIFIABLE > REFERENTIAL > TYPE IDENTIFIABLE
The treatment should include a specification of which NP expresses what feature, including politeness features. For Dutch, that should be approximately as in (72). (72) FIRST: SECOND: REFLEXIVE: RECIPROCAL: IN FOCUS: ACTIVATED: FAMILIAR: UNIQUELY IDENTIFIABLE: REFERENTIAL: TYPE IDENTIFIABLE:
ik, wij, mij, me, ons jij[fam], jou[fam], je[fam], u, jullie [fam] zich, zichzelf elkaar ie [fam], hij, zij, hem, ’m, ze, het die, die N, deze N die N [fam], de N de N die N, een N, N+pl een N, N+pl
Ellipsis for subject and object NPs is an interesting additional case that is formally different. In the monitoring view, it holds of the overt NPs that 10 Languages differ a good deal in ways of defining the relation that forces a polite way of expression and there may even be different relations involved, some involving third persons. That also implies that the feature of familiarity needs more refinement.
92
chapter three
they (or the articles and demonstratives they contain) are there due to selfmonitoring features because these devices are the only ones that realise the relevant input features. Zero subjects, on the other hand, result from the general possibility of ellipsis. This can be expressed by restricting the faithfulness of expression (the max-constraint family) to what is not currently in joint attention and letting hard syntax and lexicon decide about what is to remain. Zero expression may pass self-monitoring because of its complementarity with other expressive devices. In Italian, for example, a subject that is not already topic (though in focus in the sense of the hierarchy) needs stress and therefore a phonological realisation by a non-clitic pronoun. This makes an Italian zero subject an expression of a topical element that already is in focus. And in Chinese, non-protagonists need to be overtly realised, which makes zero an expression of the protagonist role. Self-monitoring is also important for the generation of the nominal part of definite, demonstrative, and indefinite descriptions. For activated and familiar, the nominal part should contain enough information to make bridging effective or to retrieve the activated object from the context. For uniquely identifiable unfamiliar objects, enough material should be assembled to make a unique definition not only possible but also recognisably unique. Something similar may well hold for the categories of referential and type-identifiable. If bridging is supposed to happen, the choice of the noun must give a link to an in-focus referent. One way to understand Evans’ observation (Evans, 1977) that indefinites in the context of their clause normally supply a unique description, is to see it as a requirement that enough material is included in the noun to make it the case that if a clause “A(an N)”11 is uttered, “the N such that A” is a definite description of the referent. It is, however, hard to explain why the preferred choice should then not be “A(the N such that A)”. For example, (73a) would be better expressed as (73b). (73) John saw a girl in the park. John saw the girl that he saw in the park in the park.
A better alternative view is to assume that lexical material is collected in utterance planning under the viewpoint of relevance to the hearer, which includes the issue of whether the hearer needs to be able to identify the 11 Thereby making it the case that “an N” is still equivalent with “one N”, its etymological origin. When uniqueness fails, the indefinite use of “this N” is a better way of expressing identifiability by the speaker.
object independently of the current utterance. If the material is sufficient for identification (given the context, given the possibility of bridging, or by uniqueness of description), the definite article is chosen, otherwise the indefinite. The most important point of this section is that natural generalisations expressed in the referential hierarchy or in similar work in natural language generation can be incorporated straightforwardly in a linguistic description of NP selection as part of the specification of self-monitoring. An interpretation in terms of generation rules—e.g., in production OT constraints— would wrongly predict that one has to use ‘me’ or ‘she’ whenever that is allowed by an input feature and that is wrong. Most linguistic frameworks seem to be unable to accommodate this kind of generalisations. If a grammar is seen as a mapping from semantic inputs to correct expressions of those inputs, it is possible to incorporate an obligatory choice of NPs, roughly along the lines of: always choose the highest NP in the hierarchy that is possible given the input. But that does not do justice to the fact that this is not a hard rule. If a grammar formalism merely defines a relation between semantic representations and forms, it cannot capture the effects of the referential hierarchy. 3.4. Differential Case Marking The term differential case marking covers both optional case marking and obligatory case marking on only certain nouns or categories of nouns. An example of the second phenomenon is the lack of accusative marking on NPs in English except for some of the animate personal pronouns. Proper optionality in case marking happens in spoken Japanese where nominative and accusative case markers can be omitted. Differential case marking is closely related to word-order freezing, as discussed in section 3.2. In fact, word-order freezing could be treated as a special kind of case marking. Untypically, it is neither subject nor object marking, but simultaneous subject-and-object marking. Aissen (2003a) however shows that simultaneous subject-and-object marking is necessary for the treatment of the morphology of Comanche. Unlike standard differential case marking, word-order freezing (or Comanche morphology) does not fall out from Silverstein’s typological generalisation (as quoted in Aissen (1999)). Silverstein’s generalisation states that prototypical objects/subjects are less likely to be marked, while untypical objects/subjects are more likely to
be marked. Aissen (1999, 2003b) gives an optimality-theoretic formalisation in terms of prominence. Prominence is a partial order over NPs that can be computed as the product of several smaller orderings: an animacy ordering (human > animate > non-animate), a person ordering (1 > 2 > 3), an activation ordering (local > pro > lexical), a definiteness ordering (definite > specific > unspecific), and an ordering topic < focus based on information structure. Highly prominent subjects and low-prominence objects are the standard case. Aissen's optimality-theoretic formalisation uses the constraint set in (74).

(74) *X&Subject&∅ (a subject of class X needs to have case marking) and
     *X&Object&∅ (an object of class X needs to have case marking)
with a fixed ordering given by (75).
(75) *X&Object&∅ < *Y&Object&∅ if and only if X is more prominent than Y.
     *X&Subject&∅ < *Y&Subject&∅ if and only if X is less prominent than Y.
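As an illustration of how prominence can be computed as the product of the smaller orderings, the following sketch compares two NP classes; the encoding of the scales as feature bundles and the two example NPs are invented for illustration and not part of Aissen's formalisation.

    # A sketch of prominence as a partial order: the product of the
    # smaller orderings listed in the text.  Feature bundles are invented.

    SCALES = {
        "animacy": ["human", "animate", "non-animate"],       # high to low
        "person": ["1", "2", "3"],
        "activation": ["local", "pro", "lexical"],
        "definiteness": ["definite", "specific", "unspecific"],
    }

    def more_prominent(np1, np2):
        """np1 outranks np2 iff it is at least as high on every scale and
        strictly higher on at least one (hence only a partial order)."""
        r1 = [SCALES[s].index(np1[s]) for s in SCALES]
        r2 = [SCALES[s].index(np2[s]) for s in SCALES]
        return all(a <= b for a, b in zip(r1, r2)) and r1 != r2

    first_person_pronoun = {"animacy": "human", "person": "1",
                            "activation": "pro", "definiteness": "definite"}
    indefinite_noun = {"animacy": "non-animate", "person": "3",
                       "activation": "lexical", "definiteness": "unspecific"}

    if more_prominent(first_person_pronoun, indefinite_noun):
        # By Silverstein's generalisation, the prominent class is the
        # atypical object (most in need of object marking) and the
        # low-prominence class the atypical subject (most in need of
        # subject marking).
        print("pronoun class: atypical object, preferentially object-marked")
        print("indefinite noun class: atypical subject, preferentially subject-marked")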
The language-particular patterns can now be captured by a *structure constraint, forbidding case marking, which can take a position with respect to the case-marking constraints. Case marking is optional when *structure is not ranked with respect to one of the constraints that enforce case marking.

Zeevat and Jäger (2002) show that there is a direct relationship between frequency statistics over corpora and the classes X&Object or X&Subject. In particular, if X is more prominent than Y, p(Object|X) < p(Object|Y) and p(Subject|X) > p(Subject|Y). In other words, the Aissen constraints can be read as cue constraints for interpretation: one can infer Object or Subject from X with a certain probability. Here, X is the observable NP class and Subject and Object (or the θ-roles that lead to these syntactic functions) the inferable property. That makes case marking similar to the treatment of particles discussed in section 3.1. If a wrong case is cued by the NP class and the language has an optional case marker, it has to be inserted by self-monitoring. Interpretational cues are counteracted by the much stronger cues connected with case (p(Subject|Ergative) = p(Object|Accusative) = 1). Another important factor is that the roles are mutually exclusive, that is, p(Subject&NP1|Object&NP2) = p(Object&NP1|Subject&NP2) = 1.

Other strong potential cues for subject and object are provided by agreement, voice, indirection, and word order, and these play an important role in Aissen's discussion of the typology of differential subject and object marking. As observed in Zeevat and Jäger (2002), these cues create a problem for the speaker: if their combined effect does not lead to a decision or results in a wrong thematic assignment, the hearer may well misunderstand the utterance. Automatic self-monitoring then predicts that the speaker will select a marker if there is one. The monitoring feature is the same as for
word order freezing, namely, θ-role. (In fact, word order is the last resort if no case markers are available.) Communication failure caused by difficulties with identifying the binders of the θ-roles in NP1NP2V is—as can be inferred from the data in Aissen's typology—an important source for emerging grammaticalisations. The strict word order in French and English can be seen as a grammaticalisation of word order under the influence of word-order freezing. The obligatory case marking systems of Latin, Russian, or Sanskrit12 are the outcome of a process of extension of optional case marking. Ergative systems can sometimes be traced back to passive structures (see, e.g., Sackokia, 2002). Obligatory passivisation in Sioux (Aissen, 1999) when the object outranks the subject in prominence can also be seen as the outcome of an optional strategy (which Bresnan et al., 2001 shows to be operational in English as well).

12 The processes involved are lost in the mists of time since these three languages inherit their case system from Indo-Germanic.

The emergence of these patterns can be related to two processes. First of all, a rational self-monitorer will overmark rather than undermark: undermarking leads to communication failure, while overmarking does not matter. (The penalty on undermarking and the resulting inhibition of undermarking, without a similar penalty on overmarking, automatically creates a small overmarking effect.) The creeping progress of Spanish object marking documented in Aissen (2003b) can be attributed to this strategy. But overmarking also reinforces itself, because the strength of cue constraints changes with the progress of marking and leads to further marking.

Consider the following example, which could be a reconstruction of the way the Sioux pattern emerged. Suppose NP1 is more prominent than NP2, NP2 is the intended agent, and the verb involved can receive disambiguating passive morphology: p(Subject|NP2&Passive) = 1. p(Subject|NP2) < p(Object|NP2) by prominence, and without any marking p(Subject|NP2) = p(Subject|NP2&¬Passive) and p(Object|NP2) = p(Object|NP2&¬Passive). Depending on other factors that may contribute to disambiguation, automatic self-monitoring predicts that passive morphology is sometimes selected. But this will change p(Subject|NP2&¬Passive). It becomes smaller because the passive cases have to be taken out of the frequencies. So a low-prominent NP2 in a non-passive sentence becomes an even stronger cue for NP2 being an object. Automatic self-monitoring will then produce even more passive morphology, which in turn will lead to an increase in cue strength. A combination of the overmarking strategy and changes in the cue strength will in
the long run always lead to complete passive marking—though the run can be very long. The cue strength and the marking frequency can tend towards a limit, but overmarking will always push marking over that limit again. This process can be blocked by reliable grammaticalised marking elsewhere (on the verb or the other NP), which would push automatic self-monitoring of θ-roles out of business.

Which particular markers are available for such a growing role in expressing thematic relations seems largely accidental (the exception being word order, which is always available). Recruitment of lexical roots for a grammatical role or recruitment of functional inventory for a new role can be modelled if the item in its old role is already a cue for the new function13 and there is a need to mark the new role (Zeevat, 2007). There is also no functional reason for the processes that lead to the loss of functional markers. Affixes tend to lack stress and can erode away, while creolisation processes easily destroy functional strategies (in a heterogeneous language community, the strategy of morphological marking becomes unreliable). These brief remarks lead to hypothesis (76).

(76) Automatic self-monitoring explains the recruitment and extension of functional items.
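The self-reinforcing dynamics just reconstructed for the Sioux case can be illustrated with a toy simulation; all numbers are invented, and the only point is the qualitative one made above: taking the passive cases out of the frequencies lowers p(Subject|NP2 & ¬Passive) on every round, so self-monitoring selects passive morphology more and more often.

    # Toy simulation of the cue-strength feedback loop described above.
    # The starting probability and the marking rule are invented values.

    p_subject_np2 = 0.30    # baseline p(Subject | NP2) for a low-prominence NP2
    marking_rate = 0.0      # fraction of NP2-subject clauses realised as passive

    for generation in range(8):
        # p(Subject | NP2 & not-Passive): the passive clauses are taken out.
        unmarked_subject = p_subject_np2 * (1 - marking_rate)
        unmarked_object = 1 - p_subject_np2
        cue = unmarked_subject / (unmarked_subject + unmarked_object)
        print(f"generation {generation}: p(Subject|NP2, not-Passive) = {cue:.3f}, "
              f"passive marking rate = {marking_rate:.3f}")
        # Self-monitoring marks in proportion to the risk of misunderstanding.
        marking_rate = 1 - cue

Running this, the marking rate climbs towards 1 while the unmarked cue for subjecthood collapses, which is the runaway that eventually yields obligatory passive marking.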
This hypothesis turns automatic self-monitoring into the motor of grammatical evolution. Automatic self-monitoring by itself assumes the role of a living functional evolutionary pressure that goes into action in nearly every utterance production.

3.5. A Case for Phonological Self-Monitoring?

Boersma (2007) presents a case that could be taken as a counterexample to purely production-oriented OT phonology (as in the architecture of Hale and Reiss, 1998, or the 'motor theory of speech perception' of Liberman et al., 1967) and as an argument in favour of integrating an optimality-theoretic account of perception into a looping architecture of full production and perception (as in the final proposal of Boersma, 2007). Boersma's case can also be seen as an argument for automated phonological self-monitoring in the sense of this chapter. The following recapitulation follows the paper closely.
13 The preposition ‘a’ in Spanish, which is an optional object marker for animate objects in modern Spanish, cues the absence of most proto-agent properties of the argument in action, thus preventing an agent interpretation of the argument.
The case is the French silent h,14 which cannot be heard except through its interaction with other phenomena. The article le drops its schwa before a vowel-initial word (e.g., état, which is pronounced [letat]) but does not drop the schwa when the (abstract) word starts with a silent h, so that, for example, le hazard is pronounced as [ləazar], that is, in exactly the same way as when le combines with a consonant-initial word. Quel état is pronounced by moving the final l of quel to the first syllable of état, but in combination with hazard the l stays where it was and the first syllable ha of hazard can start with a pause, a creak, or a glottal stop. In combination with the plural article, as in les états, one gets [lezetat], but les hazards is pronounced as [lɛazar]. And finally, the h seems responsible for the failure of schwa-drop for the feminine article une, which becomes [yn] in combination with femme or idée but keeps its schwa in combination with hausse or indeed hache, which are pronounced [ynəos] and [ynəaʃ].

14 The orthography is sometimes misleading. The French word homme has no silent h and its underlying form is /om/. It patterns with état when combining with other words. The h is written for historical reasons in these cases.

This phenomenon can be treated in generative phonology or in OT production phonology with special constraints for French, but neither treatment is explanatory. If one limits oneself to universal constraints in an optimality-theoretic treatment, one gets something along the lines of the following:

… {max(UC), DEP(ə)} >> max(h) >> *ʔ >> *ə >> {max(V), *CC} >> max(EC)
The missing part at the beginning is there to ensure that not too many schwas are dropped. max(UC) ensures the realisation of obligatory consonants (with the omittable ones projected by max(EC)), while DEP(ə) prevents schwas in the output if they are not in the input. max(h) forces the realisation of h as a glottal stop, pause, or creak. *ʔ removes glottal stops, *ə removes schwas, *CC penalises two adjacent consonants, and max(V) realises the underlying vowels in the output. Interestingly, this results in an incorrect treatment of the silent h, as the following tableaux show.

le hazard        …   max(h)   …
  ləʔazar
  ləazar               *
  lazar                *

quel hazard      …   max(h)   …
  ke.lazar             *
  kel.ʔazar
  kel.azar             *

les hazards      …   max(h)   *CC
  lɛzʔazar                      *
  lɛʔazar
  lɛazar               *

une hausse       …   max(h)   *ə
  ynos                 *
  ynəos                *        *
  ynʔos
  ynəʔos                        *
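The failure can be reproduced with a small evaluation sketch of the fragment max(h) >> *ʔ >> *ə of the ranking above; the violation counting is simplified, only the le hazard candidates are evaluated, and the candidate strings are written with ə for schwa.

    # Evaluate the 'le hazard' candidates under max(h) >> *GLOTTAL >> *SCHWA.
    # Python tuples compare lexicographically, which mimics strict ranking.

    def violations(candidate, has_underlying_h=True):
        max_h = 1 if has_underlying_h and "ʔ" not in candidate else 0  # h unrealised
        glottal = candidate.count("ʔ")   # *GLOTTAL violations
        schwa = candidate.count("ə")     # *SCHWA violations
        return (max_h, glottal, schwa)

    candidates = ["ləʔazar", "ləazar", "lazar"]
    print({c: violations(c) for c in candidates})
    print("predicted winner:", min(candidates, key=violations))

The sketch selects [ləʔazar], whereas the actual French form is [ləazar], which is exactly the problem discussed next.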
The only correct production is [kel.ʔazar]. The candidate productions [ləazar], [lɛazar], and [ynəos] should win but do not, because they violate the constraint max(h). max(h), on the other hand, cannot be demoted, because it is the source of the special behaviour of the silent h. Without it, hazard would reduce to [azar], hausse to [os], and the winners would be [lazar], [ke.lazar], [lɛzazar], and [ynos] respectively.

Boersma's provisional solution is to reinterpret max(h) as an interpretation constraint: h does not need to be realised in articulation but must be inferable in perception. The idea is that the inference is produced by hiatus (VV) or by any of the ways in which the silent h can be realised (creaks, glottal stops, and pauses). Boersma then explores other architectures for dealing with this problem and settles for one of them. This ends the recapitulation of part of Boersma (2007).

The proposal of this chapter is closely related to making the input feature h a monitoring feature. A better choice than the h itself, however, would be to let the monitored feature be the emptiness or non-emptiness of the onset of a syllable. Hiatus marks its second vowel as the start of an onsetting syllable, and so the varying realisations found between kel and azar can all be seen as indicating an onsetting second syllable. The profile of elision,
liaison, enchainement, and schwa-drop, on the other hand, indicates syllables without an onset. This property distinguishes syllables in French, and French would be special in having onsetting syllables with an inaudible onset.

It could, however, be doubted whether the silent h is part of speaker self-monitoring in the same sense as in the other applications discussed above. First of all, the solution is not formulated in terms of semantics, but in terms of recognising phonological structure. There is no hétat alongside état, no ausse next to hausse, and no hidée next to idée; that is, there is no ambiguous signal, since it would be almost completely disambiguated by the lexicon.15 The description only becomes similar to self-monitoring for other features if the underlying phonological form is taken to be the semantics of a phonetic structure. That would be a well-motivated step. Human subjects can recognise nonsense words and learn new words from their use in books and conversations. An underlying phonological structure is needed as the object that is perceived in these tasks, and it can be defended that a phonological structure is an additional level at which speaker and hearer converge.

15 Of the 244 French nouns starting with h in http://www.limsi.fr/Individu/anne/Noms.txt, just 2 (hâle and hanche) had a noun homophone without h.

A second problem with assimilating this particular case to monitoring is that it seems to require a ranking of monitoring between proper production constraints: mon(h) must be dominated by DEP(ə) and must dominate *ə and *ʔ. While in all examples of automatic self-monitoring one needs to assume constraints that dominate monitoring, it is not necessary to suppose that monitoring constraints dominate production constraints. One can perhaps defend this domination of production constraints in a reformulation of the constraint set in the way of Chapter 2 Syntax, that is, as constructive procedures. In such a reformulation, economy effects typically become emergent (they correspond to possible structure in the output form that is just not constructed because there is no reason for its construction) and self-monitoring can prevent the economy effects from occurring. However, many unclarities about a procedural reconstruction of OT phonology remain, and this issue must be postponed to future research.

Perrier (2005) reports effects in articulation that can be interpreted as aiming for acoustic targets that increase understandability. This would be a simpler argument for automatic self-monitoring in phonology. A basic account of articulation should make many variations of articulation optimal, and monitoring for phonetic features would deselect those optimal
articulations that could be misunderstood. It is not clear, however, how the cases discussed by Perrier could be formalised. The issue of whether there is automatic self-monitoring in phonology and articulatory phonetics that is analogous to the syntactic and lexical self-monitoring studied in this chapter must therefore remain undecided for the moment. On the other hand, the conclusion that both the silent h and the Perrier effects involve some form of automatic self-monitoring seems unproblematic, also in the light of the good evidence for phonological self-repair in Levelt (1983).

3.6. Conclusion

This chapter presented empirical evidence from linguistics that supports the assumption of automatic self-monitoring. The case is indirect and its starting point is the assumption that language production can be understood as a mapping from a (possibly enriched) semantic representation that does not need other resources apart from grammatical constraints which look at the candidate utterance, the semantic representation of the message, the lexicon, and the context of utterance.16 Under the further assumption that the structure of the mapping is explanatory (articulated within optimality theory as the demand that all constraints must be universal), there is still an important range of descriptive problems in natural language that cannot be handled by such a mapping unless it gets access to what an interpreter will do with a candidate production. And it would appear that there is a rather large class of such cases and that some of them significantly influence the structure of sentences.

The argument does not tell us how automatic self-monitoring is implemented. Usually, it is possible to come up with often rather complicated language-specific constraints which capture the effects of self-monitoring. An argument against this automatisation of automated self-monitoring is the soft edge of all automated self-monitoring applications: proper automatisation would predict a harder edge. An argument in favour of automatisation is the grammaticalisation of automatic self-monitoring into hard rules. This would, however, require a smooth transition between an automatic self-monitoring account of the phenomenon and a grammatical account.
16 The context of utterance is needed to define when anaphoric devices are possible at all, i.e. when the context has an antecedent for the device. That makes the context necessary for hard syntax, even if monitoring can in the end decide against the device.
An alternative account of the transition is available in the form of re-analysis in language learning, where for the learner nearly total marking by automatic self-monitoring is indistinguishable from grammatical marking. To make more progress with these issues, psychological investigation of automatic self-monitoring would be needed. In particular, one would need to find out whether automatic self-monitoring causes delays in production or the activation of larger parts of the brain. A direct implementation of self-monitoring by means of interpretation and inhibition of productions predicts that such effects will be found.

An important result of this chapter is the general empirical profile of an automated self-monitoring application, repeated here as (77). It is empirical in the sense of acceptability judgements, but apart from an incomplete look at the corpora, no psychological validation of any of the applications is available. The profile is about optional markers for a feature.

(77) 1. A proper description of insertion of optional markers has to address the question whether the hearer would correctly understand an alternative production without the markers.
2. The description does not select any particular marker (other markers of the same feature would do just as well).
3. A marked form (the utterance with the marker is longer) is marked due to a genuine semantic property, i.e., marking is not controlled by a pseudosemantic feature with various non-transparent aspects, such as French gender or whatever controls the Russian genitive or the English perfect.
4. Marking is not due to syntax.
5. Marking can be obligatory or disallowed but there are also intermediate cases, an optional fringe, cases where one can but need not mark.
6. Violations of obligatory marking due to monitoring lead to a different interpretation.
7. Marking is overt.
8. Marking which is not made obligatory by monitoring is not ungrammatical.
An automated self-monitoring application is, by the hypothesis advanced in Chapter 1 Introduction, a structural accommodation of the speaker to the hearer strategy of choosing the most probable meaning. Grammaticalisation of an automated self-monitoring application has occurred if the language itself has adjusted to the hearer's needs. One would expect that in the earliest phase of such a grammatical accommodation, requirements (1), (2), (4), and (5) are lost while (3), (6), (7), and (8) initially still remain intact. The meaning of the marking (3) can be distorted in later phases, and (6), (7), and (8) can disappear: unmarked versions become just ungrammatical. This shows that only (1), (2), (4), and (5) are sufficient properties
for automated self-monitoring; the others are compatible with grammatical marking.

This characterisation of phenomena that fall under automatic self-monitoring is matched by a general theoretical proposal in which the automatic self-monitoring component is formalised in optimality theory as an additional constraint dominated by all of the production constraints except for the low-ranked economy constraints. In a procedural reinterpretation of the OT syntax of Chapter 2 Syntax, where the effects of those economy constraints are emergent, it suffices to say that proper production constraints override monitoring. This is a very natural interpretation: automated self-monitoring exploits the expressive space given by well-formed expressions of a language for a given input, possibly at the cost of economy.

A competing theoretical account of self-monitoring is bidirectional optimality theory. In such accounts, an optimal form for a meaning is only really optimal if the meaning is also optimal for the form by competition with the same constraints in the reverse direction (Blutner, 2000; Smolensky, 1996). A basic problem with such accounts is that the reverse competition with the same constraint set does not compute the same relation between forms and meanings: that happens only for special constraint systems. The standard counterexample is the rat-rad problem in phonology, but it is very easy to find similar problems with constraint systems from syntax too (see, e.g., Zeevat, 2000). There are two possible solutions: one can demand special, 'symmetric' constraint systems (Boersma, 2001; Smolensky, 1996; Hendriks and Spenader, 2006) or one can prune away the unwanted asymmetric pairs. Blutner (2000) proposes two ways of doing just that.

In this chapter and in Zeevat (2006a), it is argued that these bidirectional approaches result in a version of self-monitoring that is too strong, since they would rule out, for example, that doc' can be a contrastive topic in mat' ljubit doc' or that Welches Mädchen can be the object in Welches Mädchen liebt Peter? The correct view seems to be that automatic self-monitoring is a second round of optimisation, one constrained by the available means of expression. A marking device—e.g., word order—can be used by syntactic constraints or by more important monitoring. Where the expressive means are not available, the production is fully allowed under self-monitoring, but bidirectional optimisation is forced to disallow it. That is precisely the point of Gärtner (2003).

Note that bidirectional optimisation collapses into the view defended in this book. Suppose the bidirectional constraint system assigns to a form in a context meanings which are not the maximally probable interpretations of the form in that context. It would then be empirically incorrect and would
need amendment until the most probable reading of a form F in a context C is in fact the bidirectional winner for F in C. It is unlikely that this amendment strategy can be carried out, but it is just a statement of the empirical requirement on a theory that relates verbal forms and their meanings. So suppose that the required amendments are possible. It would then be possible to factor the grammar into two components: one that determines syntactic correctness and a second one that finds the most probable interpretation in the context. The first component is a version of Chapter 2 Syntax. The second component would be comparable with other means of determining the most probable readings, which could also replace the component (they may be better or more efficient). The phenomena covered in this chapter can then be covered by the second component, which would determine which optimal production gives the best chance of recovering the meaning, with priorities for more important semantic features. This would not disallow the productions that fail with respect to some monitoring features, provided there is no optimal production that does equally well on the more important monitoring features but also marks the less important ones.
chapter four

INTERPRETATION
The aim of this chapter is to present a linear algorithm which produces interpretations based on a Bayesian interpretation scheme. The algorithm constructs the most probable interpretation incrementally, produces pragmatically enriched interpretations, and is highly robust. It uses knowledge resources that human speakers have at their disposal and which may be obtainable from the huge corpora now available to computational linguists. The algorithm deals systematically with one half of linear interpretation (the other half is dealt with in Chapter 2 Syntax) and with bidirection in interpretation. The strategy chosen also explains simulated production in interpretation, incrementality of interpretation, and the gap between production and interpretation. This gives the list of cognitive properties from Chapter 1 Introduction that is repeated here as (78).

(78) 1. coordination in communication
The fact that verbal communication is normally successful in the sense that the utterance by the speaker normally leads to the hearer grasping what the speaker wanted to say.
2. linear production
The fact that formulation of an utterance takes an amount of time that does not explode with the size of the message (as measured, e.g., by the length of the utterance).
3. linear interpretation
The fact that the interpretation of an utterance takes an amount of time that does not explode with the length of the utterance.
4. bidirection in production
The fact that one cannot express a meaning by an utterance that one could not interpret as expressing that meaning.
5. bidirection in interpretation
The fact that one cannot interpret an utterance that one could utter oneself by an interpretation that one could not express by the same utterance (if one were the speaker in the context).
6. explaining simulated production in interpretation
The by now well-established psychological hypothesis that interpreting involves simulated production.
7. incrementality of interpretation
A well-established psychological hypothesis that human interpreters interpret every initial segment of the utterance on all levels of interpretation.
8. the gap between production and interpretation
The observation that what people can say is a proper subset of what they can understand, both for language learners and adult speakers.
The algorithm is an n-best algorithm and as such it is linear and incremental. Simulated production is integrated as a filter in interpretation (more on that below). Overall, the algorithm assigns the most probable interpretation and is thus a faithful implementation of the hearer strategy advocated in Chapter 1 Introduction. It is standard semantics and pragmatics, because it delivers logical representations with proper truth conditions and reconstructs conversational implicature, presupposition, rhetorical structure, anaphora, and ellipsis. In the second half of this chapter, it will be explained how the pragmatic effects are obtained. In Chapter 5 Mental Representation, it will be demonstrated that the linked sets of contextualised representations that the algorithm produces form a logical formalism with standard truth conditions that can give a logical representation of contexts, the contents of utterances, and the conversational contribution of the utterances. For the purposes of this chapter, one can see the sets of linked concepts as just another way of representing content.

The proposal belongs to a tradition started by Jerry Hobbs's abductive framework for interpretation (Hobbs et al., 1990), but it also incorporates elements of the data-driven models of more recent AI. This chapter also sketches the possibility of integrating information obtained from other perception with natural language interpretation.

As an emulation of Bayesian interpretation, the proposed algorithm is part of the explanation of normal coordination in verbal communication. Bayesian interpretation leads to the most probable interpretation of an utterance, and the algorithm thus incorporates the hearer strategy of choosing the most probable interpretation. Speaker adaptation to the hearer strategy, the subject of Chapter 3 Self-Monitoring, predicts that communicative success is normal even if linguistic signals are highly ambiguous.

As will become clear later on, the interpretation algorithm is highly tolerant of imperfect input and of imperfections in recognising the signal. It works on cues provided by the input and tries to integrate them into a picture of the speaker's intention in producing the input. Speaker errors in syntax and morphology lead to low scores in simulated production but do not disrupt interpretation as long as there is no alternative way of integrating the cues that would lead to better values in simulated production. Speaker errors can postpone a decision on a badly perceived part of the input, but the algorithm can recover by using
predictions arising from the left and right context to repair the part which contains the error.

As a highly robust algorithm, the interpretation algorithm proposed here also explains the gap between production and interpretation. First of all, it does so due to its robustness: the input can contain syntactic and morphological errors and still be understood as intended, since it is assumed that the speaker would not produce erroneous input on purpose. But secondly, what plays a role in the explanation of the gap is an asymmetry in the cues. Notice that the set of conceptual cues that evoke words in production must correspond to inverse cues from those words and constructions to the same concepts. Without a sufficient strength of the inverse cue, words would not pass automatic self-monitoring. But this mechanism does not hold in interpretation. One can interpret a word or a construction for a particular concept while personally preferring other means of expressing the same concept. The concept 'horse' cues the word 'horse' and the word 'horse' cues the concept 'horse': thereby the word 'horse' is a good way of expressing the concept 'horse'. The concept 'horse' hardly cues the word 'steed', but 'steed' inversely is a good cue for the concept 'horse'. This by itself captures the gap between production and interpretation: concepts and words that mutually cue each other are available both for production and for interpretation, while words cueing a concept for which another word is preferred are available to interpretation only. Simulated production does not need to search for words and constructions; it merely needs to check that they are in the right order and have the right morphology.

The algorithm is linear if one assumes that there is a maximal number of competing hypotheses that are considered by the algorithm at any one time: the n-best approach in stochastic parsing.1 The only rational way to deal with input that is not restricted in size—given that only a finite number of hypotheses can be considered at any one time—is to take all decisions as early as possible. Incrementality and linearity are therefore closely connected. Incrementality also makes it possible to immediately use the strongest priors, those deriving from world knowledge.
1 In, e.g., Ytrestøl (2011) this becomes an approximation only to the most probable legal HPSG parse: it misses legal readings with a small probability. This discrepancy between the theoretical model and the algorithm does not exist in the view presented here. It is the algorithm—and not the HPSG model of the language—that is a model of human interpretation. It is therefore not an approximation to an even better theoretical model. The algorithm can be criticised only on the basis of empirical data about human understanding and from the perspective of not capturing Bayesian interpretation properly.
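To make the n-best idea concrete, here is a minimal sketch; the beam width, the scores, and the extension step are invented for illustration and are not part of the algorithm specified below. At each word every surviving hypothesis is extended and only the n best extensions are kept, so the work per word is bounded and total work grows linearly with the length of the utterance.

    # Minimal n-best (beam) sketch: per-word work is bounded by the beam
    # width, so interpretation time grows linearly with utterance length.

    import heapq

    BEAM_WIDTH = 3   # the maximal number of competing hypotheses

    def extend(hypothesis, word):
        """Return scored extensions of a hypothesis with the next word.
        Placeholder: real extensions would add linked concepts."""
        score, reading = hypothesis
        return [(score * s, reading + [(word, i)])
                for i, s in enumerate((0.6, 0.3, 0.1), start=1)]

    def interpret(words):
        beam = [(1.0, [])]                    # start with one empty hypothesis
        for word in words:
            candidates = [h for hyp in beam for h in extend(hyp, word)]
            beam = heapq.nlargest(BEAM_WIDTH, candidates, key=lambda h: h[0])
        return beam

    for score, reading in interpret(["he", "regrets", "Mary", "left"]):
        print(round(score, 4), reading)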
The algorithm does more than just meet the demands outlined above. It makes specific assumptions about the nature of concepts and about what is stored in the mental lexicon, i.e., about concepts that need to be linked to other concepts. It also makes concrete assumptions about how verbal communication updates the belief state of the interpreter and about the 'syntax' of the objects which code information in those belief states. By selecting linked concepts and their contexts, it thus also makes assumptions about the foundations of natural language semantics. Various consequences of these claims will be explored further in Chapter 5 Mental Representation. For the time being, it suffices to explain the structure and the intuitive semantics of the representation language.

Expressions like John and Girl stand for concepts that do not take arguments and, as occurrences John_i and Girl_j in a context, express the propositions There is somebody who meets the concept John and There is somebody who meets the concept Girl. The occurrences denote the object that makes these propositions true, as in (John_i Girl_j)Meet_k. Here there is a 2-place concept Meet that takes as arguments the denotations of John_i and Girl_j, supplied by the occurrences John_i and Girl_j in the context; as a proposition it expresses There is a meeting between the two denotations of John_i and Girl_j, and as a referring expression it denotes the meeting claimed to exist in the context.

A model w for a context is a pair ⟨U_w, F_w⟩ that assigns objects in U_w to linked concepts and n+1-place relations to the unsaturated concepts. F_w((John_i Girl_j)Meet_k) must be a meeting of F_w(John_i) and F_w(Girl_j), i.e. the three objects must be related by F_w(Meet): ⟨F_w((John_i Girl_j)Meet_k), F_w(John_i), F_w(Girl_j)⟩ ∈ F_w(Meet). Moreover, the earlier occurrences in the context of John_i and Girl_j should make it the case that ⟨F_w(John_i)⟩ ∈ F_w(John) and ⟨F_w(Girl_j)⟩ ∈ F_w(Girl). Contexts as a whole can be arguments of concepts, as happens in propositional attitudes and in the concepts expressing logical operators, like Negation, Implication, and Necessity. In Chapter 5 Mental Representation these will be supplied with a precise semantics, capturing their intuitive meanings. In the examples in this chapter, indices on occurrences will be systematically omitted. This is possible since two occurrences of the same concept really are two occurrences within the interpretation process, unless the process identifies them.
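As an illustration of this representation language, the following sketch mirrors the clauses just given for (John_i Girl_j)Meet_k; the class names and the toy model are invented for illustration.

    # Occurrences of concepts with argument links, evaluated against a
    # model <U_w, F_w>.  The tiny model below is invented.

    class Occurrence:
        def __init__(self, concept, args=()):
            self.concept = concept      # e.g. "John", "Girl", "Meet"
            self.args = list(args)      # earlier occurrences it links to

    # (John_i Girl_j)Meet_k
    john_i = Occurrence("John")
    girl_j = Occurrence("Girl")
    meet_k = Occurrence("Meet", [john_i, girl_j])

    # F_w: occurrences get objects of U_w, concepts get relations.
    F_objects = {john_i: "john1", girl_j: "girl1", meet_k: "meet1"}
    F_relations = {"John": {("john1",)}, "Girl": {("girl1",)},
                   "Meet": {("meet1", "john1", "girl1")}}

    def verifies(occ):
        """True if the tuple of denotations is in the relation for the concept."""
        tup = (F_objects[occ],) + tuple(F_objects[a] for a in occ.args)
        return tup in F_relations[occ.concept]

    print(all(verifies(o) for o in (john_i, girl_j, meet_k)))   # True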
4.1. The Interpretation Algorithm

Let us recall the Bayesian scheme of interpretation as outlined in Chapter 1 Introduction: the task is to find an interpretation I of utterances and turns such that I = argmax_I p(I)p(U|I). p(I) is estimated based on models of the speaker and the world, which means that it can only be estimated on the semantic aspects of the interpretation. The algorithm is constantly maximising prior probability in trying to evaluate a cue or a combination of cues to the meaning of the utterance. It is traversing a large search space guided by prior probability in order to reach a point with a high likelihood.

The process is based on making words, roots, morphs, and multi-word expressions the basic cues for interpretation. These elements cue concepts. Concepts are taken to be mental entities that have truth-conditions in their combination with each other, that is, the notion of a concept is understood from the perspective of truth-conditional semantics. Concepts also fulfil their traditional role of being that which holds together a complex representation, so that, for example, the concept of grasping connects a given grasping subject with a grasped object in an event of grasping and links together the concept of the grasper (e.g., Charles) with the grasped object (e.g., a glass of lemonade). This results in a concept of a grasping event in which Charles grasps a glass of lemonade.2

2 This is compatible with views on concepts such as that proposed by, e.g., Barsalou et al. (2003), where concepts are multi-modal. On that view, the truth-conditional aspect emerges from the modal simulations it consists of: a visual simulation of grasping, the motor routines connected with carrying out the grasping oneself, and connections with other concepts such as holding a glass of lemonade, wanting, or intending to grasp a glass, moving the hand near the glass, etc.

Recent developments in DRT (Reyle et al., 2007) construct the semantic contribution of a lexical item as a complex entity, a sequence of presuppositions followed by a content, that is, (P_1, …, P_n)C. Integrated meanings are then formed by resolving the presuppositions in the context (or, sometimes, by adding their content to the context if resolution does not work) and adding the content, enriched by binding the presuppositions to their antecedents, to the context. This format and the set-up of the process is motivated by semantic binding that is not syntactically expressed, i.e., due to the presence of presupposition triggers, anaphora, and sub-lexical combination processes (for more on this sub-lexical use, see, e.g., Solstad, 2007). In this chapter, the DRT scheme will be extended to various kinds of binding, including those usually taken to be achieved by compositional
rules, such as the combination of a verb with a subject or of a relative clause with its head noun. This results in a simpler overall system, which moreover allows the treatment of morphology and word order proposed in Chapter 2 Syntax to act as a constraint on all combination processes.

In the approach proposed here, words cue concepts which consist of a content and a sequence of presuppositions. In its most abstract form, interpretation is then a process of selecting concepts cued by words and resolving their presuppositions within contexts.3 Without any further constraints apart from the content attributed to presuppositions as in the DRT proposal, this results in a very large set of interpretations: every word cues sets of concepts and many entities in the context will match the content of their presuppositions. But this is nevertheless the backbone of the algorithm.

Presupposition declarations carry more information than just the content needed to evaluate a match with a contextually given antecedent. They can carry additional semantic information not involved in such matching, the kind of information called "character" in Kaplan (1989), which is needed in order to determine or identify the reference but is not part of the content. They also normally include information as to where the antecedent of the presupposition is to be found in the context. And finally, there is usually a specification of what should be done if the antecedent cannot be found: sometimes the presupposition can be accommodated, sometimes it cannot. The interpretation process may crash (crash); accommodation of the missing object may be forced (accommodate), which leads to a crash if accommodation would make all contexts of the trigger inconsistent; or it may be accepted that the presupposition remains unbound (nonesuch) or receives a standard value (standard).4
3 Contexts are the main topic of Chapter 5 Mental Representation. For the time being, it suffices to assume that logically complex concepts like disjunction, propositional attitude concepts, and speech act concepts can open contexts apart from the main context, contexts which represent the interpreter's information state (or some surrogate of it, as in reading a novel).

4 Zeevat (2009a) proposes an additional category, weak, which allows binding also from suggestions made in the common ground. It can be of the form weak; crash or weak; nonesuch. This new category is needed for particles. For example, indeed weakly presupposes its host and crashes if no suggestion can be found, while only weakly presupposes that an amount exceeding the amount stated in the host is suggested but trivialises when such a suggestion cannot be found. Weak presupposition seems to come out of grammaticalisation processes, and the possibility of trivialisation depends on whether the particle meaning is exhausted by the presupposition. Only still makes the host an answer to a quantity question under trivialisation, while indeed stops having any meaning if the presupposition lacks an antecedent.
These distinctions add up to a classification of presuppositions in the widest sense: having an argument, triggering anaphora, and classical presupposition triggering are all included. Arguments need to be classified as obligatory or optional, while optional arguments need to be divided into anaphoric cases, cases where they are assumed to have a stereotypical binder (Sæbø, 1996), and cases where they can be assumed to be just absent. Anaphora is divided according to the type of the antecedent and the location where it is to be found, with accommodation, stereotypical binding, and trivialisation ruled out.5 Traditional presupposition triggers need to be divided into those that do allow accommodation and those that do not (see Beaver and Zeevat (2006) for a longer discussion). Further subclassification is necessary for including particles, see footnote 4. The classification is based on a number of existing subclassifications of pronouns, presupposition triggers, and arguments, like Sæbø (1996) and Beaver and Zeevat (2006).

5 The exceptions are pronouns like it in English or het in Dutch, which could perhaps be described as having both accommodating (anaphora to non-overt antecedents) and trivialised readings (the epenthetic ones). The matter, however, is not clear.

Presupposition triggers do not always accommodate, and omitted arguments sometimes but not always need anaphoric binding. If they do not need binding, they can be interpreted either as receiving a standard value (for arguments whose existence is entailed by the concept) or as missing (in cases where they are not needed by the concept). Non-accommodating presuppositions (crash) correspond to obligatory arguments and standard pronouns, short names, and other short definites. The standard arguments are optional arguments that are conceptually necessary, while nonesuch arguments are conceptually optional. Accommodate arguments are classical presuppositions as studied by Karttunen, Gazdar, Soames, Heim, or Van der Sandt. The various kinds of optional arguments a concept can take are important in the current setup because they are responsible for coherence effects such as bridging and context restriction. They will, however, be ignored in the examples unless needed, because they would lead to overly large examples.

(79) outlines the basic format for specifying presuppositions.

(79) (What:Condition:Where:Method)
What is a partially instantiated concept that needs to be found in the context. Condition may impose additional semantic conditions on What (character meaning). Where indicates the contextual status of the item that is searched for, such as whether it is general knowledge, common ground
between the speaker and the hearer, given in the linguistic context, in the utterance situation, in the current focus of attention or, finally, in the same clause. Method indicates what should be done when the search fails. Where and Method allow disjunctive specifications (A; B), meaning that A applies, or B applies if A does not work.

To give an example: the contrast, noted in Sæbø (1996), between the two possible ways of filling in the object of 'give'—where the object either needs a resolution (the goal object) or is filled by a standard value (the theme object)—can be captured as in (80). The goal object needs to be specified in the clause or be currently focused in the context, otherwise it leads to a crash. The theme object must be specified, otherwise it is assigned the value standard (if, for example, John gave to the Red Cross something unusual, for example a button, the speaker should say so). Binding to something in the focus of attention does not seem to apply: if that is the intention, the object needs to be specified by a pronoun that is to be resolved to the focus.

(80) John gave three dollars. (X::(clause;focus):crash)
     John gave to the Red Cross. (X::clause:standard)
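A minimal sketch of the declaration format in (79) and of the two 'give' declarations in (80); the field names, the encoding of disjunctive Where specifications as tuples, and the toy context are assumptions made for illustration.

    # Presupposition declarations (What : Condition : Where : Method);
    # disjunctive Where specifications are tried left to right.

    from dataclasses import dataclass

    @dataclass
    class Presupposition:
        what: str                    # partially instantiated concept, e.g. "X"
        condition: str = ""          # character information, e.g. "male(X)"
        where: tuple = ("clause",)   # disjunctive location specification
        method: str = "crash"        # what to do when the search fails

    # (80): the two object slots of 'give'
    goal_object = Presupposition("X", where=("clause", "focus"), method="crash")
    theme_object = Presupposition("X", where=("clause",), method="standard")

    def resolve(p, context):
        """Try the locations left to right; fall back on the declared method."""
        for location in p.where:
            for candidate in context.get(location, []):
                return ("bound", candidate, location)
        return ("failed", p.method, None)

    context = {"clause": [], "focus": ["the Red Cross"]}
    print(resolve(goal_object, context))    # ('bound', 'the Red Cross', 'focus')
    print(resolve(theme_object, context))   # ('failed', 'standard', None)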
At any moment, the context consists of a set of linked concepts divided over various locations (general knowledge, common ground knowledge, the linguistic context, the perceived utterance scene, the current focus, or the current clause) and of the set of currently open contexts determined by earlier interpretation activity, since certain operator concepts can open new contexts.

(81) he (X:male(X):focus:crash)Id
The pronoun has as its content the identity function and a presupposition that, if resolved to its antecedent (required to be male and currently in focus), it denotes whatever the antecedent denotes. If the required antecedent cannot be found, its processing fails (crash).

(82) regrets
     (X:singular(X):(clause;focus):crash,
      C::(clause;focus):crash,
      T:present(T):focus:accommodate,
      fact(C)::clause;common ground:accommodate)
     Regret
The verbal form ‘regrets’ cues a concept of a state in which its subject (possibly ellipsed) is saddened at temporal location T by a fact denoted by
a (possibly ellipsed) object that is either given or accommodated. Regret must also open a context for the interpretation of its object. The logical operator fact(X) is interpreted in Chapter 5 Mental Representation as a predicate that requires its argument X to be part of the context to which fact(X) belongs.

(83) Mary (X:(X,"Mary")Called:commonground:crash)Id
This treatment is based on the semantics of names provided by Geurts (1999), but strengthened by disallowing accommodation, which seems necessary for short and frequently used names. Accommodation is only an intended effect when the name is long or accompanied by an apposition, as in John, a friend of mine or Bill Smith, the mayor. The presupposition is character, not content. Id picks up the referent of its presupposition and makes it its own referent. This combination is required for dealing with cases such as John believes that Mary is ill, which may be true even if John knows Mary only as his friend's girlfriend, but does not know her name.

(84) left (X:singular(X):(clause;focus):crash, T:past(T):(clause;focus):accommodate) Leave
The verbal form left points to a concept that needs a subject and a time. The effect under the algorithm of the utterance He regrets Mary left should be the five representations in (85), assuming that he gets resolved to a representation John, the time of leaving to this morning, and Mary left is accommodated. The semantic effect is the combination of the five updates with the resolved concepts.

(85) a. (John)Id
     b. (Mary)Id
     c. ((Mary)Id ((Today)Morning)Id)Leave
     d. fact(((Mary)Id ((Today)Morning)Id)Leave)
     e. ((John)Id ((Mary)Id ((Today)Morning)Id)Leave fact(((Mary)Id ((Today)Morning)Id)Leave) Now)Regret
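Read procedurally, the derivation of (85) can be sketched as follows; the context, the antecedent for he, and the resolution of the past time to this morning are simply stipulated, as they are in the text.

    # A sketch of how the five updates in (85) come about for
    # "He regrets Mary left"; antecedents and times are stipulated.

    context_focus = {"he": "(John)Id"}        # 'he' resolves to a John-representation

    updates = []
    updates.append(context_focus["he"])       # (85a): resolution of 'he'
    updates.append("(Mary)Id")                # (85b): the name 'Mary'

    time_of_leaving = "((Today)Morning)Id"    # stipulated resolution of the past time
    leave = f"((Mary)Id {time_of_leaving})Leave"
    updates.append(leave)                     # (85c): accommodated 'Mary left'
    updates.append(f"fact({leave})")          # (85d): Regret's factive presupposition
    updates.append(f"({context_focus['he']} {leave} fact({leave}) Now)Regret")  # (85e)

    for label, u in zip("abcde", updates):
        print(f"(85{label}) {u}")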
The main task of the algorithm is to select linked concepts quickly while keeping the number of alternative interpretations low. Its main tool in achieving both of these goals is a calculus of activation, where activation is used in its normal psychological sense, as the degree of prominence in
the brain at a given moment of time. Interpretations with very low activation will not be considered unless there are no better candidates, and very highly activated interpretations will monopolise the activation at the expense of their less activated competitors. This can be captured by two thresholds, High and Low. If High is reached, the activation of an interpretation becomes maximal and competing candidates get activation zero. If Low is reached, the activation becomes zero and the available activation is distributed over the competing candidates that remain. In this way, decisions can be forced and, if needed, the size of the set of candidates for selection can be limited by pushing Low upwards until the set size limit is reached.

To describe the process from a slightly different angle: a word cues a set of concepts with as yet unidentified presuppositions. These concepts get activated based on two factors, namely the strength with which the word cues a particular concept, which is determined by the frequency with which the word is used for the concept, and the concept's prior activation, determined by its frequency in recent use. The various concepts cued by a word compete with each other, i.e., they share a full activation of 1, so that their activation is modified if the activation of any of the competitors changes.

For each of the concepts with activation above Low, links to earlier matching concepts are constructed. These links have their own activation, which is a product of the quality of the match, its syntactic quality, its prior probability in the context, and the activation of the antecedent. Disjunctive specifications of location and method are accounted for in the matching value: links which meet the default specification always have a strong advantage over links allowed in a lower position in the disjunctive specification. This can be accomplished, for example, by halving the possible matching values each time a disjunction is crossed. Presuppositions with a location clause or sentence may have to wait for their binders. This can be accommodated by constructing links to the future and assigning them a standard activation to keep them in the race (earlier antecedents should not have an advantage, but production simulation can still eliminate links to the future or decrease their activation). Such links to the future can already be checked by simulated production. It must therefore also be attempted to match new linked concepts with extant links to the future. Success in this respect increases the activation of the new concepts.

To summarise: the activation of a fully linked concept in a context is the product of the concept activation, the activations of its links, its probability in the context, its simulation value, and its target value. The latter
measures how much the concept is used as a link target of other concepts, in terms of the number of times this happens and the activation of the source concepts. A linked concept competes with other linked concepts for the same word.

The best way to think of the management of activation is to see it as a physical process, somewhat akin to the circuits inside a classical radio or, indeed, to neural nets. New events, like finding a new link or reaching a threshold value, lead to changes of activation all over the place and can bring about new events that lead to new adaptations. Mathematically, each entity's activation is a function of the activation of other entities (and some external values), and the process can be modelled as a recursive update of a spreadsheet where formulas are recomputed until the spreadsheet stabilises.

This is a sketch of the algorithm:
1. Read a new word and find the concepts it may express.
2. Determine the activation of each of the concepts and discard those with activation below Low. Select concepts activated above High. If concepts require it, open auxiliary contexts.
3. For each remaining concept and each of its presuppositions, construct links and determine their activation. Discard links with activation below Low, select links with activation above High. Recompute activation until no more eliminations occur.
4. Determine the activation of the linked concepts in contexts resulting from 3. Discard linked concepts with activation below Low, select those with activation above High. Iterate.
5. If the number of hypotheses goes over the limit, eliminate the needed number of hypotheses with the lowest activation. Recalculate activation, eliminate, and select based on the new values.
6. At the end of the input, repeat 5 until all remaining hypotheses have full activation and close any auxiliary contexts.

Formulas for activation are given below.

Concept Activation

When a new cue is encountered, the set of concepts with which it is associated is entered into the algorithm and the activation of the new concepts before combination is determined. This can be defined as the product of the prior activation of the concept, due to earlier use in the context, and its frequency.
The product has to be normalised so that the sum of the activation of competing concepts for the same word gives 1. This normalisation is invoked by reading a word that has a set of concepts as its possible meaning. One needs to know how often a particular word is used to express a concept (freq_{c,w}) as well as how often and how recently the concept occurred in the preceding context. This gives a measure prioract_c, which is given by a recency-weighted count of prior occurrences in the context. Jointly, this produces the initial assignment of activation of concepts, act(c), as specified in (86).

(86) act(c) = freq_{c,w} · prioract_c / Σ_{c′ a concept for w} freq_{c′,w} · prioract_{c′}
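In code, (86) is a normalised product of word-to-concept frequency and prior activation; the numbers below are invented for illustration and anticipate the worked example (88) further down.

    # (86) as code: normalised product of cue frequency and prior activation.

    def concept_activations(freq, prioract):
        """freq[c]: how often the word expresses concept c;
        prioract[c]: recency-weighted prior occurrences of c in the context."""
        raw = {c: freq[c] * prioract[c] for c in freq}
        total = sum(raw.values())
        return {c: v / total for c, v in raw.items()}

    # The word 'fall' cueing two of its concepts (illustration values):
    freq = {"Fallover": 0.4, "Falldown": 0.6}
    prioract = {"Fallover": 0.2, "Falldown": 1.0}
    print(concept_activations(freq, prioract))
    # {'Fallover': 0.117..., 'Falldown': 0.882...}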
Link Activation

The activation of a link is the product of three factors: the activation of its target, the match of the target with the specification in the source concept, and the syntactic quality of the utterance if the link is assumed. It again needs to be normalised to 1 with respect to the competing links for the same source. This requires a function match(p, φ), which measures how well the linked concept φ meets the specification given by the presupposition declaration p in the concept c. It must evaluate the location and the character, but also possible disjunctions, as indicated below. It also requires syntactic evaluation: if a link is assumed, would the target be where it is and have the morphology it has? This is the function syn(p, φ, u), with p being the presupposition declaration, φ the target, and u the utterance. This results in the following formulas for the activation of a link from p in c to a linked concept φ.

syn(p, φ, u) = p(u | p = φ)
act(c, p, φ) = act(φ) · match(p, φ) · syn(p, φ, u) / Σ_{φ′} act(φ′) · match(p, φ′) · syn(p, φ′, u)
Linked Concept in Context Activation

The activation of a linked concept in a context is the product of five factors. The first three are the ones discussed above: the activation of the concept, the activations of the links of the linked concept, and how well it meets syntax. The new factors are the value of the linked concept as a target for links from other concepts and the prior probability of the concept in the context. The total activation obtained again needs to be normalised to 1 with respect to the competing linked concepts for the same word. The probability of a particular content in the context is estimated by prior(φ, context).
interpretation
117
prior(φ, context) = p(φ | context)
target(φ, context) = 1 + Σ_{ψ ∈ context, ψ links to φ} act(ψ)
syn(φ, u) = p(u | φ)
If a linked concept φ = (ψ_1:p_1 … ψ_n:p_n)c, then the activation of φ can be set as follows:

act(φ, context) = f(φ, context) / Σ_{φ′, context′} f(φ′, context′)

f(φ, context) = act(c) · (Σ_i act(c, p_i, ψ_i) / n) · target(φ, context) · prior(φ, context) · syn(φ, u)
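Putting the definitions together, a sketch of the activation of a linked concept in a context might look as follows; all component values (link activations, target, prior, syn) are placeholders that an implementation would have to estimate from the resources discussed at the end of this section, and the two competing linked concepts anticipate the worked example (88) below.

    # Activation of a linked concept in a context, following f(phi, context).

    def f(concept_act, link_acts, target, prior, syn):
        """act(c) * (average link activation) * target * prior * syn."""
        avg_links = sum(link_acts) / len(link_acts)
        return concept_act * avg_links * target * prior * syn

    def normalise(scores):
        total = sum(scores.values())
        return {k: v / total for k, v in scores.items()}

    # Two competing linked concepts for the same verb (illustration values):
    scores = {
        "Falldown(Tom, tree)": f(0.88, [0.9, 0.7], target=1.0, prior=0.6, syn=1.0),
        "Fallover(Tom)":       f(0.12, [0.9],      target=1.0, prior=0.3, syn=1.0),
    }
    print(normalise(scores))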
This completes the specification of activation. In Chapter 3 Self-Monitoring, attention was drawn to the importance of cues such as the following:

(87) If x is a definite human NP, its chances of being a subject are high. If it is an inanimate indefinite NP, it is very likely an object.
These cues could be incorporated in the form of matching values for subjects and objects, on a par with other selection restrictions which are to be incorporated in matching. It is not clear, however, whether this does not just follow from an appropriate treatment of selection restrictions in matching. In such a treatment, the subject slot would prefer definite human NPs, while the object slot would prefer inanimate indefinites; lower matching values would be assigned to inanimate indefinite subjects and animate definite objects. Actual implementation experience will have to decide on the best option.

Matching
match(c, p, φ) = sel(φ, c, p) · place · method
place = 1 if φ meets the first place choice, otherwise 0.5.
method = 1 if φ is found by the default method, otherwise 0.5.
If selection restrictions give a probability distribution for a presupposition as the probability of a binder with properties P_1, …, P_n, the selection value can be determined by evaluating the distribution on the properties of the link target. The value for a full set of links is the product of the values for each of the links.

A worked example is in order here, but such examples quickly become rather large. Moreover, one has to make up the numbers. The following is a mere illustration.

(88) Tom fell.
The interpretation of (88) starts with (89), which tries to pick up an object in the common ground that is called Tom.
(89) Tom (X:(X,Tom)Called:cg:crash)Id
On encountering the word Tom, links are now created to all common ground representations that are called Tom, with an activation proportional to their recency and frequency. The verb is ambiguous between the concept of falling over from a standing position to a lying position and a concept in which the subject falls down to a low place from a high place (and many other concepts). The difference is coded in the presuppositions associated with the subject and by the two concepts being formally different.

(90) fall (X:(X)Upright:clause:crash)Fallover;
     (X:[(X S)At,(S)High]:clause:crash, S:[(X S)At,(S)High]:clause;cg:acc)Falldown
Links from the first argument position to all available representations of objects in a high place or standing upright are now constructed with activations given by the target. The linked representations evoked by Tom will be among these. There are also links from the second position of the second concept to representations of high places. The initial ordering of the link activations is by the activation of the link target. The matching value evaluates the presupposition associated with the argument place. The syntax value brings in simulated production. A good match would be a representation of an object that is standing upright or of an object that is in a high place. The match is especially good in the latter case if the second argument links to that high place, and less good if the place has to be accommodated. Simulated production forces the first arguments of the verbal concepts to be the representation which is identity linked to Tom. This combination of factors leads to many linked concepts with different activation levels depending on the contextually given Tom-representations and one will win. The hard rule about English subjects eliminates many candidates. Knowledge that Tom is sitting in a tree will eliminate others and prefer resolutions of the source of Falldown to the tree. Contexts that only have Tom lying on the floor in a low place are inappropriate for the utterance and do not lead to an interpretation. Monitoring speakers will not produce the utterance in such contexts. This completes the example. An important question is whether the data for the algorithm can indeed be obtained. In the case of selection restrictions, this is quite clearly the case.
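Returning to (88), the following toy computation (all numbers invented, as the text warns one must do) shows how the factors could separate the two concepts in a context where Tom is represented as sitting in a tree:

```python
# Invented numbers for illustration only: activation, match and syntax scores for the
# two fall-concepts of (90) when the context supplies a Tom-representation in a tree.
candidates = {
    "Fallover": {"act": 0.4, "match": 0.25, "syn": 1.0},  # Tom is not represented as upright
    "Falldown": {"act": 0.4, "match": 0.90, "syn": 1.0},  # the tree resolves the high place
}
raw = {name: v["act"] * v["match"] * v["syn"] for name, v in candidates.items()}
total = sum(raw.values())
print({name: round(score / total, 2) for name, score in raw.items()})
# {'Fallover': 0.22, 'Falldown': 0.78} -> Falldown wins in this context
```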
Word sense disambiguation (Schütze, 1998; Schulte im Walde, 2009) can divide corpus occurrences of word uses according to the word sense (concept) expressed in the occurrence. For this purpose, one needs not just the different concepts a word can express but also specific occurrences of the word expressing that concept in the corpus. This can serve as the basis of frequency estimations and—with some additional work—also of selection restrictions. The prior probability of a word expressing a particular linked concept can be estimated using a number of techniques. These include trying to find predictions from concepts in the pre-text to the current concept (or even to partially linked concepts), that is, estimating the probability of a concept given concepts in the pre-text. For causal data, one could train preconstructed Bayesian nets obtained from the first kind of data by ordering events in the order of cause and effect. In the particular case of predicting a particular speaker in a conversation, one might even use hand-crafted models. Obtaining sets of concepts for words would seem to be the hardest task since it requires cooperation between word sense disambiguation techniques and the emerging theories of lexical semantics. The promise of Chapter 2 Syntax is that production optimality theory offers a good way of testing whether the initial segment of an utterance is correct for a particular hypothesis. That is the case, but one could clearly want more than a yes or no on an initial fragment. It would be useful to also have an estimate of the probability that a link to the future will succeed and a statistical take on the probability of different possible word orders. It would be possible to extend the methods of Chapter 2 Syntax in that direction, but that lies beyond the scope of this book. This section presented an outline of an algorithm that would compute interpretations in linear time, relying on a number of kinds of data that should be empirically obtained.6 It is quite clear, however, that human interpreters have all of these data at their disposal and could use them in emulating the algorithm to quickly and reliably arrive at interpretations of natural language utterances.
6 First implementations of the algorithm by my student Jonathan Malinson are promising, even though they do not yet add up to a full proof of concept.
4.2. Vision and Pragmatics
4.2.1. Vision
The algorithm outlined in the previous section can be adapted for higher level vision. As in language, one has a lexicon of concepts in vision, but these are now invoked by visual cues and instead of simulated utterance production, one simulates the causal effect a hypothesis has on the visual signal. This is normally compared with a camera: a mental camera that turns the hypothesis into a visual signal. More precisely, it gives a probability distribution over signals, which assigns a probability to a given signal under a particular hypothesis. The techniques for this simulation have been investigated in depth in computer vision (and it is a largely solved problem (Forsyth and Ponce, 2009), in this respect comparable to syntactic realisation). In vision, one can use the same estimates of prior probability of a hypothesis as in language and a related format for the concepts, though one may well want to have more concepts.7 The format for presupposition specification, on the other hand, is different from the one used in natural language. If one assumes a horse at a certain position in a scene, that gives directions as to where the head, the legs, the tail, and the body should be found and recognition of any one element determines where the others are likely to be found. Geography is therefore of key importance. Anaphora seems completely absent, though reidentifications are central. Missing essential parts should lead to accommodation if they seem to be hidden from view and to a crash if they are predicted to be visible. The necessary arguments are provided by the requisite visual parts of the perceived object but one would still want to include as many optional arguments as possible in order to obtain a maximally informative and integrated best hypothesis. As noted above, anaphora is not clearly a part of vision but attempts to see a current scene as containing known objects are essential and should be the standard procedure: one should try to see a new visual representation
7 One could have far more colours, shapes, or spatial relations in vision than one has in natural languages. In language, concepts are only lexicalised when the lexicalisation increases the probability of needed coordination; in other words, without expressive pressure, there will be no lexical expressions for many possible concepts. In vision, however, these limitations disappear and the set of concepts can be as fine-grained as visual distinguishability permits.
as denoting objects already given in earlier perception. Any representation should cue a search for earlier material that is identical, as specified in (91).
(91) (Y:Y=X:memory:skip)X
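The core of the adaptation can be sketched as follows (a minimal sketch of my own, assuming a function camera(h, signal) that plays the role of the mental camera, i.e. returns p(signal | h), and a contextual prior(h)):

```python
# Bayesian scoring of visual hypotheses: keep the n best by prior times likelihood.
# Restricting attention to a fixed number of hypotheses preserves linear time overall.
def best_hypotheses(hypotheses, signal, camera, prior, n_best=5):
    ranked = sorted(hypotheses,
                    key=lambda h: prior(h) * camera(h, signal),
                    reverse=True)
    return ranked[:n_best]
```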
If the simulation can be done in linear time, the algorithm is also linear if only a fixed finite number of hypotheses is considered at any one time.
4.2.2. Other Cues
As a matter of fact, the way the algorithm was described above—that is, as a series of selections of linked concepts—describes vision much better than natural language interpretation. Mere acceptance of the linked concepts that are selected in natural language interpretation would be tantamount to always seeing the interlocutor as a reliable source of information who keeps sending words that give rise to true linked concepts. This is not correct, though, because a sequence of words can be a question, an assertion, an order, an impression of what somebody else thinks, and many other things. Moreover, information imparted by an interlocutor can be unreliable and it is up to the interpreter to decide what to do with it. She does not have to automatically accept it. A speaker can mark the information she relays as unreliable and she can and should indicate the different roles of the shared information using markers of questions, imperatives, or attributions to other people. Often, however, there are no overt markers for phenomena such as these. All in all, received verbal material should not be entered as such but as part of a larger representation, the representation of the speaker intention. The hearer may decide to believe the offered information and incorporate it into her information, but this action is not automatic. That is why it is useful to take into account the non-verbal signals that accompany the production of an utterance. Non-verbal signals are signals that are not words but are closely related to the production of verbal input. For example, production of verbal material by a speaker after another speaker is a signal that the new speaker is taking a turn in the conversation. If the turn consists of several sentences, it is a sign that the speaker is organising his turn into separate utterances. If the utterance is complex, it is a sign that it consists of clauses. And clauses have their own internal complexity, which may indicate that the speaker is at some point referring back to an object or an earlier event. Non-verbal signals of this kind form a clear hierarchy. At the top, there is a turn in conversation, where the turn corresponds to a complex speech
act. Turns contain utterances, utterances consist of clauses, and clauses of nominal, verbal, adjectival, and adverbial parts. For our purposes, of particular interest are the two higher levels because they introduce concepts of utterances and concepts of speech acts. Each of the above-mentioned levels comes with its own classification. Turns can be questions, answers, assertions, promises, proposals, requests, acceptances, rejections, jokes, stories, etc. If there is more than one utterance in a turn, these utterances are connected by a rhetorical relation. A later utterance can be a restatement of the first one, an explanation, justification, background, or elaboration of it. It can also perform a joint task with an earlier utterance it relates to, as in a narration, a list, a contrastive or a concessive pair. And finally, it can draw conclusions from earlier utterances or sum them up. Such classifications have been described in enterprises such as rhetorical structure theory (Taboada and Mann, 2006), Structured Discourse Representation Theory (Asher and Lascarides, 2003), and discourse grammar (Scha and Polanyi, 1988). A recent overview is Zeevat (2011). The lower levels introduce quantification and definite reference, which—unlike turns and utterances—are triggered by specific words and constructions. And as it turns out, classifications of the higher levels can be captured directly as if they were words. One can treat them as a set of concepts with a presuppositional prefix. As such, they can be entered directly into the algorithm. In all these cases, the concepts involved need context arguments in the form of sequences of linked concepts. In many cases, a context can be constructed by inserting updates from a whole turn or a whole utterance into the relevant context. To use a computer science metaphor, this is like a temporary reassignment of an output stream of linked concepts. The open turn and utterance concepts receive the context under construction as their argument and the prior probability of the concept applying to the context under construction then contributes to the concept’s activation. In the case of negation, modality, and attitudes, there is a competition between two contexts, namely the embedding context and the scope of the operator. Here, incoming words need to choose a context. In the case of quantification, the choice involves three contexts: the context in which the quantification occurs, the restrictor context, and the scope context. The proposal is to see a turn in conversation and utterances in a turn—in particular their start and end—as separate cues, which introduce concepts that require contexts as arguments. The verbal material is then entered in the context argument. But sometimes, one can—and in some cases must—
go further. For example, interjections8 are separate utterances bringing their own intentions. From here, it is but a small step to start thinking of other constituents as utterances in their own right (e.g., free participial clauses, non-restrictive modifiers, and epitheta). This will not be discussed further here, though it offers an alternative to Potts (2003) for entering material in other contexts.
4.2.3. Pragmatics
In the traditional ‘pipeline’ view, pragmatics is applied to the literal meaning obtained from syntactic parsing by semantic rules. This is the view of Grice (1975), which is the starting point for a large body of formal work on pragmatic interpretation. In the process defined by the algorithm, however, there is no separation between semantic and pragmatic meaning: what emerges from the incremental process is the full meaning, that is, the conversational contribution of the turns and utterances. So far, various pragmatic effects have been incorporated into the theory presented in this chapter. Pronominal resolution, resolution and accommodation of presuppositions, reference by names, reference by deixis, and bridging are all part of the central mechanism assumed here, in which elements from the presuppositional prefix of concepts get linked to given linked concepts. The specification mechanism for concepts is powerful and can be extended to capture other pragmatic effects. For example, it would be almost trivial to use it to develop a theory of domain restriction by letting common nouns be ambiguous between a kind and a pronoun that selects a given set of instances of the kind in the context, with a preference for the second option. It could also easily be used to capture stereotypicality effects by a mechanism that picks standard values for optional manner arguments, so that when the manner is unspecified, it receives a standard value, given the other arguments.9 These phenomena all involve a general preference for resolving to the context over new material. Another important aspect of the algorithm is that it maximises probabilities, in particular in relation to selecting a particular concept for a word
8 Interjections are mid-sentence occurrences of other sentences or sentence fragments. They are one of the targets of Potts (2003)’s account of conventional implicature. 9 In some settings, for example, a standard drink is understood to be alcoholic (at a party or in a pub), while in other cases, it would be some sort of soft drink (e.g. at a child’s birthday party). This would require a context dependent slot for a standard kind.
from a finite set of possibilities, which is a process guided by resolvability and probability. This again is a powerful mechanism for pragmatics. The maximisation of probabilities is quite naturally evoked for those parts of the English language which are particularly ambiguous: noun-noun compounds, prepositions, participial clauses, and genitives. Research in these areas has established that it is possible to think of these phenomena as essentially specifying a long list of possible meanings from which one meaning has to be selected by a combination of probability and contextual fit. The algorithm can do that, given that cues for the lists are provided. A noun-noun compound must be taken as a cue for an additional word, nounnoun, which is then treated like any other word and provides a link between the two nouns. Prepositions are treated as words, participial morphology can also be made into a separate word, and the same holds for genitive morphology. The approach outlined above naturally extends to the problem of recognising the rhetorical relation of a turn to previous conversation and to identifying the mutual relations of utterances in a multi-utterance turn. The idea is to use turns or the utterances themselves as a cue for a rhetorical relation. A discourse relation is then the outcome of selecting a single linked concept from a list of options. This latter kind of concept is then responsible for speech act recognition, speaker intention recognition, and implicatures. The following sections are a first attempt to characterise some additional signals that are invariably realised in verbal communication as another ambiguous cue for concepts.
Turn Start
This is an indication that the speaker will make a contribution to an ongoing dialogue. It can be incorporated into the model by treating it as a single word that should be added to the input stream, dialogue_move, before the words that make up the turn. Just like most other words, this ‘word’, too, is ambiguous and a concept needs to be chosen and resolved. The prediction of the most likely concept is strongly dependent on the context of the dialogue. For example, if someone has just asked a question Who will be there?, the most likely dialogue move would be an answer to that question. A somewhat less likely move is a justified rejection of the request to answer it and even less probable is an attempt to change the topic. Or if somebody just explained how badly some students had behaved, one primarily expects acknowledgments, further questions about the incidents, further examples of bad student behaviour, and attempts at comforting the speaker. Perhaps less likely are attempts at
defending the students, exhortations to the speaker to be less impressed by such student behaviour, and finally, attempts to change the topic. The notions involved can be described on a general level, which is then further refined by a default order on a particular occasion. Question, Assertion, and Turn are abbreviations for full presuppositional specifications that fit these entities in this context. Question should be the question that is currently in focus. Turn points to the current turn, i.e., the context that currently takes the input of the speaker and temporarily replaces the hearer’s information state. Assertion takes the assertion that is currently in focus. And finally, Topic finds the topic question of the last turn, that is, it should find the last turn and extract its topic. Where the specification does not find the indicated antecedent (where there is no focused question, no assertion in focus, no last turn), it should crash. These concepts correspond to a set of dialogue relations, some of which are characterised in (92).
(92) turn answers a given question
turn raises a question
turn acknowledges an assertion (by the interlocutor)
turn elaborates on an assertion
turn corrects an assertion
turn makes a request to the interlocutor
turn solves a problem of the interlocutor
These relations can be schematically stated in our formalism as in (93). In all cases, the turn in question is a turn just made and it is understood as the outcome of interpretation of the verbal input that constituted the turn. The beginning of a turn should specify a turn concept and open a new context into which the turn is projected.
(93) (Question Turn) Answer
(Turn Topic) Question
(Assertion Turn) Agree
(Assertion Turn) Elaborate
(Assertion Turn) Correction
(Turn) Request
(Problem Turn) Solve
The simple mechanism of previous activation by a hearer turn is responsible for a local reordering of the turn concepts. Other mechanisms involved are resolvability and probability: there are antecedents that should be present and the content of a turn must be such that it meets the concept. These mechanisms are exactly the same as for the lexical cues discussed above and the algorithm can deal with them in exactly the same way.
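As a rough sketch of how a turn start could be fed to the algorithm (my own illustration; the pseudo-word dialogue_move and the concept labels below come from the discussion above, while the function names are invented):

```python
# Candidate concepts for the pseudo-word inserted at a turn start, cf. (92)-(93).
TURN_CONCEPTS = ["Answer", "Question", "Agree", "Elaborate", "Correction", "Request", "Solve"]

def start_turn(word_stream):
    # Treat the turn start as one more ambiguous word on the input stream;
    # the interpretation algorithm then resolves it like any lexical cue.
    return ["dialogue_move"] + list(word_stream)

def rank_turn_concepts(context_prior):
    # Reorder the candidates by their prior in the current dialogue context,
    # e.g. Answer comes first when a question is in focus.
    return sorted(TURN_CONCEPTS, key=lambda c: context_prior.get(c, 0.0), reverse=True)
```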
Turn End
The end of a turn, often marked by intonation, is a cue for the closure of open contexts, in particular of the context set up by the turn, and a cue for completing the selection of linked concepts that were set up by the turn.
Utterance Start
The first utterance of a turn starts the turn and instantiates the turn concept. Later utterances within a turn are connected to earlier ones by discourse relations. To classify these relations, one could take as a starting point Mann and Thompson (1988); Hobbs (1985); Scha and Polanyi (1988) or Asher and Lascarides (2003). Here, though, the shorter list of Jasinskaja and Zeevat (2010) is used. Discourse relations such as restatement, elaboration, explanation, background, justification, list, narration, contrast, concession, conclusion, and summary can be captured by concepts, which too are cued by the start of an utterance. Utterance should pick up the current utterance and Pivot an earlier and accessible item in the turn.10
gloss restatement, “=” is the identity relation elaboration on the pivot explanation background justification list narration contrast concession conclusion summary
Some more complicated discourse relations are List, Narration, Concession, Contrast, Summary, and Conclusion. So (Pivot Utterance) List can be seen as an abbreviation for a schema where Minus is that part of Topic which Pivot did not deal with. (95) (Pivot Topic) Address (Pivot Topic) Minus (Utterance (Pivot Topic) Minus) Address 10 This is subject to the right frontier constraint: earlier utterances are not accessible if there is later utterance that already takes them as pivot for the coordinating relations List, Narration, Contrast and Concession.
Narration expresses that an utterance continues a Story,11 to which Pivot was the last contribution. This can be captured as in (96). After creates the rest of the story, the one that starts after the Pivot.
(96) (Pivot Story) Address
(Pivot Story) After
(Utterance (Pivot Story) After) Address
Contrast expresses that the topic is a double wh-question Topic, to which the two related utterances (Pivot and Utterance) supply doubly distinct answers. This gives (97), where the Address concept can be glossed as X answers the question Y by answers A1, …, An. For example, John had pasta addresses Who ate what? by John and pasta.
(97) (Pivot Topic Answer1 Answer2) Address
(Utterance Topic Answer3 Answer4) Address
(Answer1 Answer3) Distinct
(Answer2 Answer4) Distinct
Concession expresses that the Pivot concedes part of an Issue, while the Utterance denies the rest.
(98) (Issue Pivot) Entail
(Utterance (Issue) Not) Entail
Summary and Conclusion should take the current context Turn and demand that it supports the Utterance. This can be done as in (99a–b).
(99) a. (Turn Utterance) Argument
b. (Turn Utterance) Entail
All of these set up a local context for an utterance, a context which temporarily replaces the turn context.
Utterance End
The end of an utterance is a cue that local auxiliary contexts opened during the utterance need to be closed and a decision needs to be taken. The last function could perhaps also be assigned to the end of a phonological phrase.
11 Stories are like topics but differ from topics by not being questions. They describe complex events with an internal temporal ordering.
Interjection Start
The start of an interjection is similar to the start of an utterance. One needs to link the interjected material to the interrupted utterance and another occurrence of “rhetorical_relation” seems the correct way to do that.
Interjection End
Closure of the interjection context.
Clause Start
Much the same holds for clauses as for utterances, even though in clauses one often finds a marking of the rhetorical relation in which they stand with respect to the matrix. One way of making the specification easier is to assume that clauses—and perhaps even adjectival phrases and other modifiers—set up a context of their own and that that context is closed at the end of the clause or the phrase.
Clause End
This should be taken as the point where contexts started by operators in the clause are closed. For these lower levels, syntactic considerations are closely tied to the recognition of non-verbal signals and often can be derived from lexical and constructional cues. Problematic are cases where this is not so. In such cases, the algorithm will come up with a nucleus for the utterance that will be integrated into the utterance concept and create independent satellites, which will not be integrated into the utterance concept or the nucleus. Such a situation should be taken as a cue that the satellite is a clause, which evokes a set of discourse relation concepts that link the nucleus and the clause. A parallel with vision is useful here: one would like to prevent a situation where there are two unconnected conceptualisations of parts of a visual scene.
Particles and Conjunctions
In some work on particles (Zeevat, 2009a) it is claimed that ideally, a particle should be semantically characterised as being an assertion of its own. In this approach, for example John had only three apples comes out as a combination of two assertions as in (100).
(100) John had three apples. It is surprising that he had so few.
The two statements are not independent. Interpretation of the latter assertion requires that the first sentence should state how many apples John had, while the second sentence refers to the number of apples John had and presupposes a conflicting expectation that John had more. The influence of the second assertion on the first can be expressed as (Question Utterance) Answer with an additional requirement that the Question should ask for a quantity and that Utterance be identified with the first assertion. Under an appropriate resolution, the quantity is then identified with three and the question with How many apples did John have? The projection of the second statement originates in the particle, whose contribution can be stated as (101). It weakly presupposes a representation A that specifies a quantity in answer to the question that exceeds the quantity at which the current utterance puts it. Trivialisation of the weak presupposition trivialises the predication but continues to presuppose that the utterance exhaustively answers a quantity question: the semantic contribution of only that is assumed by, e.g., Rooth (1992).
(101) (A (Question A) Answer (Question Utterance) Answer) Exceed
This treatment extends to other particles and conjunctions. If and is indeed an additive particle as Jasinskaja and Zeevat (2010) claim, its semantic contribution is rather like a discourse relation. It presupposes a question and an answer to it that is distinct from the utterance and states that the utterance answers it. (102) (Utterance (Otheranswer Question) Address (Otheranswer Utterance)Distinct Question) Address
In John had pizza and Mary had spaghetti, Utterance is Mary had spaghetti, Otheranswer is John had pizza, and Question is forced to be Who ate what?. In Because Mary wanted it, the conjunction because introduces a concept (103) which forces resolutions and turns the utterance into an explanation. (103) (Utterance Pivot) Cause
These remarks, though much too short, should establish that particles can introduce one of the concepts invoked by turn and utterance starts and force a selection of that concept.
Intonation
Although intonation has been studied intensively over the past twenty years, no standard proposal has emerged as yet. Even so, it is safe to
say that there are at least four intonational categories that need to be taken into account. These are the rising or falling tone at the end of a sentence and intonational prominence of one or more constituents, with a special role for the twiddle (a combination of rising, falling, and rising again).12 Rising tone is characteristic of a speaker who leaves the decision up to the hearer, for example in questions or assertions with respect to which the speaker is uncertain. This can be described as a speaker asking a question or asking for confirmation or denial of the assertion or proposal she is making. It opens the way to Turn concepts, which decide about the function of the turn or the utterance: it can be a low confidence statement, a proper yes-no question, or an assertion with a request for confirmation.
(104) (Speaker Utterance)Belief (intonational down-toning)
(Utterance)Yesnoquestion
(Utterance)Confirmationquery (intonational tag)
For intonational prominence, Rooth (1992) proposes two interpretations: contrast and focus. The first can be seen as a presupposition trigger, whereby the prominent phrase has a parallel element from which it needs to be distinguished. In (105), on the other hand, XICAN should function as another nationality concept, distinct from American, explaining the intonation pattern in: An AMERICAN farmer met a CANADIAN farmer. (105) AMERICAN (XICAN::focus:skip, (XICAN, American) Distinct) American
The other possibility discussed by Rooth makes intonational prominence a marker of focus. This can be straightforwardly captured as in (106), where Question and Focus must be constrained to resolve to the current utterance. (106) focus (Question Focus) Answer
It is important to integrate the intonational cue smoothly with the verbal cues. In (106) this can be arranged by specifying Question in such a way that it has to resolve to the non-focused material in the sentence. This way of integrating intonational meaning is much simpler than Rooth’s proposal. By combining this treatment of intonationally marked focus with the treatment sketched for only, one gets three cues that have to be integrated, namely the utterance cue that allows the utterance to answer a quantity
12 Pierrehumbert and Hirschberg (1990).
question, only, which presupposes an expectation of a higher value to the question, and the intonation, which selects the same utterance concept.
Facial Expression
It may well be possible to classify facial expressions in such a way that they would form a framework of new conceptual descriptions of ongoing utterances and turns. Facial expressions of surprise, anger, or other emotions can be expressed by concepts as in (107).
(107) (Content Speaker) Surprise
(Content Speaker) Anger
In this case, Content needs to be resolved to the content of the utterance.
Deictic and Other Gestures
Gestures are still understudied. Especially important (also on anthropological grounds) are gestures such as pointing with a finger or gaze fixation, iconic gestures, and size indications. They would fit in in much the same way as, say, intonation but crucially also as material that has to be integrated in the stream of verbal material as in example (108), where a height-indicating gesture is synchronised with the adjective tall.
(108) He is tall.
(Gesture) Height
((He) Height (Gesture) Height) =
4.2.4. Clark Buys Some Nails
Integration with vision and gestures is illustrated in the next example, simplified from Clark (1996). (109) presents the scenario.
(109) Clark buys some nails
Enters
Selects nails and puts them on counter.
“Hello there. These please.”
“Hi. 1.25.”
Puts $ 1.25 on counter.
“Thank you. Bye.”
“Bye Bye.”
Leaves shop.
The example illustrates an important point of Clark. Verbal exchange is only a small part of the action in the normal transactions in which verbal
utterances play a role. As this example shows, limiting oneself to the verbal exchange does not lead to an intelligible analysis of what is going on. The example illustrates how linguistic and visual cues can be combined with conceptual reasoning to give an intelligible picture of an everyday event, something that is out of reach with an isolated linguistic module. The proposed representation format, however, can deal with integrating verbal and non-verbal events. Apart from the common sense concepts, there are a number of more technical concepts: Joint for contexts that are common ground, (X Y)Reason for an attributed motive X behind an action Y, and the propositional attitude Want. (111) traces the development of the shop attendant’s thoughts through the exchange. The reasoning involves a schema of Buy that allows the inference to (Buyer Seller Thing)Buy if the context meets the form specified in (110).
(110) ([Thing, Buyer, Seller,
(Buyer, (Buyer, Thing) Possess) Want,
(Seller, Thing) Possess
(Thing, Price) PriceOf
(Buyer, Price, Seller) Pay
(Buyer, Thing) Obtain]) Joint
The reasoning scheme can be seen as a cue axiom for the Buy concept, but one that gives the necessary and sufficient conditions.
(111) Man
(Self Man) See
(Man Shop) In
(Man Shop) Enter
(Self [(Man Shop) Enter]) See
([(Man, Thing, [(Man Thing) Buy]) Want], [(Man Shop) Enter]) Reason
([(Man, Thing, [(Man Thing) Buy]) Want]) Must
([(Man, Thing, [(Man Thing) Buy]) Want]) Joint
(Man “Hello there. These please”) Utter
(Man Self) Greet
([(Man Self) Greet]) Joint
Nails
((Man, Nails) Indicate) Joint
Nails = Thing
([((Man Nails) Buy (Man “These please”)Utter) Request]) Joint
(Nails, $ 1.25) PriceOf
(Self “1.25”) Utter
([(Nails, $ 1.25) PriceOf]) Joint
(Man, $ 1.25, Counter) Put
([(Man, $ 1.25, Counter) Put]) Joint
(Man Nails) Buy
([(Man Nails) Buy]) Joint
(Self Man) Thank
(Self Man, “Thanks”) Utter
([(Self, Man) Thank]) Joint
(Self Man) Greet
(Self Man “Bye Bye”) Utter
([(Self, Man) Greet]) Joint
(Self Man) Greet
(Man “Bye bye”) Utter
(Man Shop) Leave
From the perspective of Clark or an observer, the results would not be very different.
4.2.5. Scalar Implicatures
Links to the future can be compared with activated and exhaustively interpreted wh-questions. They are both waiting for a binder: in one case, it is the expected argument, in the other case, the answer. Formally, there is no difference between the two apart from the issue of where the link to the future is satisfied (either in the clause or in a new utterance in the indefinite future) and the related issue of what to do if the required link cannot be found. A missing post-verbal obligatory object leads to a crash at the end of the clause, while an activated wh-question can wait indefinitely. If a link to the future is bound, an identity is created between a local referent and the link target, and that is enough for obtaining the exhaustivity effect, if indeed the question is asking for an exhaustive answer. A missing part of the solution is where the wh-questions come from. One source of wh-questions is the questions under discussion in a dialogue. Questions, however, also often arise from natural processes that are part of verbal communication. One such source consists of missing optional arguments which are nonetheless semantically filled by the concept itself and obtain a standard value. They are evoked by a natural question the interpreter has to ask in the course of interpretation: What binds this role? For example: What caused John’s anger? Where did his car break down? Another source of wh-questions is the questions arising in high-level production (and in the interpreter’s reconstruction of that process). In example (112), in order for B to overanswer, B must change the overt question to What animals does John have on his farm?, which leads to scalar implicatures such as John having no cows on his farm. But that question is again
overanswered by changing it to a question of how many sheep and geese he has, which gives rise to further scalar implicatures.
(112) A: How is John’s farm?
B: He has three sheep and a couple of geese.
a. Does John have animals on his farm?
b. Which animals does he have on his farm?
c. How many sheep does he have on his farm?
d. How many geese does he have on his farm?
These questions arise because of the formulations chosen, and the fact that they arose can be confirmed by the scalar implicatures that result.
4.2.6. Relevance Implicatures
Relevance implicatures are bridging inferences based on the assumption that the current contribution achieves some public goal. In this sense, quantity implicatures as defined just now are also relevance implicatures, because they presuppose that the utterance exhaustively answers some public wh-question. Other relevance questions come from various sources. One source consists of explicit wishes and goals, which can be seen as links to the future awaiting plans to fulfil the wish or reach the goal. But goals and wishes can also be implicit, as in Grice’s example below, and exploit people’s standard reactions to certain situations such as cars breaking down, nice food and drink being available, or danger arising. Grice’s example is presented in (113).
(113) Driver next to his car to a local resident:
Driver: I am out of petrol.
Local: There is a garage around the corner.
The driver’s utterance is interpreted as (114). ((X)Not is the negation operator of Chapter 5 Mental Representation.) (114) (Driver ([(Driver Car)Drive])Not ([(Car Petrol)Have])Not) Explain
This is reinterpreted as the driver’s problem (world knowledge) as in (115). (115) (Driver ([(Car Petrol)Have])Not) Problem
The local’s utterance is interpreted as offering a solution to the problem as in (116). (116) ((Garage Corner)Around, ([(Car Petrol)Have])Not)Solve
From this, the driver and the audience can infer (117).
(117) ([(Driver Garage Petrol)Obtain])Able
(117) entails that the garage sells petrol and is open, or at least this is the most likely option. The solution uses the driver’s implicit wish (a problem is a state one wishes to stop obtaining) to put plans for the fulfilment of the wish on the table, which in turn makes it the case that the local’s contribution is interpreted as part of a workable plan for the fulfilment of the wish, using knowledge about garages. The relevance implicature comes out as a condition under which information provided by the local is a solution to the problem under the further supposition that the local is assumed to know whether (117).
4.3. Conclusion
This chapter presented a linear and incremental algorithm for natural language interpretation that is based on conceptual knowledge, world knowledge, knowledge about communication, and on simulated production. Both the algorithm and the format of the conceptual knowledge it uses are closely related to categorial grammar and stochastic parsing formalisms. Unlike in categorial grammar, however, what is combined here are not the words but the concepts expressed by them and the reductions (link constructions) are not limited to the current utterance. Rather, the algorithm allows link constructions to contextually given concepts and concepts cued by turns, utterances, interjections, intonation, gestures, and facial expressions. Another point where this proposal differs from categorial grammar is that link construction is not automatically limited to the functional application of meanings. That makes it different from the otherwise rather similar semantic categorial grammar which Frege and Husserl developed in their correspondence (Casadio, 1988), where Fregean meanings apply to each other. In the current proposal, link construction merely integrates concepts with their context. Link constructions are not restricted by adjacency (or its combinatorial extensions): they are in principle free. In the formulation given in this chapter, the only hard restrictions are locational restrictions on pronouns and type matching. That, however, only masks the fact that the grammar’s main goal is to select between various readings of words (turns, utterances, interjections, gestures, intonational contours, or facial expressions) based on their probability in the context, the way they can be integrated with other elements, and their quality as explanations of the relevant utterance.
Compared to interpretation by abduction as proposed by Hobbs et al. (1990), this chapter presents a more radical but still abductive proposal. Where Hobbs takes on board a full-fledged Aristotelian grammar, the proposal of this chapter uses abduction all the way. In the theoretical model of interpretation, a minimal model of grammar is used only for word order and morphology, while a naturalistic treatment of interpretation lets speakers use their production abilities directly in understanding by simulating the utterance production. Another difference from Hobbs is in the nature of the abductive axioms. The proposal here is to limit the axioms to a finite number of competing concepts for a cue. The choice is made by their probability as estimated locally using Bayesian interpretation. In principle, this makes the current proposal more restricted.
Jacobson’s Principle
The principle of compositionality has often been thought to be the fundamental principle of empirical natural language semantics. It has also been successfully used in criticising various semantical hypotheses. From the perspective of this book, however, it becomes clear that an argument in favour of the standard formulation of the compositionality principle (as in Janssen (1984)) crucially depends on a view which universally attributes constituent structure to human languages. That is not at all an irrational assumption—though it has its problems (Finnish (Karttunen, 1989) or Ovidian Latin are problematic for constituent structure)—but in the light of, for example, issues discussed in Chapter 2 Syntax, it cannot be taken as a self-evident truth. Constituent structure should not be part of semantic methodology and therefore, the standard interpretation of the compositionality principle should not be included either. There are two points that should allay worries about not following the compositionality principle in full. The first is that natural language semantics is perhaps not fulfilling its promise partly because of its strict adherence to the Montagovian methodology. What seems to follow from the account of syntax and semantic interpretation presented in this book is that there is nothing problematic about sources of semantic material which come from outside syntax, about multiple marking of a semantic feature, or about imperfect linguistic marking of semantic distinctions, phenomena that tend to considerably complicate compositional analyses. The rule-to-rule hypothesis bars the semanticist from the use of any operator or argument which cannot be attributed to a word or a syntactic combination rule, it has a problem with multiple expressions of the same
semantic feature, and it needs to resort to tricks where the expression of a semantic feature is imperfect. In these respects, the rule-to-rule hypothesis is an obstacle to semantic progress. Moreover, the rule-to-rule hypothesis is an approximation of what one might call Jacobson’s principle, which defines more closely the proper practice in linguistic semantics of paying due attention to linguistic form.
(118) Marked syntactic and lexical structure needs a semantic explanation.
This disallows assigning a meaning to an expression X that is identical to the meaning of a less marked variant Y in the context. Markedness can be defined by syntactic and morphological complexity (or even by frequency), and sameness of meaning by the semantic theory one is using. The situation that is excluded is the insertion of additional structure that has no semantic or interpretational explanation. Example (119) illustrates the principle.13
(119) If you had come, you would have noticed.
Explanations are needed for the past tense, the if, the would, and the perfect, since if is marked with respect to the veridical when, the past and perfect with respect to present imperfect, the would with respect to its absence. Explanations of the various additional features can overlap but must not make the additional feature superfluous. This way of ensuring that a semantic analysis takes the formal differences in syntax and lexical structure into account seems preferable to the rule-to-rule hypothesis, where there is always the option of assigning to the additional structure a trivial semantics—and that is precisely what Jacobson’s principle rules out. The semantic explanation can take the direction of assuming extra conceptual content or the direction of monitoring: the additional structure excludes an unwanted reading.
Lexicon
Perhaps the most important resource for the algorithm is a lexicon of constructions and their conceptual representations. Traditional dictionaries assume that words and multi-word expressions have a certain set of readings. While that view cannot be completely disregarded, it seems more appropriate to think rather in terms of meanings constructed in a particular context. The view that meanings—and combinations of meanings—are
13 Discussion with Katrin Schulz inspired this example and the discussion that follows.
built from “semantic microfeatures” (minimal conceptual building blocks) has been articulated by Smolensky (1991) and Hogeweg (2009), and is implicit in the work of Bowerman and Choi (2003). The proposal of Hogeweg (2009)—a further development of Smolensky (1991)—is particularly relevant here. For these authors, the association of words with microfeatures is a question of associative learning, which makes it the case that a word will always evoke all microfeatures associated with it and will normally overspecify its meaning in a particular context. The selection of a subset of microfeatures through contextual fit then creates the meaning in the context. The approach has mainly been applied to prepositional meaning (in and its Korean correlates in Bowerman and Choi) and to particle meaning (the Dutch particle wel in Hogeweg). It is clear that if this approach were indeed feasible for larger parts of the lexicon, concept construction would offer a more efficient approach to the selection of linked concepts. On the other hand, a full implementation of this approach beyond its successful application to certain particles and prepositions is still missing and it is not clear yet whether it can be extended to, for example, verbs.
Aristotelian Grammar
The formalism of Chapter 2 Syntax could be combined with the formal methods presented in this chapter to produce something resembling an Aristotelian grammar. It would consist of a list of cues including their strengths, a lexicon of the format given in this chapter, and a distribution p(φ|C) assigning probabilities to linked concepts in contexts which allow them. Well-formedness of a form-meaning pair in a context can be defined by a combination of syntactic correctness and interpretability. F is well-formed in C iff F is optimal for I and I is the outcome of the interpretation algorithm for F in C. Since there is no assumption that optimality theory can ‘see’ the interpretation, the formal backbone has to come from the interpretation algorithm. The formal backbone of the algorithm is a free categorial grammar which reduces contexts by reducing unsaturated elements in the context under construction. Link constructions have to be approved by simulated production, which means that optimality-theoretic syntax provides the linear and morphological restrictions on link constructions in standard categorial grammars. Moreover, the algorithm is committed to an n-best incremental regime guided by the probability model. Its aim is to find an interpretation
for which the product of the prior and production probabilities is maximal, where both probabilities take into account the relevant context. The restriction to a context is essential. One could define syntactically well-formed utterances and turns by existentially quantifying over contexts and interpretations. One could also define possible meanings of a well-formed sentence by existentially quantifying over contexts. These notions would, however, be mere abstractions that do not have a naturalistic interpretation in production and interpretation. Robustness results from not paying undue respect to well-formedness and this is unproblematic since the interpretation algorithm recognises far more than just well-formed structures. The possible meanings assigned by the grammar are also not relevant in interpretation because a context will normally bias the speaker and hearer to just a subset of the meanings. Most importantly perhaps, the probabilistic models involved are far from eternal. They change along with language use and its determinants such as culture, politics, and history. This dependency seems to undermine the formal nature of notions such as a well-formed sentence of English or the meaning of a well-formed expression of English.
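The n-best incremental regime just described can be sketched as follows (a hedged illustration, not the book's implementation; expand, prior and production stand for the candidate generator and the two probability models assumed in the text):

```python
# Keep, after each word, the n best partial interpretations by the product of
# the contextual prior and the simulated-production probability.
def incremental_interpret(words, expand, prior, production, context, n_best=10):
    beam = [()]  # start from the empty interpretation
    for word in words:
        candidates = [ext for hyp in beam for ext in expand(hyp, word, context)]
        candidates.sort(key=lambda hyp: prior(hyp, context) * production(hyp, context),
                        reverse=True)
        beam = candidates[:n_best]
    return beam[0] if beam else None
```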
chapter five
MENTAL REPRESENTATION
Chapter 4 Interpretation gives outputs that are new linked contextualised concepts within a given context that represents the interpreter’s information state or the common ground between the speaker and the interpreter from the perspective of the interpreter. This chapter claims that the resulting contexts can be seen as expressions of a logical representation language with a proper model theory that has an interpretation as an update semantics and that the logical representation language should be taken seriously as a proposal in natural language semantics. It is a formalisation of traditional mental representation (as an update semantics) and it has various advantages over its closest competitor: discourse representation structures. This chapter presents a logical treatment of the outputs of the algorithm of Chapter 4 Interpretation (sections 5.1 and 5.2), discusses its philosophical interpretation as an account of the mental representations assumed by philosophers from Aristotle to Husserl (section 5.3), and tries to show that the structures as (re)developed in this chapter have important advantages in the treatment of belief attributions (section 5.4) and the theory of definite expressions (section 5.5), as well as in the treatment of quantifier scope ambiguities, in combination with the interpretation algorithm of the last chapter (section 5.6). Those readers who merely want a better understanding of the outputs of Chapter 4 Interpretation should find enough by reading section 5.1, with an optional look at section 5.2. Section 5.3 shows that the outputs of the interpretation algorithm—or the more conventional structures developed in sections 5.1 and 5.2—formalise the traditional notion of mental representation. Section 5.4 develops a theory of intersubjective agreement about content within the formalism, based on a semantics for belief attributions. The final section 5.6 develops an account of quantifier scope ambiguities and compares the formalism with Discourse Representation Theory (Kamp and Reyle, 1993, DRT). The current chapter and the previous one are the outcome of a failed attempt to develop a new Bayesian development algorithm for DRSs from input utterances. The advantages over DRT (sections 5.4, 5.5 and 5.6) as well as the relation to classical representations were a surprise. The enterprise did not fail as such: one can construct DRSs from the contexts of the
representations developed in this chapter, but useful information is lost in translation and with that information, the advantages over DRT disappear as well.
5.1. From Links to Representation Structures
The aim of this chapter is to show that multisets of contextualised linked concepts can be seen as logical representations of contexts and the content of utterances. They are logical representations, since they have truth conditions, i.e. they can determine a set of mathematical models with respect to which they are true and a set of mathematical models with respect to which they are false. One can give an account of their truth and falsity as such in terms of the model-theoretic notion, but this will be considerably more problematic. The model theory allows a relation of consequence between multisets of contextualised linked concepts: C ⊧ C’ iff ∀w(w ⊧ C ⇒ w ⊧ C’) and, in consequence, one has a logic for the multisets of linked concepts. It simplifies the definitions, however, to get rid of the links and the context labels in the formalism in favour of a more conventional syntax. (120) should end up being represented by linked contextualised concepts corresponding to each of the words in the way indicated in (121).
(120) John told a girl that he loved her.
We ignore the temporal aspect in this presentation.
(121) John: a concept identity-linked to a contextually given concept denoting John, context 0
Tell: a concept linked to John, Girl and Proposition, context 0
A: a marker on Girl, context 0
Girl: a concept satisfied by individual girls, context 0
That: the context 1: Proposition set up by the object of Told
He: a concept identity-linked to John, context 0
Love: linked to He and Her, context 1
Her: identity-linked to Girl, context 0
A simplification of the notation is obtained by going back to the presupposition prefixes of Chapter 4 Interpretation, replacing all the presuppositions by the targets of the links constructed for the presuppositions in the prefix, and replacing the identity-linked concepts by the targets of their links. The identity-linked concepts themselves can then be omitted from the multiset. This gives (122) for the complement.
(122) (John Girl)Love
The next step is to collect all members of contexts within square brackets: the multisets are represented by sequences. This gives a single representation (123b) for the complement that can be entered into the prefix of Tell as in (123c). That is part of the whole context derived from (121), as shown in (124).
(123) a. Girl
b. [(John Girl) Love]
c. (John Girl [(John Girl)Love]) Tell
We will, however, also use mnemonic names for concepts that are given in the context. For example, John is used in the notation as a mnemonic name for whatever representation in the given context it is linked to. The name John does not stand for a concept; it is a referential device for re-identifying a given object in the pre-context that uses the property of being named “John”.
The final notational step is to distinguish between different instances of the same concept by the means of indices. This is necessary only for nondefinite representations. In the current example, there are two indefinite representations: Girl and Tell with the indicated arguments (it is just possible to tell the same girl twice at the same time that one loves her, for example, by holding up a board with the text “I love you” while pronouncing those same words, but—in contrast—one cannot love a person twice at the same time). (125) Girli , (John Girli [(John Girli )Love]) Tellj
Given that there is not more than one occurrence of Girl and Tell outside a prefix, the indices can be omitted. The informal notation as in (124) therefore already gives all the information that is in the set of linked contextualised concepts under these conventions. Indices are necessary when John kissed a girl and another girl as in (126). The fact that the kisses are distinct follows from the fact that their objects are distinct. (126) Girl1 , Girl2 , (John Girl1 )Kiss (John Girl2 )Kiss
The above establishes a notation for the multisets of linked contextualised structures of Chapter 4 Interpretation which allows a formal definition, as follows.
144
chapter five
Let L be a set of concept labels c with a signature σ such that σ(c) is a pair of natural numbers ⟨m, n⟩. Let I be a set of indices.
1. If c ∈ L has a signature ⟨m, n⟩ and φ1 …φn are representations and C1 …Cm are contexts, and i is an index then (φ1 , …φm , C1 , …, Cn )ci is a representation. 2. If r1 , …, rn are representations, then the sequence [r1 , …, rn ] is a context.
The fragment built by concepts with signature ⟨m, 0⟩ will be considered first. The interesting contexts are the proper ones, those that are both consistent and complete. Completeness is the property which guarantees that the context as such can be true or false since all argument representations are defined. Incomplete contexts have representations with argument representations that do not have a value in the context, so that the representation does not have a truth value. It is the occurrence of a representation as a member of the context that will assign a value to that representation. Another way of describing this is: one can only say something about things that are already given. A complete context follows the rule that no representations are added at its ith position which apply a predicate to representations that are not already given in the sense of being among the first i representations of that context. As a consequence, complete contexts represented by sequences always have to start with representations made from concepts with signature ⟨0, 0⟩. Incomplete contexts can, however, be made into complete contexts by “rolling them out”, i.e., by adding all the argument representations in the prefix of the ith representation ri of C to the initial segment of C ending at ri . Let C be a context [r1 , …, rn ]. Then C is complete iff for all i ≤ n if ri = (φ1 , …, φm )cl then for all j ≤ m φj ∈ {r1 , …, ri−1 } Let C be an incomplete context [r1 , …, rn ] and let ri be the first representation that has a first prefix element φ not occurring in [r1 , …, ri−1 ]. The roll-out roll(C) of C is then defined as roll([r1 , …, ri−1 ] ∘ [φ] ∘ [ri , …, rn ]). If C is complete, roll(C) = C. A possibility w for a complete context C is a pair ⟨Uw , Fw ⟩ where Uw is a nonempty set and Fw is a function from the elements of C into Uw and from the concepts of the language L with signature ⟨m, 0⟩ into m + 1-place relations over Uw such that for all representations φ of C if φ = (ψ1 , …, ψn )cl ⟨Fw (φ), Fw (ψ1 ), …, Fw (ψn )⟩ ∈ Fw (c). Concepts c with signature ⟨m, 0⟩ denote m + 1-place relations over a domain. A representation r = (r1 , …, rm )ck based on c is satisfied by a
possibility w and an object a, if for 1 ≤ j ≤ m, w assigns an object uj to rj in the prefix and ⟨a, u1, …, um⟩ ∈ Fw(c). In a possibility w, Fw(r) must satisfy r. This allows r both to be true in a possibility and to denote an object at the same time. The two properties are connected: r is true in a possibility w iff Fw(r) satisfies r on w. For example, a is a proper value for Fw(Girlj) iff a ∈ Fw(Girl) iff a is a girl according to w. A state1 s is a proper value for Fw((X Y)Lovev) iff ⟨s, Fw(X), Fw(Y)⟩ ∈ Fw(Love) iff according to w, s is the state of Fw(X) loving Fw(Y). If a complete context has a possibility it is consistent and therefore proper. Incomplete contexts can be considered to be consistent if their roll-out is consistent.

Identity logic can be added to the formalism by introducing 1- and 2-place identity Id and = and by having functional concepts (as a subset of the set of concepts). σ(Id) = ⟨1, 0⟩, σ(=) = ⟨2, 0⟩. The first, Id, is the identity function used for identity linking; = is the identity relation. The choice for interpreting = as the set of identity triples is just for convenience (identity statements could also denote states).

Fw(Id) = {⟨u, u⟩ : u ∈ Uw}
Fw(=) = {⟨u, u, u⟩ : u ∈ Uw}
c is functional in w iff ∀u1, …, un ∈ Uw ∃a ∈ Uw ∀v ∈ Uw (⟨v, u1, …, un⟩ ∈ Fw(c) ⇒ v = a)
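The completeness condition and the roll-out operation above have a direct procedural reading. The sketch below continues the illustrative Python encoding introduced earlier (helper names are again ours, not the book's): it checks that every argument of a representation occurs earlier in the context, and otherwise inserts the first missing argument in front of the representation that needs it, a simplified iterative rendering of roll(C) that assumes finite, acyclic representation structures.

def is_complete(context):
    # [r1, ..., rn] is complete iff every argument of ri occurs among r1, ..., r(i-1).
    seen = []
    for r in context:
        if any(arg not in seen for arg in r.args):
            return False
        seen.append(r)
    return True

def roll_out(context):
    # Insert missing argument representations in front of the representation
    # that needs them until the context is complete.
    context = list(context)
    i = 0
    while i < len(context):
        missing = [arg for arg in context[i].args if arg not in context[:i]]
        if missing:
            context.insert(i, missing[0])   # re-check the same position next round
        else:
            i += 1
    return context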
The above exhausts what can be said about the formalism with only concepts with signature ⟨n, 0⟩. It is an important fragment since it suffices for vision and a large fragment of natural language. In the rest of this section, we will discuss the semantics of the predicate Tell from our example and introduce the auxiliary notions needed. The concept Tell has signature ⟨2, 1⟩ and therefore has a context argument, which puts it outside the basic fragment. Proper contexts can entail each other, as defined in (127).

(127) C ⊧ D iff ∀w(w ⊧ C ⇒ w ⊧ D)
Two contexts can also have the property of being identical with respect to another context. For this, identity between representations is needed: (128). (128) C ⊧ φ = ψ
1 A state is the kind of temporality associated with stative verbs like love; an event is the temporality associated with achievement and accomplishment verbs like climb, tell or hit. These distinctions do not play a role until one looks at the complexities of tense and aspect semantics.
While this looks like an extensional identity, it is not: it requires φ and ψ to denote the same objects in all possibilities allowed by C. It is their intensional identity modulo C. Identity between representations is the basis for identity between contexts with respect to a context C, as defined in (129). (129) C ⊧ D = E iff ∀φ ∈ D ∃ψ ∈ E C ⊧ φ = ψ ∧ ∀φ ∈ E ∃ψ ∈ D C ⊧ φ = ψ
The definition requires that all elements of D and E have an intensionally identical element in the other context. With this notion, we can embark on the semantics of Tell. σ(Tell) = ⟨2, 1⟩ in our example. Tell refers to an utterance event in which a proper pointed context ⟨X, D⟩ was communicated from the subject to the indirect object. Here X must be the representation of the agent of the utterance within D and D is what the agent wanted to say in the utterance, from the perspective of the agent. D is not necessarily consistent with the given context C. For example, John was away but Charles told me he was at home. In a context C, this would mean that the content [(John Girl)Love] of the complement must be true on D. But—assuming that John did not properly introduce himself to Mary when he uttered “I love you”—this is not directly the case. The connection between C and D is a context E such that D ⊧ E and that is identified by contextually given information with the context CC given by the complement. This identification can be defined by requiring that there is a complete and consistent subcontext G of C∘D∘[X = John] such that G ⊧ E = CC. This will be the case if in G John is identified with whatever way he is represented in D, Girl with whatever way she is represented in D and D entails that he loved her. In short, the complement of Tell needs to have a counterpart that is entailed by D. Taking stock, this gives a, b and c in (130).
(130) a. A possibility w must assign four-tuples ⟨e, u, v, ⟨r, G⟩⟩ to Tell, with G a proper context and r a representation in G. Here e is the event of telling, u the teller, v the addressee of the telling and ⟨r, G⟩ the pointed context communicated in the telling event. r is the representation of the speaker of the communicated context within G.
b. (X Y D)Tell is complete within a context C iff C ∘ D is complete.
c. Given a context C for which w is a possibility, (X Y D)Tell is true with respect to w and C iff ⟨Fw((X Y D)Tell), Fw(X), Fw(Y), ⟨r, G⟩⟩ ∈ Fw(Tell) and there are contexts E and H such that G ⊧ E and H is a complete and consistent subcontext of C ∘ G ∘ [X = r] and H ⊧ D = E.
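The identity conditions in (128)–(129), on which clause (130c) relies, all reduce to checks over the possibilities of a context. In the sketch below, possibilities(C) is a hypothetical enumerator of the (finitely many) possibilities of C and F(w, phi) a hypothetical denotation function; both are stand-ins for notions the book leaves model-theoretic, so this is an illustration of the definitions, not an implementation of the interpretation algorithm.

def identical_in(C, phi, psi, possibilities, F):
    # (128): C |= phi = psi iff phi and psi denote the same object
    # in every possibility allowed by C.
    return all(F(w, phi) == F(w, psi) for w in possibilities(C))

def contexts_identical(C, D, E, possibilities, F):
    # (129): every representation of D has an intensionally identical
    # counterpart in E relative to C, and conversely.
    def covered(X, Y):
        return all(any(identical_in(C, phi, psi, possibilities, F) for psi in Y)
                   for phi in X)
    return covered(D, E) and covered(E, D)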
The intuitive interpretation of the set of concepts is to view them as the natural concepts that interpret words in natural language and their signature as the indication of the arguments these concepts take and the given earlier
material that the concept may build on. The context leading up to a representation (if C = [r0, …, rn] and i ≤ n, Ci = [r0, …, ri−1] is the context leading up to ri) is the context to which it was added (e.g., by being selected by the interpretation algorithm of the last chapter). Viewing complex objects as indexed representations with arguments and contexts presented by the new formalisation may seem abstract. It may be better to think of them in terms of multisets of linked contextualised concepts because that might be closer to the way they would be implemented in the brain. Different occasions of adding the same linked concept with links to the same other linked concepts provide a better way of thinking of double occurrences of formally identical representations than indexing. Linking would also be a compact way of constructing complex objects and warding off data corruption. Contextualisation by labelling is perhaps also more natural than contexts as sequences of representations. But the old and the new way are equivalent and the new way makes it easier to deal with the logical interpretation of these structures. A concept organises a set of given representations into a new unity which then functions as a new representation that can be added to the context in which the representations were given. Contexts are a way of articulating what is given already and what can therefore be constructed into a new unity. A model for a context is a possible state of affairs where all representations of the context are true. When conceived of in this way, contexts serve as the bearers of information, i.e., as propositions, as content of utterances, and as information states. To interpret a natural language utterance as an assertion and to accept its content amounts to extending the context which represents the agent’s information by the content of the utterance. The same happens if new information is found by visual or other perception. It is intuitive that expressions such as John, I, or my brother denote objects. They differ from expressions such as “a girl” in doing so in a definite way. If a context C contains Johni, then any model of the context extended with Johnj will assign to Johnj the same object.

C ∘ [Johnj] ⊧ Johni = Johnj
In this case, it is so because the natural concept, or better, given that John is a very common name, every natural concept that interprets the word John by being an antecedent of its presupposition is functional. Girl is not a functional concept. This is the notion of definite representation that is developed in more detail in section 5.5. It is perhaps less intuitive that the property of denoting objects (in a definite or indefinite way) extends to other categories, such as verbs and
logical operators. For verbs, it is customary to assume that they denote states and events. Internal objects for logical operators are, however, more unusual. They seem to behave as a sort of Ding an Sich and denote that aspect of the world in virtue of which the logically complex proposition would be the case. While this matter needs to be taken up in future work, for the time being it suffices to think of the objects denoted by logical operators as new abstract objects whose existence in a possibility is equivalent to the proposition being the case in that possibility.

5.2. Logic

What should be retained from the discussion of Tell presented above are two ideas. The first is that the completeness of a context argument D depends on a (pre)context of the representation, which would contain the non-context arguments of representations in D. It follows that the pre-context must always be a parameter in a definition of the truth of representations with a context argument. The semantic notion we will be defining is C, w ⊧ φ: φ is true in w with respect to C. Context arguments need not be complete by themselves (this is what results from the interpretation algorithm) but their arguments need to be defined in order for them to have a truth-value in any possibility. The existence of the arguments in the representations of the context argument must therefore be guaranteed by the context of evaluation. For example, in (131)

(131) C, w ⊧ (X Y[(X Y)Love])Tell
C must contain X and Y to make C ∘ [(X Y[(X Y)Love])Tell] complete, but also to make the incomplete [(X Y)Love] complete. Below, we define completeness and the interpretation of the context per operator. This leaves open the general question whether there is a general notion of completeness for representations with context arguments and what the denotation of a context argument is. Tell is an example of an operator introducing a context that is not necessarily consistent with its pre-context. Other examples of the same phenomenon are found in belief, in negation (if the negated material is already false in the pre-context, the negated material is not consistent with the pre-context) and in counterfactual implication. Given the modest aim of this section to reconstruct the logic of basic Discourse Representation Theory (Kamp and Reyle, 1993), counterfactuals will not be treated here.
In the remainder of this section, it will be shown that the representational scheme can be extended to a full logic by giving a syntax for logical operators (taken to be special concepts) and an appropriate model theory. The strategy is to keep as close as possible to DRT even where that is known to be problematic (mainly for implication, which comes out as a rather strict approximation to its much more flexible natural language counterpart).

5.2.1. Logical Operators
The possibilities need to have three components now: ⟨Uw, Fw, Bw⟩. The new component Bw assigns pointed proper contexts to persons in Uw that represent their beliefs. A pointed proper context is a pair ⟨r, D⟩ with r the representation in D of the subject of the belief. For quantification, a new auxiliary notion is needed.
variant(w, D) = {v ∈ Possibilities : v is exactly like w except for the values Fv(X) it assigns to the members X of D}.
The following is just an abbreviation. C + φ = C ∘ [φ]
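The notion variant(w, D) can be pictured very concretely: keep the universe and everything already assigned, and only vary the values given to the representations in D. The Python sketch below assumes a finite universe and a simple Possibility record of our own devising (not the book's notation); it merely enumerates such variants.

from collections import namedtuple
from itertools import product

# universe: a set of objects; assignment: a dict from representations (and concepts) to values
Possibility = namedtuple("Possibility", ["universe", "assignment"])

def variants(w, D):
    # All possibilities exactly like w except possibly in the values
    # assigned to the members of D (hashable representation objects assumed).
    reps = list(D)
    for values in product(list(w.universe), repeat=len(reps)):
        new_assignment = dict(w.assignment)
        new_assignment.update(zip(reps, values))
        yield Possibility(w.universe, new_assignment)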
Basic Formulas
A has signature ⟨n, 0⟩
C + (X1 …Xn)A is complete iff X1, …, Xn are members of C
C, w ⊧ (X1 …Xn)A iff ⟨a, Fw(X1), …, Fw(Xn)⟩ ∈ Fw(A) and Fw((X1 …Xn)A) = a
(The last clause is not superfluous: the sequence being in the extension of A does not guarantee that its first element is the value of the representation, or inversely.)
Identity
C + X = Y is complete iff X and Y are members of C
C, w ⊧ X = Y iff Fw(X) = Fw(Y)
C + (X)Id is complete iff X is a member of C
C, w ⊧ (X)Id iff Fw(X) = Fw((X)Id)
Complex Formulas Negation
A concept Not with signature ⟨0, 1⟩
C + (D)Not is complete iff C ∘ D is complete.
C, w ⊧ (D)Not iff ∀v ∈ variant(w, D) v ⊭ C ∘ D
Implication
A concept Imp with signature ⟨0, 2⟩
C + (D, E)Imp is complete iff C ∘ D is complete and C ∘ D ∘ E is complete.
C, w ⊧ (D, E)Imp iff ∀v ∈ variant(w, D)(C, v ⊧ C ∘ D ⇒ ∃u ∈ variant(v, E) C ∘ D, u ⊧ E)
Disjunction
Concept Or with signature ⟨0, 2⟩. The simplest way of defining the truth conditions of disjunction is by defining (D E)Or = ([(D)Not], E)Imp. This forces one to rely on the interpretation algorithm to deal with the problematic cases in the literature.2
Epistemic May

Concept May with signature ⟨0, 1⟩
C + (D)May is complete iff C ∘ D is complete.
C, w ⊧ (D)May iff ∃u ∈ variant(w, D) C, u ⊧ C ∘ D

Belief

A concept Believe with signature ⟨1, 1⟩
C + (X D)Believe is complete iff ∃G: G is a subsequence of C such that G ∘ D is a proper context.
C, w ⊧ (X D)Believe iff Bw,Fw(X) = ⟨Y, B⟩ and ∃E, G(B ⊧ E and G is a proper subcontext of C ∘ B ∘ [X = Y] and G ⊧ E = D).

Fact

Fact with signature ⟨0, 1⟩
C + (D)Fact is complete iff C ∘ D is complete.
C, w ⊧ (D)Fact iff ∀φ ∈ D C, w ⊧ φ
This notion is needed for defining complex presuppositions. E.g., (X D)Know can be true only if D is a fact; (X (D)Fact)Know is therefore a better definition. By completeness, (D)Fact then needs to be in C, and thereby the representations of D need to be in C as well.

2 This does in fact work. The nastiest problem is the cataphora in examples like:
(132) Either it is in a funny place or this house has no toilet.
The outcome must be as in (133), which also gives a first-order gloss:
(133) House, ([([(House)Toilet, ((House)Toilet)FunnyPlace])Not], [([(House)Toilet])Not])Imp
∃x(house(x) ∧ ∀y(¬(y is x’s toilet ∧ FunnyPlace(y)) → ¬∃z y = z))
The resolution of it has to wait until the house’s toilet is available and the toilet then needs to be assigned to the context of the first disjunct while the house itself is externally identified. This is the only consistent and complete reading and would be preferred over the in situ interpretation of the toilet.
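Read procedurally, the clauses for Not and May above are simply a universal and an existential search over variants. The sketch below reuses the illustrative variants generator introduced earlier and abstracts satisfaction as a parameter satisfies(v, context), a stand-in for v ⊧ C ∘ D; it is meant only to make the quantificational structure of the two clauses explicit, not to implement the full semantics.

def holds_not(C, D, w, variants, satisfies):
    # C, w |= (D)Not iff no variant of w on D satisfies C o D.
    return not any(satisfies(v, list(C) + list(D)) for v in variants(w, D))

def holds_may(C, D, w, variants, satisfies):
    # C, w |= (D)May iff some variant of w on D satisfies C o D.
    return any(satisfies(v, list(C) + list(D)) for v in variants(w, D))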
Relative
Relative with signature ⟨1, 1⟩ (facts from contexts)
C + (X D)Relative is complete iff C ∘ D is complete and X is a member of C
C, w ⊧ (X D)Relative iff ∀φ ∈ D C, w ⊧ φ and Fw((X D)Relative) = Fw(X)
This is a way—borrowed from the exocentric relative clauses of English—to get hold of the value of a representation after an update with a context. It can be used for the semantics of relative clauses and for reconstructing the notion of a definition. X must be a member of D in a proper use. The construction gives a definition only if (X D)Relative is functional. This completes the logical definitions that we will use from here onward. C, w ⊧ φ is equivalent to saying that w is a possibility for C+φ. That means the above definitions also define which possibilities survive an update. Let [[C]] = {w : w is a possibility for C} Then [[C + φ]] = {w ∈ [[C]]: C, w ⊧ φ}. The interpretation given to the formalism is therefore a dynamic logic: it says under what conditions a new representation can update a context, it says what is the form of the new context and it says what possibilities will be retained and what possibilities are eliminated. Neither the form nor the set of possibilities in which the form is true is what the context really is: they go together. The form is needed for complex operations such as presupposition resolution and accommodation and for defining interpersonal agreement, the possibilities—and in future extensions of the system probability distributions over possibilities—determine how it can be extended by communication, perception and inference. This makes it a more realistic model of mental information states than purely eliminative or purely syntactic models of information states.3 An issue not addressed in the definitions concerns the internal objects needed for the logical operators. This is an admitted imperfection: objects should be supplied for negations, implications, facts etc. The problem comes to the fore where one would need argument occurrences of operators. In natural languages the problem seems to be restricted to facts, where the solution is easy: the sequence of all the denotations of the elements in the context would do fine. The definition for Not should more properly read as in (134).
3 Veltman (1996) gives a purely eliminative model of update semantics; DRT can be seen as a syntactic update semantics.
(134) C, w ⊧ (D)Not iff ∀v ∈ variant(w, D) v ⊭ C ∘ D and Fw((D)Not) = a, where a is the absence of D.
A proper theory of absences would have to define new abstract objects that preclude the truth of contexts D. A formal construction of the requisite classes of objects (for absences, regularities, epistemic possibilities, etc.) seems possible as a model construction but this enterprise will not be pursued further here. With these definitions, the methodological programme of Kamp (1981) has been completed. In Chapter 4 Interpretation, an account was given of the selection of a representation in a context from a verbal input, an account that is grosso modo the same for the conceptual side of visual perception and for the non-verbal signs that accompany verbal utterances. In this chapter, the sets of possibilities in which a context holds were defined and this definition can be extended to a definition of truth and reference using external properties of mental states. Taken together, a definition of truth conditions for natural language utterances was given. The main difference with Kamp’s programme is that much more of pragmatics was integrated in the interpretation algorithm of Chapter 4 Interpretation.

5.3. Mental Representations in Philosophy

In the new format and its model-theoretic interpretation, one comes close to the traditional notion of mental representation, the one featuring in the philosophical tradition from Aristotle to Husserl and prominent in traditional epistemology. This section demonstrates that the structures emerging from Chapter 4 Interpretation and redefined in the last section can indeed be interpreted as such traditional representations while they can—as was shown in the last section—also be seen as the building blocks for a logical language for discourse representation with a standard logical model theory. The notion of mental representation that one finds in the philosophical tradition from Aristotle to Husserl is closely related to perception-based knowledge. A typical modern criticism of this notion is that it is based largely on introspection. Yet given the amount of work the notion performs, that objection seems unfair. While appealing to our experience of ourselves in engaging in perception, knowledge, and thought, this notion has been used in sophisticated accounts of these concepts and activities and has been at the very least partly shaped by the analytical requirements on such accounts.
Let us list some properties that have been attributed to the notion.4

A. A mental representation is what brings unity to the multitude of direct experience and the multitude of other mental representations.

B. Mental representations are both propositional and referential. A mental representation of, e.g., John in his office refers to the state of John being in his office, that is, it is a way in which that state is given, but it is also the thought that there is such a state of John being in his office.

C. A mental representation always has an internal object and may also have an external object or objects. In the example of ‘John in his office’, the internal object is the state of John being in his office, and the external object, if any, is the state in external reality that is given by John being in his office. The external object only exists if John is in fact in his office. If a representation does not single out one single object, one can make the assumption that it has many external objects: the objects that satisfy the representation in external reality.

D. Mental representations may have external properties, such as being a mental representation that represents the subject of the representation, representing what is perceived by the subject, or representing a particular external object.

E. Mental representations are consistent. There is no stable mental representation of a square circle or of John being in his office and outside of it at the same time.

F. A mental representation is organised by a concept of its internal object.

G. A mental representation may represent counterfactual objects or states (a golden mountain, me being at this very moment on the beach instead of at my desk).

H. Mental representations can be logically complex.

I. Barsalou et al. (2003) let the conceptual system be grounded in the brain’s modal systems for perception, action, and introspection. Traditionally, concepts (or the representations based on these concepts) are what is shared by perception, action, and thought.

J. Mental representations are contextual in the sense of depending for their propositional content, truth conditions, or reference on other mental representations.
4 Twardowski (1977), a rather late and analytic exposition of the traditional notion, is an important source for this inventory.
The last two properties are more recent considerations. Barsalou’s point of view is based on the traditional notion of representations being the interface between thought, action, and perception. In his view, the content of concepts is determined by their definition in these three modes. Concept-based representations are what ties together motor states, epistemic states, and perception. Barsalou suggests that a concept is just a connection between such modality-specific definitions. Contextual dependency is closely linked to theories such as discourse semantics, dynamic semantics, or update semantics, where meanings are essentially dependent on the context and map contexts to new contexts. In fact, few things are as natural as thinking of mental states in terms of continuously acquiring new information through perception and the subject’s own action and reflection. The relation between dynamic semantics and mental representation as such has not yet been systematically explored. If the formalism developed in this chapter formalises mental representation, it therefore makes sense that it is developed as an update semantics. But this does not decide on the question whether or not some contexts and the representations in them are true and false not in a model but in reality as such. The traditional argument for the logical systems in mathematics is that reality is just another model for the languages studied. All that one needs to assume is that for the predicates in the language, a model determines their extension. It is precisely at this point, that doubts set in about the natural concepts that have emerged in our biological and linguistic evolution. It seems fair to say that many of these do not have clear boundaries, apply only to some degree to the situation to which people apply them, have elements that are not clearly truth-conditional or even completely lack truth-conditional content. In discussing the model theory for the formalism this is not necessarily always a problem. The model theory is there to define a logic for the representations and that is necessary for the foundation of classical and probabilistic inference. But absolute truth in reality is another matter. The following is an attempt. A context is true if it has a possibility w that can be isomorphically embedded to ‘the world as it is’ in a way that respects the way the context is situated. Isomorphic means that its domain can be mapped 1–1 to objects in the world by a function i such that whenever the possibility w specifies that ⟨u1 , …un ⟩ ∈ Fw (A) the natural concept A must hold of ⟨i(u1 ), …, i(un )⟩ in the eyes of anyone who knows all there is to be known about the concept A and about the relevant objects in the world. Knowing everything about a concept A that is individuated only by reference to the subject amounts
to knowing with what perceptual, representational, and action features it is connected for the subject. For concepts that represent the meanings of words in a public language or those that are publicly given in other ways, such as in shared perception, there may be more to know and the subject itself may not know all there is to know about the concept. Respecting the way in which a context is situated means that the isomorphism i maps the subject of the context, as represented in the context, to herself and does the same with respect to objects of direct experience in the context: if an object in a world x was experienced by the subject and caused her to represent x by a representation φ, the denotation u of φ in the model should be mapped by i to x. A model of the context of representations of a subject should be about the correct elements of the world under the isomorphism. It seems fair to say that some concepts will do well under these considerations and consequently that there are some contexts that are true. It seems equally fair to say that many natural concepts will not fare well and that in consequence there are many contexts that are not true or false in the world as it is. A representation φ is true in consequence of the fact that the context C to which it belongs is true. By belonging to C, it however also belongs to proper subcontexts C’ of C, and in particular, it belongs to the smallest subcontext of C which consists only of those representations of C on which φ depends and of those representations on which they depend in turn. The question whether a representation φ of C is true can be reformulated as a question whether the smallest complete C’ ⊆ C that contains φ is true. This also extends to reference. Definite representations in a context are those which necessarily refer to one single object in each model of the context. Definite representations in a true context will therefore have a single external object if they do not depend on an indefinite representation with multiple external objects. Let us go over the properties of representations again and compare them with the formal system developed in sections 5.1 and 5.2. A. A mental representation is what brings unity to the multitude of direct experience and the multitude of other mental representations. This feature would be captured by the syntax of representations since they use a concept to connect a set of given representations into a larger representation. As discussed above, a representation is not uniquely determined by its syntax—there can be two distinct representations that are indistinguishable in their form—but the syntax accounts for the fact that a concept unites representations into new, larger units.
B. Mental representations are both propositional and referential. A mental representation of, e.g., John in his office refers to the state of John being in his office, that is, it is a way in which that state is given, but it is also the thought that there is such a state of John being in his office. A representation in a context can be identified with the minimal proper context that contains it. The set of possibilities of such a context can be identified with its propositional content. That is the extra information that it contributes to the context. C. A mental representation always has an internal object and may also have an external object or objects. In the example of ‘John in his office’, the internal object is the state of John being in his office, and the external object, if any, is the state in external reality that is given by John being in his office. The external object only exists if John is in fact in his office. If a representation does not single out one single object, one can make the assumptions that it has many external objects: the objects that satisfy the representation in external reality.
The function from possibilities w for the context of φ to Fw (φ) can be identified with its internal object. If a proper context of a representation is true, the representation denotes an external object. If it is definite in the context, the external object is also uniquely determined, if it does not depend on indefinite representations. In that case, the existence of an external object is guaranteed, but not its unique determination.
D. Mental representations may have external properties, such as being a mental representation that represents the subject of the representation, representing what is perceived by the subject, or representing a particular external object. This property was used for the situated isomorphism between objects in a model and external objects in the world. In this notion, a context of mental representations is taken as a representation of somebody’s mental state. The representations originating in observations of the subject, as representing the observed objects. E. Mental representations are consistent. There is no stable mental representation of a square circle or of John being in his office and outside of it at the same time. This is captured by considering only contexts that are complete and consistent. Notice that this property falls out of the selection algorithm of Chapter 4 Interpretation: an inconsistent context has prior probability 0, which
means that neither the context nor the representation that makes it inconsistent will be selected. An incomplete context would result from selecting a concept with a presupposition that remains unlinked, something that is also ruled out by the interpretation algorithm. F. A mental representation is organised by a concept of its internal object. This is captured by the definition of a representation in the formalism: it always needs a concept to organise it. G. A mental representation may represent counterfactual objects or states (a golden mountain, me being at this very moment on the beach instead of at my desk). Such objects can be represented but their representations are not true. It is possible, however, that such representations occur as part of context arguments and the representations in which that occurs are true. H. Mental representations can be logically complex. This was worked out in section 5.2. I. Barsalou et al. (2003) let the conceptual system be grounded in the brain’s modal systems for perception, action, and introspection. Traditionally, concepts or representations based on these are what is shared by perception, action, and thought. The modal specificity of conceptual content is instrumental in providing content to the judgement that a particular concept applies to a sequence of objects. If the concept were not applicable, the objects would look different, feel different, sound different, would not be carried out in this way, etc. There is no guarantee that what is there in the various modes adds up to the determination of a sharp extension. This, however, seems to be a realistic feature of how thought relates to reality. Reasoning is possible without external truth—it can be defined purely on possibilities (and be extended to probabilistic reasoning by a probability distribution over possibilities). J. Mental representations are contextual in the sense of depending for their propositional content, truth conditions, and reference on other mental representations. This is a consequence of the way contexts are set up. Mental representations, even those arising in vision, can and often do collect representations from
memory. When I see my sister and recognise her, part of my representation of the experience is the connection between my sister as given in memory and the current visual signal. In sum, it may be concluded that there are enough similarities to see the new formalism as a formalisation of the traditional notion of mental representation. Frege’s criticism of the notion (Frege, 1884) is that it conflates concept, proposition and reference: one and the same entity is all three. The criticism is met here by the distinction of three different roles: as the source of links to the context (concept), as the concept identifying its object (in an argument occurrence), as the claim that its object exists (as a member of a context). A concept is a representation that is unsaturated only in a very weak sense: it will make a stronger claim if its argument places are properly bound and it may mean next to nothing if they are not. Lack of saturation in the account of this book is much better located in the anaphoric and presuppositional demands on the context by a concept that is given as the meaning of a word. If the demands remain unfulfilled, the interpretation process crashes. It is not accidental that certain concepts exhibit absolute demands on the context of this kind (the usefulness of personal pronouns is directly connected with an absolute demand of this kind, certain verbs will require arguments for the concept they express to have any substance) but there are languages without obligatory arguments and without pronouns. And quite possibly also without presupposition triggers that cannot accommodate their presupposition. It seems therefore that a notion of concept based on saturation is problematic, since even without arguments, concepts can (weakly) determine objects and make the (weak) claim that such an object exists. In the difficulty of stating what the internal and external objects of logically complex representations are, one can however see the beginning of an argument against the traditional claim that every proposition can be seen as the claim that some representation has an external object. That claim is however the basis of quite successful modern enterprises such as Martin-Löf’s type theory (Martin-Löf, 1987). The argument could presumably be warded off by a type theory for classical logic.

5.4. Belief

In section 5.2, the semantic interpretation of a belief ascription was defined as follows.
Belief
A concept Believe with signature ⟨1, 1⟩
C + (X D)Believe is complete iff ∃G: G is a subsequence of C such that G ∘ D is a proper context.
C, w ⊧ (X D)Believe iff Bw,Fw(X) = ⟨Y, B⟩ and ∃E, F(B ⊧ E and F is a proper context that is a subsequence of C ∘ B ∘ [X = Y] and F ⊧ E = D).
This is ipso facto also a definition of somebody agreeing with another person from the perspective of a person with information C. The second person must be assumed to have a set of representations that can be identified with a set of representations within C using the conceptual resources available in C or known to be available to the other person and taking into account that the other person’s beliefs about herself are about herself. This account makes sense of many ideas developed for the semantics of belief attributions and solves some well-known problems in this area: Frege’s problem, logical omniscience, Schiffer’s idea that objects in a belief attribution are always objects under a description, counterparts by experience and communication and finally intensional identities as discussed by Edelberg. a. First of all, it deals with Frege’s problem, that is, with the problem that forced Frege to accept that names have an objective Art des Gegebenseins, a Sinn (Frege, 1892).

(135) The Babylonians did not believe that Hesperus was Phosphorus.
Frege’s problem is to explain that two names of the same object can occur in such a belief attribution. If the meaning of the name were exhausted by its denotation this would not be possible and Frege accepts another meaning component in the meaning of names: its sense. It seems unproblematic to assume that the Babylonians had two definite concepts MA and EA for the morning and evening appearance of Venus (two different representations of stars) and that they did not accept MA = EA. If the name Hesperus is interpreted by EA and the name Phosphorus by MA, the attribution (135) would come out as true in a model of a context C that reflects our current beliefs. Moreover, on such a context C, MA = EA, as is also the case outside. MA and EA have Venus as their external object, both are identified with Venus in C; what is missing is the internal identity in the Babylonian beliefs. What is problematic is to associate MA and EA with the names Hesperus and Phosphorus as their sense and nothing said in this chapter forces such a
view. The names are devices to pick up given representations in the context. It would be possible to pick up other representations for Venus with the names that can be attributed to the Babylonians, let’s say two times MA or specialisations of MA. But these will be unfavoured interpretations of (135) by having low priors or as “uncharitable” interpretations. Frege’s example is special in involving names which seem descriptive of the representations to which they would resolve. Other representations again will be ruled out by not being attributable to the Babylonians and thereby unable to meet the criterion of identifiability. Exactly the same holds for Kaplan’s demonstrative variant of Frege’s puzzle, that is, for a very slow pronunciation of The Babylonians believed that this is not … that accompanied by a pointing to Venus in the evening and one to Venus in the morning after a long wait. These are pointings to the same object but not according to the beliefs of the Babylonians. The demonstratives and the associated pointings must also be resolved to MA and EA, for the context to be true according to the Babylonians and it is not necessary to assume an objective way of being given that is associated with “Phosphorus” and “Hesperus” and serves as their meaning. Other concepts to which the names could be resolved would not meet the criterion of identity between the context given by the complement sentence and a subcontext of the Babylonian beliefs. b. In this new picture, logical omniscience disappears. To use an example much like Kamp’s in Kamp (1990), let us assume John has pulled all coins out of his pocket to purchase a sausage. He sees all of the coins but has not yet done the calculations, so the crucial question Does he have € 2.50 or not? is to him still an open question. He sees six 10 cent coins, five 20 cent coins, and a 1 euro piece, and even realises what they are and what is their number (this is hard to switch off in vision). Even so, however, he does not yet believe he has enough to buy his sausage. It holds in all his belief alternatives (possibilities of his belief context) that he can buy the sausage, since he has the requisite €2.50. But, crucially, the state of his having €2.50 is not yet part of his belief context and so he still does not have the belief until he has performed the calculation and updated his belief context with the new representation of the amount given by his coins. The notion of logical consequence under which John has € 2.50 can be defined as in (137), based on the notion of variant(w, D) defined in (136). (136) variant(w, D) = {v : Uw = Uv ∧ Fw ⊂ Fv and dom(Fv ) = dom(Fw ) ∪ D}
But he does not have €2.50 in his belief state yet under the weaker notion that is part of the proposed definition of belief.

(137) C ⊧ D iff ∀w(w ⊧ C ⇒ ∃v ∈ variant(w, D) v ⊧ C ∘ D)
It is therefore a logical consequence of John’s beliefs that he has € 2.50 but he does not believe it yet, since that belief uses a notion of logical consequence that will not extend the domain of the model. Contexts in this view are not just sets of possibilities or syntactic objects: they are both at the same time. Belief contexts define a person’s epistemic alternatives but they also include the objects that the person believes to exist, and that not only in the representations that make up the context but also in the assignment of objects to these representations. The objects of belief are objects in the co-domain of the function that assigns objects to representations in any model for a context. In our example above, the context entails that John has €2.50 but the sum of the values of the coins is not yet given. In order to ascribe John the belief that he has enough, the domains of the possibilities for Bw,john would need to be extended. If John—as he is bound to—makes the calculation, he thereby acquires the belief. c. A central property of the representations considered here is that they cannot contain an object without the mental representation that says how the object is given to the subject. This is related to Schiffer’s idea (Schiffer, 1992) that in belief attributions, there is always a way in which the object is given to the belief subject, a way that may not be recoverable from the syntactic form of the belief complement alone. Supposing the example (138) holds, it follows that the belief subject John has representations denoting the same objects as She, Loc with respect to a proper subcontext of C∘Bw,john ∘[X = John], but what these representations are, i.e. how these objects are given to John cannot be determined from the form of the complement. (138) John believes that she is here. (John [(She Loc) At]) Believe
In Quine’s famous example Ralph believes that Ortcutt is a spy (Quine, 1956), Ralph would not assent to the statement that Ortcutt, the well-respected mayor, is a spy, because of the representation to which he resolves the name Ortcutt. In the common ground C for the example where Ralph has seen a man with a brown hat taking pictures of the local missile base, a man we know is Ortcutt, we can resolve Ortcutt to the man in the brown hat and attribute the belief that Ortcutt is a spy to Ralph.
Kripke’s Paderewski puzzle (Kripke, 1979) can be solved in the same way. Many people know Paderewski was both a well-known Polish politician and a famous concert pianist but not everyone is aware of this fact. John even thinks it is false, a conclusion based on his low opinion of politicians’ musical abilities in general. So (139) is true.

(139) John thinks that Paderewski the politician is not the same as Paderewski the pianist.
John knows that the politician is called Paderewski and also that the pianist is called Paderewski. Paderewski is a regular proper name but it will be associated with two different ways in which the object is given in John’s beliefs. The puzzles make it clear that complements that are each other’s linguistic negation can both be beliefs of the same subject at the same time. But the representations that interpret the complements are not formal negations of each other. The belief states of the subject can be fully consistent. d. The proposed account also recovers the idea of counterparts by experience and by communication of Zeevat (1997). If one knows that John sees Tom, John will have a representation of Tom, as, let us say, that guy over there (abbreviated as Guy below). John’s representation is a counterpart of Tom because (140) is the case. We see Tom and that he is denoted by Guy. We attribute Guy to John and there is consequently a proper subcontext F of C ∘ Bw,j ∘ [X = j] such that (140).

(140) F ⊧ Tom = Guy
Therefore, we can attribute any belief of John that involves Guy as a belief about Tom. Similarly, communication makes it the case that a referent derived from communication, e.g., from an utterance of Bill has a new girlfriend, is a counterpart of the one in the speaker’s belief context. Assume the speaker’s context is C and the hearer’s D. The utterance creates a new representation R1 for Bill’s new girlfriend. If the speaker reported what he believed, his own representation R of the new girlfriend is identified with R1 by a proper subcontext of C ∘ D. If the hearer believes the speaker, his model of the speaker’s information C’ contains the expressed information and D ∘ C’ contains a proper subcontext that identifies the hearer’s and the putative speaker’s representations. Even as outsiders witnessing the conversation, it is possible to make the identification based on beliefs we attribute to speaker and hearer. From the hearer’s and the outsiders’ perspective, it is quite
possible that the speaker was insincere and deceived the hearer or both the outsiders and the hearer. The outsiders can withhold belief in the speaker, and then the speaker merely told something to the hearer. The putative girlfriend in the story will still be identified with the girlfriend in the new hearer belief. A Geach example (Geach, 1962) like (141) raises two questions. The first is how it is possible that “she” is resolved to the NP “a witch” that is only given as part of what Hob believes: there could not be an externally given witch given that witches do not exist. The second is what it means that Hob and Nob have their belief about the same fictional witch. (141) Hob believes that there is a witch in the village and Nob believes she poisoned his well.
The answer to both questions should be that Hob’s representation of the witch in his belief and Nob’s representation of the witch in his belief should be identifiable given their other beliefs. If D is Hob’s belief context and E is Nob’s, D ∘ E should have a proper subcontext that identifies their respective representations WH and WN. And it is clear that this will be the case under many circumstances. Newspaper stories, rumours, the suspicious behaviour of an innocent old woman all can make it the case that D and E have enough information to identify WH and WN. e. Edelberg (1992) gives a convincing counterargument against the idea that all coreference between attitudes can be captured by counterparts by experience or communication, thereby refuting an account like Zeevat (1997) that relies solely on such counterparts. A typical example illustrating Edelberg’s objection is (142).

(142) Bill has staged a car accident by pushing a car against a tree and pouring tomato ketchup in the grass next to the driver’s seat, and hides behind a tree. John comes by, concludes that somebody had an accident and walks away. Mary arrives on the scene, sees the car and notices the ketchup in the grass, concluding that the driver is wounded.
(142) supports the description in (143). (143) John believes that somebody had an accident and Mary believes he is wounded.
The person who had the accident is inferred by John in the example and Mary has made the same inference. Since nobody had an accident, the identity cannot be explained as a counterpart by experience and since there was no communication, the representations of John and Mary cannot be counterparts by communication.
There is however more than enough in their respective beliefs for the identification of the putative driver. It is the driver of the car in the same accident scene that the outside observer will take them both to represent on the basis of their observation. Another example of Edelberg is the following and illustrates the asymmetry of counterparthood, predicted by the definition here. Detectives Arsky and Barsky both separately saw the body of Smith whom they both wrongly believe to have been murdered (the death was accidental). Later on, Barsky—but not Arsky—acquires the belief that the murderer of Smith is from out of town. Arsky comes to believe that he has red hair. The detectives never speak to each other. (144) a. Arsky believes somebody killed Smith and Barsky believes he is from out of town. b. *Barsky believes somebody from out of town killed Smith and Arsky believes he has red hair. c. *Arsky believes someone with red hair killed Smith and Barsky believes he is from out of town. d. Barsky believes somebody killed Smith and Arsky believes he has red hair.
Examples (144a) and (144d) show again that a counterpart analysis is not sufficient: there is no perception of the murderer and there is no communication about the inferred murderer. Examples (144b) and (144c) show that the order matters: they both seem incorrect on the basis of the story, and they are (144a) and (144d) turned around. This asymmetry can be explained by assuming that the linguistic description of the murderer matters. Barsky does not have a representation of somebody with red hair who moreover is the murderer of Smith. Arsky does not have a representation of somebody from out of town who moreover killed Smith. But they can share beliefs about the murderer of Smith. An appropriate counterargument to this analysis would be Geach’s (145). Surely, it is not necessary that Nob shares Hob’s belief that the witch lamed his mare.

(145) Hob believes that a witch lamed his mare and Nob believes she poisoned his well.
Nob however would have to share Hob’s opinion that she is pretty, if the example were (146). It is the information in the NP that needs to be shared with the other speaker. (146) Hob believes that a pretty witch lamed his mare and Nob believes she poisoned his well.
This section has demonstrated that one can develop an account of belief sentences within the account of mental representation of this chapter that does not need anything beyond a representational account of belief. It does not need Frege’s objective senses or Schiffer’s hidden ways of being given. It does not need a counterpart relation, anchors or entity representations, as in Zeevat (1997); Kamp (1990). The function of the counterpart relation or anchors is taken over by the relation C ⊧ R1 = R2 demanding that the representations have identical denotations in all models of C. The entity representations or Heim’s file cards (Heim, 1983a) can be defined per context as the set of representations that are identified by the context. This can be extended for representations that have an external object to the class of true representations in the context that the context identifies with a given representation and so gives a concept of how a given object is presented in the context. A context can have several such “presentations” of the same external object and—as discussed above—this is crucial for the identity puzzles around belief. The only thing that cannot be reconstructed is haecceitism. But according to Duns Scotus only God and the angels can know objects by their haecceities and it seems right to side with Duns Scotus and Schiffer on this issue.

5.5. Definiteness

The typological problem of definiteness is quite distinct from the semantic question of providing a meaning of the definite article (if indeed there is such a thing). Many languages have independently developed systems of definiteness marking, which include optional and obligatory articles, special morphemes, and word order (Lyons, 1999). This typological fact can only be explained if definiteness is a natural concept. How else could humans see definite NPs as a unity and learn when to mark NPs for definiteness? How could many different languages develop marking strategies for the same distinction? If definiteness were not a natural concept but one that falls apart into different concepts, one would expect splits in these marking strategies whereby one kind of definiteness would be marked and the other not marked or marked in a different way. The concept must be not just natural—a prerequisite for humans being able to use it in their generation decisions and in interpretation strategies—but it must be the sort of concept that plays an important role in reaching coordination in communication. Only such concepts can explain the recruitment processes that grammaticalise lexical items to become the markers
of definiteness (as articles and morphemes) or that explain constraints on word order. One can take accounts of the semantics of the definite article and try to find a natural concept underlying these theories. Treatments of the semantics of the definite article form three groups. The oldest one is based on the notion of a definition and is exemplified by Frege’s proof-theoretical account in Frege (1884), by the quantificational version of Russell (1905), and more recently, by the partly presuppositional account of definites of Heim (2012), which can be considered an update of Strawson’s (1950) account, itself based on Frege’s account, but removed from its proof-theoretical setting. The concept of a definition, however, applies to some but not to all definite expressions. Demonstratives, personal pronouns, and names are not definitions since in these cases, there is no property that can be understood as the definition of the object. It also leaves out the most frequent class of lexical NPs starting with the, namely those where the noun by itself does not give a unique description, e.g., the professor, the doorman, or the cat. In order to extend the concept of definition to the larger class, additional stipulations are needed. Heim (2012) indeed finds hidden properties in pronouns and names and hidden additional restrictions in short definite NPs. While this is a possible way to go in semantics, it does not seem an intuitive way at all. It is just implausible that a child learning a language would naturally hit on the idea that there is a definition hidden in a personal pronoun, a demonstrative, or a short definite description. A problem for this kind of approach is also the well-known problem of the rabbit in the hat. In this case—there are two hats in the scene, only one of which contains a rabbit, and there is a second rabbit—a definitional theory seems to predict that one should say “the rabbit in a hat” since “hat” does not give a definition. But languages typically do not follow this prediction. A more recent theory that addresses definiteness is the familiarity theory of Christophersen (1939), Hawkins (1978), and Heim (1982). This theory claims that one uses definites for objects one is already familiar with and the concept of familiarity would be a candidate for a natural notion of definiteness. Again, while definite expressions are often used for familiar things, there are many uses of definite expressions for things that are new to the discourse. Long definites and bridging definites are typically used for discourse-new objects and an important use of demonstratives is to draw attention to a discourse-new object in order to talk about it. In many theories, the fix is to appeal to presupposition accommodation but that does not always work (for example, demonstratives to discourse-new objects are misdescribed as accommodating their referent).
Another more recent theory is the functional account (Löbner, 1985). This approach claims that in the nominal cases, the concept structure associated with the noun is coerced into being a function that maps explicitly given or context-supplied arguments to the referent. This functionality would then be a candidate for the natural concept of definiteness. The theory finds confirmation in a wide range of uses of definite descriptions, but does not obviously apply to the anaphoric uses, including the pronominal definites, and to some properly definitional uses. So there indeed seem to be natural concepts important for coordination behind these approaches to the semantics of definite NPs but none of them applies across the full range of uses. The semantic theories are typically completed by adding additional elements, sometimes from the other approaches, to obtain a satisfactory coverage and the results seem empirically adequate.5 On the whole, these accounts do what they promise but the core concepts behind them do not give a notion of definiteness that could provide a satisfactory typological explanation of definites. The underlying concepts do not apply across the board and their importance to communicative coordination is not clear. A combination of concepts would do better but a combination is not what one needs: it does not explain the conceptual unity of definiteness which finds a morphological, lexical, or syntactic expression in many different languages. The point of this section is that there is a simple logical notion connected to a proper context of representations C which divides all representations that are either in the context or could update the context C into definite and indefinite representations. The representations that could not update C are ‘meaningless’ because they have arguments that are not interpreted in models of C. Intuitively, those mental representations that recognisably single out one internal object within the larger mental state are definite. This can be defined as in (147). It is necessary to add a copy (the same representation with a new index) of the given representation to test the property: the copy must have the same denotation as the given representation. (147) Definite representation
φi is definite in C iff for all j C ∘ [φj ] ⊧ φi = φj
5 The approach of this book suggests that empirical adequacy needs a stricter definition. One wants to predict under what circumstances speakers employ a definite expression and to predict the interpretation that hearers arrive at for particular uses of definites. The semantic theories typically give only an approximative answer to the second half of this question, but leave the answer to the first question to ‘pragmatics’.
168
chapter five
A definite expression in a natural language is then one that is intended to be interpreted by a definite representation. This is the whole theory. But it is necessary to show that the theory captures the right representations and that definite expressions can guide the hearer to the right representations. Everything that already has a representation is definite because it can be represented as (R)Id, which applies the identity function to the old representation R. And all cases of applying another functional relation to given representations are captured as (R)F, where F is the functional relation. By a further detour, (R C’)Relative also everything that is definable from a context by a definable functional relation is definite. This new concept is one that is motivated by relative clauses in natural languages.6 C, w ⊧ (R C’)Relative iff C ∘ C’ is complete and consistent, C, w ⊧ C ∘ C’ and Fw ((R C’)Relative) = Fw (R) If C is a context such that given C, R ∈ C has a unique internal object, (R C)Relative is a definition of that object, comparable to “the x such that φ”.
Definiteness defined in this way therefore collects all three core notions. The prototypical familiar object is one that is available in the context. The prototypical functionally defined objects are the objects that are defined from given objects by functional relations. Proper definitions are captured in the last step.7 The definition of definite expression and definite representation has direct practical consequences. If a speaker wants to coordinate on an external object, the choice for a definite description is strategic. If the hearer indeed chooses a definite representation to interpret the speaker’s definite expression and moreover the intended one with an external object, coordination is reached on that external object: it is in joint attention by means of joint definite representation. Even if the external object is not there, there is coordination the concept that would give the object, if it would be there. Definiteness is therefore a verbal means for reaching Tomasello’s joint attention (according to Tomasello (1999), a specific human ability, absent in
6 This is not the only way to deal with relative clauses but one that is suggested by the externally headed relative clauses such as are found, e.g., in English. It is meant for the case that R occurs in C’, otherwise it collapses to Id. 7 Beaver observes in Beaver and Zeevat (2006) that proper definite descriptions cannot be given by accumulating descriptive material until uniqueness becomes probable. From our perspective, that means that a definition must be recognised as such by speaker and hearer in order to count as a definite representation.
mental representation
169
chimpanzees without special training and realised in the uniquely human behaviour of pointing). It is only to be expected therefore that human languages recruit words, morphs, and word order to mark a logical distinction that is directly relevant for joint attention to an object. In order to check that definite expressions can be systematically interpreted by a definite representation, definite expressions aligning with the referential hierarchy (see Chapter 3 Self-Monitoring) will be checked in (148). (148) reflexive pronoun (X:AGR:clause:crash)Id anaphoric pronoun (X:AGR:focus:crash)Id demonstrative pronoun (X:AGR,DISTALITY,(Speaker X)Point:(visual;focus):crash)Id name (X:(X NAME)Called:cg:crash)Id long name (X:(X NAME)Called:cg:accommodate)Id
These all call for a resolution to a given representation and are definite by having the identity function Id as their concept. In the first four cases, the resolution is mandatory: only long names (with an apposition) can accommodate, presumably because the extra information makes the accommodation of the name with the apposition into a definite representation. The same division of labour is found in definite descriptions, where the long ones accommodate and the short ones have to be resolved either by resolving to a given representation or by finding highly activated arguments for a functional interpretation of the noun. A provisional definition is given by the disjunction in (149), which suggests that a short NP is anaphoric, metonymically anaphoric, or provably functional, taking in a highly activated argument (in that order). (149) the N (short) (N:cg:crash)Id; ((N::cg:crash)F::cg:crash)Id; (X:definite((X)N):focus:crash)N
The second case covers metonymic reference. The ham sandwich as in (150), said by one waitress to another (Nunberg, 1995) refers to a activated ham sandwich ordered by a customer. The resolution needs to find the sandwich as ordered by the customer and the function (from an order to the person who made the order) to give the customer who ordered a ham sandwich.
170
chapter five
(150) The ham sandwich left without paying.
The third option is to find an in-focus X such that N interpreted as a function gives (X)N as a definite representation. Here one should allow both coercion of N and in-focus antecedents that could not function as antecedents for anaphoric pronouns. E.g., for ‘the waiter’ in a restaurant, the noun should be coerced into being ‘the waiter who helped X’ and ‘X’ could be our party even if our party is not an overt antecedent. It is the long definites that are the most complicated, both from the perspective of syntax and from the perspective of representation. Let’s treat Newman1 from Kaplan (1968), the non-vivid name for the first person born in the 21st century (this person does not seem to have become any more vivid with the passage of time). The descriptive material gives a context (151). (151) 21 (21)Century Man (Man)Birth ((Man)Birth (21)Century) In (((Man)Birth (21)Century) In)First
This context makes Man and (Man)Birth definite under the rather dubious assumptions in Kaplan (1968), so it is possible to form a definite representation (152) with the help of the concept Relative introduced above. (152) (Man [21 (21)Century Man (Man)Birth ((Man)Birth (21)Century) In (((Man)Birth (21)Century) In)First] Relative
The hearer will select this representation for Newman1. The theory of definite expressions seems to work especially well for definite descriptions like the rabbit in the hat, the man with the brown hat, the man holding the campari which are meant to identify their referent by means of an additional visible attribute of being in a hat, wearing a brown hat, or holding a campari. In these cases, we can assume there to be other hats, other brown hats, or more putative camparis, but the identification strategy requires one of them to be identified as part of identifying the rabbit or the man. The setting of identification within a visually given scene makes it then necessary that the hat, the brown hat and the campari are interpreted by visually given representations, i.e., the NPs in question fully meet the criterion that the speaker intends them to be interpreted by definite
mental representation
171
representations, unlike the definitional or functional theories which would prefer an indefinite realisation because of alternative hats or camparis in the scene. It could perhaps be argued that the preference for definite marking comes from a preferred interpretation of forms “the N of an N” as an indefinite. For example, (153) does not entail that the colleague the speaker has in mind has only one son. (153) The son of a colleague came to see me.
This is to be contrasted with (154) where the two examples contrast in acceptability, with the functional father outperforming the non-functional son for acceptability. (154) The father of every student came to see me. (?) The son of every colleague came to see me.
This suggests that (153) is a complex indefinite, with the definite article co-expressing the possessive relation with the preposition of, as in the similar the son of Tom where likewise there is no uniqueness implied. The man with a brown hat has nothing to do with a genitive however and this alternative explanation of the definite article must be rejected. In (154) there is, however, definiteness. This is captured by letting the quantifier outscope the head noun in updating. Once the quantifier is present, ‘the father’—due to the functionality of father—will be a definite representation within the scope of the quantifier, but the son will not, since it would need the unlikely extra assumption that son is functional in the full domain of colleagues. A full development of what interpretations are available in what linguistic circumstances needs to make a distinction between singular and plural nouns and between classes of nouns: sortal, relational or functional. Sortal nouns can be used to reidentify objects of the sort expressed by the noun, they can pick objects meeting the sort out of a given set of objects (subsectional anaphora), and finally, they can set up a new class of objects of the sort. This gives a disjunctive representation as in (155). The first two options give a definite representation, the last one a definite or indefinite plural or else a singular indefinite (if one does not count sun and earth as sortal nouns). The definite plural picks out the whole class determined by the sort. Subsectional anaphora will be represented as (X :: focus : crash Sort) Subset and denote the subset of Xs which are of the sort Sort. (155) would be the general treatment of a sortal noun, expressing a concept Sort.
172
chapter five
(155) (Sort::cg:crash)Id; (X::focus:crash Sort)Subset; Sort
The first disjunct looks for an instance of Sort in the common ground (a preference for recent instances and tolerance for mild variation are or should be part of the interpretation algorithm of Chapter 4 Interpretation) and lets the referent of that instance be the referent of the current noun. The second disjunct tries to find a set currently in focus which contains objects of sort Sort and makes those objects into the referent of the representation. The final option is to denote some set of objects of sort Sort. The results of the first and second disjunct are necessarily definite but an indefinite determiner and a partitive construction can make the referent an indefinite part of these sets. The third disjunct is only definite if it is the full class of objects given by the sort. Functional and relational nouns expressing a concept Function or Relation can also pick up a given class of objects, can do subsection, or map a given class to their relata. (156) ((Z::cg:crash)Function::cg:crash)Id; (X::focus:crash, (Y::argument;focus:standard)Function)Subset; (X::(argument;focus):crash)Function
The first option finds a representation (Z) Function in the common ground and picks up its reference. The second option finds a superset X and a function argument Y and forms a representation that denotes the set of objects that bear Function to an element of Y and are members of X. The third and last option gives the image under Function of the members of its relation argument8 that is explicitly or implicitly given. All three disjuncts give definite representations. Relational nouns (son, neighbour) expressing a relation Relation give the same three options. (157) ((Z::cg:crash)Relation::cg:crash)Id; (X::focus:crash, (Y::argument;focus:standard)Relation)Subset; (X::(argument;focus):crash)Relation
The third option in the singular does not give a definite representation: it is similar to sortal nouns. Indefinite articles and the partitive construction can turn both relational and functional nouns into sortal nouns again.
8 There are good arguments for letting the implicit argument also include events and situations in focus (the beer at a picnic, the waiter involved in our meal, the rain at our walk). Realising these as syntactic arguments is sometimes problematic.
mental representation
173
(158) A brother of Mary came to see me. (she may have more than one brother) A father came to see me. (implicit plural argument needed) I was surrounded by brothers and sisters. (of an implicit group or individual)
The claim for English is that if the intended interpretation is definite, the NP must be marked as definite, either lexically or by the definite article, unless the NP is possessive, has a genitive argument or a demonstrative determiner. This would be the work of a syntactic max(def) constraint that is satisfied by any of the four markers and restricted in English to proper nouns with the exclusion of names including property names like cleanliness (there are expected idiomatic counterexamples like the Tower and the Alps and possibly the phonologically ambiguous the S/sun). The set of markers and the restriction is open to typological variation and as a matter of fact, languages vary a good deal in what NPs need definite marking. The indefinite marks that a choice is made from a larger set, forcing coercions on nouns that normally have a singleton reference (a moon) and functional nouns (a father). Similarly, a definite marking by the can make a relational noun functional, forcing an accommodation of local functionality. Finally, a sortal can be forced into a functional or relational role. Coercion must therefore be a part of the treatment of nouns and articles. The three cases without a definite article are demonstrative, possessive, and genitive marked NPs. (159) would cover the classic demonstrative with associated pointing. The other uses of demonstratives require further study. (159) (Noun:(Speaker Noun) Point:visual:crash) Id
Genitive and possessive can be schematically represented as (GEN) Noun and (POSS) Noun, with a further decomposition needed to allow for an identification of the relation marked by genitive and possessive. Since the noun is often functional with possessor/genitive marking and anaphoric and subsectional readings are also frequent, the interpretation is often definite. But definiteness is not guaranteed. The possessive or the genitive would make the marking of definiteness superfluous in English and other languages because the presence of the genitive or possessive already is a strong predictor of definiteness. This gives us a reasonably complete characterisation of definite phrases in English. The account, however, also faces a number of minor problems: 1. Not all anaphora pick up the referent of their antecedent. Typically, the problem arises in contrastive pairs as in (160). In such constructions, interpretation is strongly supported by parallelism. From the perspective of this
174
chapter five
book, parallelism can be seen as offering an interpretational shortcut: one amends a copy of the representation of the first element by slotting in representations of the contrasting elements in the second for the corresponding contrasting elements of the first. In the examples below, it would be the mule in the first and the mistress in the second. The result should, as always, pass the production filter. Just like in describing ellipsis, one needs to specify what syntactic requirements an NP minimally needs to fulfil to realise its intended representation, while still being recoverable. In these cases, recoverability is guaranteed by the amendment strategy (new elements are basically the old ones with one feature changed). In the example (160b), it therefore suffices to realise the representation of the second paycheck as a pronoun matching its parallel antecedent (notice that in representing a copy of the indefinite representation of the first element, new indices are automatically given). The pronoun’s presupposition does not play a role in activating the interpretation (which occurs through amendment of the first interpretation) and the interpretation passes the production filter. (160) a. If a farmer owns a donkey, he beats it, if he owns a mule, he treats it kindly. b. The man who gave his paycheck to his wife is wiser than the man who gave it to his mistress.
The claim is therefore that there is a sense in which the use of these pronouns is incorrect: they do not allow for standard direct interpretation. This is confirmed by observing that as soon as the antecedent sentence is changed into one that does not contrast with the second anymore, the pronoun use becomes incorrect. 2. In spoken Italian, one expresses one of his books by il suo libro (the his book).9 This can be explained as an overgeneralisation of the requirement to put definite articles on expressions which are supposed to have a definite interpretation to all possessive marked NPs, a class of NPs that often but not always have definite meaning. It can be argued that the definite article becomes part of the marking of the possessive construction. Similarly, French Je me suis rompu le doigt (I broke me the finger) can be explained in this construction as an overgeneralisation from functionally specified body parts like heads, necks, and noses. (The article becomes part of marking the construction expressing being affected in a body part).
9
Un suo libro is the more cultivated variant and the norm in written Italian.
mental representation
175
Dutch de meeste mensen (most people, with obligatory definite particle) could perhaps be explained by a similar overgeneralisation based on superlatives. In these cases, the definite article should be seen as a device that helps the recognition of the concept expressed by a lexeme(de meeste, definite article+possessive), a construction (to V oneself the body part), a role in which it no longer expresses definiteness of the intended interpretation. They represent an obligatory overuse in which they do not mark definiteness anymore. It is probably right to see occurrences of the as in the son of Tom, the daughter of a colleague or the father of a student as similar cases of overmarking where the combination of the and of is a form of case marking the second NP as a semantic genitive. Beyond these exceptions, definite NPs can thus be defined simply and intuitively as a property of the intended hearer representation of the NP and this can be elaborated into a treatment of the linguistic data. It also makes the typological problem go away: definiteness is an important conceptual division for representations and guiding the hearer in this respect facilitates coordination on objects. It is a simple theory and one wonders why this view on definiteness is not the standard view. Two factors seem to be involved. One is that definiteness must—as it is here—be analysed as a context-dependent notion.10 This context dependency is problematic for theories of definites that work with eternal semantic objects. The other factor is the notion of representation itself that has properties of discourse referents as well as properties of the ways in which they are given. This creates the uniformity that makes anaphoric definites and discourse-new definites definite in the same sense, The discourse referents and the way in which they are given are kept separate and therefore it seems one is dealing with different uses. But the other accounts of definiteness can clearly be made to incorporate the representational account of definiteness given above. One can relativise the semantic universe to a representation CG of the common ground. This representation can be built up by adding the joint information in a
10 The exception is Frege’s proof-theoretical account in Frege (1884). Due to the possibility of premises and assumptions, that is in fact a dynamic account. One could even go further and use the possibility of existential instantiation in proofs to argue that Frege’s account includes modern discourse referents and is a complete account of definiteness. The connection of proof theory with discourse semantics is important for all accounts, but has not been systematically explored.
176
chapter five
Skolemised form to make sure that all existential statements have a witness which can serve as an antecedent for anaphora. On models of the common ground, one can have definitions as definable functions applied to the Skolem objects and other given objects or as (CG provably) unique descriptions in which the Skolem objects can occur as constants. The definitions of objects now form a far larger class and include anaphora to the common ground and functional uses. The additional hidden restrictions and predicates can moreover be recovered as the conceptual content associated with the Skolem objects. A full familiarity account can be developed by closing off the set of discourse referents under definable objects. Familiar objects can be recursively defined by starting with the discourse referents and adding at each recursion step objects obtained by applying the definable functions to objects in the last given level. In this development, the familiarity theory starts resembling the definition theory. The functional closure of the Skolem objects also allows a reconstruction of the functional view. As discussed above, coercion of noun meanings is important in obtaining definite and indefinite representations where required. This is maybe the most important aspect that a semantic theory of definiteness and indefiniteness marking needs to take on board. Based on the above, one could thus provisorily assume that the problem escaped analysis because philosophers did not try to classify the right kind of entities, that is, mental representations that would have characteristics of both concepts and discourse referents. (A representation is an instantiation of a concept, and as such corresponds to the discourse referent of the instantiation.) If a representation is given in a context, its discourse referent is old and referring back to the referent through the representation is always definite. If a representation determines its internal object uniquely in all the possibilities singled out by the context (which makes the representation a function from its arguments to its internal object), it also is definite representation. But both of these notions follow from the simple definition of definiteness that was given at the start of this chapter. It was Frege’s revolt against psychologism that made definiteness problematic. While one would not want logic to be based on introspection, none of the better foundational strategies (proof theoretic, semantic, or argumentation theoretic) used in the foundation of logic seem to require an abandonment of the formal structure of mental representations. The Fregean revolution used natural language and its semantics as a basis on which to build formal languages for investigating mathematics, but it
mental representation
177
also introduced the idea that natural language has objective meanings. This chapter shows that this idea has its set-backs. The more modest view that natural language is a useful tool for coordinating our plans, desires, and understanding of the world around and inside us is perhaps in the end more fruitful.11 5.6. Comparison with Discourse Semantics Why is the above a good approach to mental and semantic representation? The answer given so far, is that it is to a large extent quod semper ab omnibus creditur: it is in line with the traditional concept of mental representation. The fact that it is immediately applicable to visual perception, facial expression, and gestures is a pleasant corollary from that. Moreover, Clark’s observation (Clark, 1996) that transactions conducted by verbal communication normally cannot be understood without integrating the scene and the nonverbal communication, turns this corollary into a strong argument for this approach to mental representation as a semantics for verbal utterances. Second, the account makes it unnecessary to rely on discourse referents or variables. There are occurrences of concepts and integrations of these occurrences into larger representations. Finally, this treatment brings improvements to dynamic semantics. Above, the improvements for propositional attitudes were discussed at some length. The notion of anchoring in DRT—necessary for attitudes within DRT—becomes superfluous because representations are their own anchors. The use of consistent contexts only is an improvement on DRT and the construction of contexts for beliefs and other contexts that can be inconsistent with the embedding context improves on the DRT account of accessibility. From the perspective of this book, the connection with Chapter 4 Interpretation is the crucial argument. If all that happens in interpretation is
11 This is how the author would explain the view of Brouwer (1929). One does not need to be as sceptical as Brouwer about the potential of verbally supported communication in mathematics or other areas. Where one understands properly and one’s interlocutors share the prerequisites, a lot is possible. Whether one likes it or not, the transmission of culture seems to be a decisive factor in human civilisation and technology, and it would seem this gives an evolutionary argument for frequent success in communication. One also does not need to be sceptical about objective meanings: their existence would make an explanation of communicative success easier. The point made here is merely that our theoretical picture of what is going on in verbal communication does not seem to require them to the full extent.
178
chapter five
linking concepts to other concepts and placing them in contexts, there does not seem to be an alternative. The formalism offers a proper way of dealing with quantifiers. This seemed the fundamental problem for an approach to semantic interpretation like the one developed in Chapter 4 Interpretation and was in fact the most serious obstacle in trying to develop the account given there. (161) is a treatment of the standard example Every man loves a woman: (161) Context A is given and a word every is encountered. This creates a new situation with (B, C)Imp in context A, resulting in three contexts: A, B, and C. The next word is man. Syntactic restrictions force the concept Man into context B The following word is loves. A syntactic restriction puts (Man Future) Love into context C (syntactic restriction). Future is a link to the future. The next word is a (which only prevents resolution of the next word woman). Woman can be entered into context A or C and binds the link to the future Future. (B is ruled out by syntactic restrictions.) Close the auxiliary contexts B and C at the end of the utterance.
The result is one of two additions to context A in (162). (162) A + Woman, ([Man],[(Man Woman) Love])Imp A + (Man,[Woman, (Man Woman) Love])Imp
The strong point of this treatment is that the ambiguity emerges just by a lack of contraints. In actual usage, it would presumably be resolved by prior probability. Languages can prefer a reading where the quantifier order follows the surface order (e.g., Dutch),12 but this preference gives way when confronted with conceptual inconsistency as in the Dutch version of (163). (163) A pillar is supporting the bridge on both sides.
The reversed preference would result directly from maximisation of prior probability. Consistency and completeness are demanded of the results. Completeness reconstructs a more general version of the constraint on accommodation in Van der Sandt (1992): no variables can become unbound in an accommodation. In the example above, it effectively rules out that
12 Such a preference could be described by a production constraint (if Q scopes over R, then Q comes before R) or by a monitoring constraint that makes scope a feature that is monitored for.
mental representation
179
Woman could be the sole occupant of C in (163) or that (X Y)Love would occur in A or B. It is useful to look at some more difficult examples. (164) is the original Bach-Peters sentence (from Bach, 1970), which was a successful argument against deletion or reduction theories of pronouns. Its argumentational success seem to imply that people at least understand what the sentence means. To the best of my knowledge, however, there are no semantic theories on the market which can interpret the sentence, which seems to imply that the example is not just a counterexample to deletion theories of pronouns, but also to the standard ways of dealing with the semantics of quantifying expressions in natural language. (164) Every pilot1 who shot at it2 hit every Mig2 that chased him1 .
A purely syntactic treatment (e.g. Montague grammar or DRT) would produce (165) or a variant with the Mig and the pilot exchanged. This does not give a complete context of the result A, since ‘Mig’ in ‘Shoot’ is not bound. (165) A: (B C) Imp B and C open B: Pilot in C: D and E open C: (D,E) Imp D: Mig D: (Mig Pilot) Chase E: (Pilot Mig) Hit B: (Pilot Mig) Shoot
In the theory of this book, it is possible to repair the declaration by ‘rolling out the Mig’ in the structure above: an operation that adds Mig to B in order to interpret the pronoun. This gives the intuitively correct result (166): (166) ([Pilot,Mig,(Pilot,Mig) Shoot] ([(Mig Pilot) Chase],[(Pilot Mig) Hit]) Imp) Imp For every pilot and Mig such that the pilot shot at the Mig, if the Mig chased the pilot, the pilot hit the Mig.
All that it takes for this reading to emerge is some tolerance: lexical material should be allowed to drift upwards as a last resort (or as a first repair if the example is deemed syntactically incorrect). Given the absolute need for completeness, this suffices. Another example of the same phenomenon would be (167). (167) John dated a girl if he liked her.
A syntax-based treatment results in the incomplete (168): (168) ([He like her] [John date a girl]) Imp
180
chapter five
John (a presupposition trigger) will move to the highest level, as a given antecedent or by global accommodation. ‘A girl’, however, is just in a wrong position unless one allows it to select a higher context, as in (169). Notice that this way of proceeding predicts correctly that the example does not allow a non-universal reading for the girl. (169) [John, ([Girl, (John Girl) Like] [(John Girl) Date]) Imp]
All of the arguments discussed here show sufficiently that the approach to semantic representation sketched in this section performs better than the traditional approaches. 5.6.1. From Contexts into Discourse Representation Theory What corresponds with the proper contexts of this chapter in DRT are DRSs, ordered pairs of a set of discourse referents and a set of conditions. While discourse referents are primitive objects and are only labelled by variables, we will disregard this distinction here and assume that the discourse referents are just variables, and that conditions are either complex, i.e., composed of other DRSs and a logical operator, or simple, that is, first order atomic formulas with discourse referents as terms. A closed DRS is a DRS without free occurrences of variables, i.e., every variable x in an occurrence of a formula in a (sub)DRS K is either a discourse referent of K or a discourse referent of a DRS K0 to which K has access. Having access is recursively defined over complex conditions of a DRS K. If K0 occurs in a condition believe(x, K0 ), ¬K0 , K0 ⇒ K1 , K1 ⇒ K0 of K, then K0 has access to K and to the DRSs to which K has access. In addition, in a condition K1 ⇒ K0 , K0 has access to K1 . Proper contexts defined in this chapter determine a DRS with almost the same truth conditions. Let C = [φ1 , …, φn ] be a proper context. A representation φi introduces both a condition and a new discourse referent to C Translation is just a matter of the steps in (170). (170) 1. assigning new variables to each of the φi of C and making them discourse referents of the DRS corresponding to C. 2. replacing φi s occurring in prefixes at any depth of embedding by the discourse referent chosen for φi . 3. performing the same steps 1, 2 and 3 for the context arguments of the representations of C. 4. adapting the syntax of the conditions 5. omitting the presuppositional arguments of the predicates 6. choosing a single representative for identified discourse referents
mental representation
181
The example context of the introduction would give (171), (172) and (173). (171) John Girl, (John Girl [(John Girl)Love]) Tell
(172) ⟨{x, y, z}, {John,x Girl,y, (John Girl [(John Girl)Love]) Tell,z}⟩ (173) ⟨{x, y, z}, {John(x) Girl(y) Tell(z, x, y, ⟨{u}{Love(u, x, y)}⟩})⟩
The translation has exactly the same truth-conditions as the input to the translation process, but “forgets” which representation introduced which discourse referent. That makes the translation non-invertible and makes it impossible to give a direct account of belief ascriptions and interpersonal agreement. The semantics of belief ascriptions therefore has to be different in DRT. One can describe the general design of DRT as based on an early decision to organise mental representations around discourse referents rather than around the representations of these discourse referents (the conditions in DRT). Discourse referents are pointers to objects postulated in existential beliefs and can be used in the account of anaphora that motivates DRT. The need to generalise to anaphora in quantified statements makes it wellmotivated to also use them as the variables in representing quantification. The notion of representation proposed here starts from the idea that the brain does not need to save space by pointers and that anaphora can be dealt with by a direct link to their antecedent. Pointers are needed in a medium of limited dimensionality like a string of characters or a diagram, but there are no arguments to the effect that the brain has a dimensionality problem when it comes to linking two representations. Pointers just solve a practical problem: they save space in a linear medium such as a text. That is also the case here: large examples in the formalism become unwieldy and pointers would help. It is, however, unnecessary to recruit the pointers as variables. The result of this chapter could be stated as follows: one obtains a better notion of the mental representation of information without the discourse referents. The advantages are a closer fit to our introspective intuitions about mental representation and, consequently, a better match between
182
chapter five
verbal communication and visual perception, as well as progress with the semantics of attitudes and in understanding linguistic definiteness. 5.7. Conclusion This chapter argued that there is no need to abandon the structures coming out of Chapter 4 Interpretation, since they can be directly interpreted by a model theory and that the resulting theory has important advantages over other styles of doing discourse semantics. The structures seem naturally close to the traditional notion of mental representation in philosophy. To account for definiteness, the formalisms was developed to meet with all the cases. This needs to be done for other central phenomena in natural language semantics and pragmatics building on earlier approaches, e.g. tense and aspect, voice, and modification. The treatment of definites proposed here suggests that—like case-marking prepositions—they should not be treated as words setting up a concept of their own but rather as constraints on the items they mark, constraints restricting the kind of interpretation these items can receive. It would then follow that their semantic contribution is evaluated by simulated production and that they have no meaning of their own. Following Chapter 4 Interpretation, this chapter rounds up an account of the truth conditional meaning of natural language utterances. One could describe the theory that was developed in Chapters 4 and 5 as an account of truth conditions of natural language expressions in a context, with an actual model of prior probabilities the most important missing piece. Production collaborates with prior probability in selecting full representations for verbal and other signals in a context. This chapter defines the contribution of these representations as an update of the context in which they are selected and the conditions under which such contexts are true. For the purpose of stating the relation with models, the approximate way in which concepts with their arguments are represented is good enough. It also seems correct that though many arguments under selection end up unbound, one always needs to state all possible arguments. For linguistic purposes—from our perspective, the needs of production—it would be necessary to allow for morphology that expresses changes in the set of arguments, both in the number (e.g., for voice or causative morphology) and in the interpretation of the arguments (e.g., morphology promoting the subject to an intentional agent). Dowty’s theory (Dowty, 1990) is eminently suitable for labelling an unlabelled sequence of arguments. The proto-agent
mental representation
183
and proto-patient properties P are schematically provable from conceptual knowledge. (174) (X1 …Xn )A ⊧ P(Xi )
A version of Dowty’s theory would be able to obtain labels that can be interpretable in production algorithms as case assignment and preposition selection. But it cannot deal with the communality between concepts A1 …An which are related by morphology, since those morphological operations have to be interpreted as changing the set of arguments. For this purpose— and there may well be others—, frames would be a better representation format for concepts. Putting this question aside, what was given in the last two chapters is a full account of natural language semantics. This is perhaps the right place to compare these results with Montague’s programme. The most important difference is that our theory was not built on an account of syntactic parsing but on an account of utterance production. This was motivated by the possibility of doing both pragmatic enrichment and disambiguation (perhaps a special case of pragmatic enrichment) by Bayesian interpretation using a combination of production grammar and prior probability of the message in the context. The absence of the rule-to-rule hypothesis is the second significant departure from Montague’s programme, and follows directly from the first. There is just no constituent structure that could be used to key semantic translation. The third point that distinguishes the current proposal from Montague grammar is in the structures built. As in DRT, these are not logical forms or abstract meanings but rather new contexts, interpreted as a new information state of the interpreter. The proposal does not appeal to objective meanings but successful communication implies a strong relation between the intended effect of the speaker and the effect on the hearer if the speaker is successful. This in turn requires that there exists sufficient agreement about the content of the concepts employed in the communication. The fourth difference is that pragmatics does not follow after semantic interpretation but is completely integrated. The status of pragmatics is more extensively discussed in Chapter 6 Final Remarks.
chapter six FINAL REMARKS
6.1. Rounding Off The main goal of this book, discussed in Chapter 1 Introduction, was to explain how it is possible that coordination on speakers’ intentions is the normal case in human verbal communication and how speakers and hearers can achieve it. The book from Chapter 2 Syntax onwards, while leaving many questions open, basically achieves this aim. Chapter 4 Interpretation describes an incremental process that leads to the emergence of the most probable interpretation of verbal utterances, a process that starts with the basic cues in the utterance, its start, its end, its words, morphs and constructions. These cue the concepts with which they are associated and a process of conceptual integration among these concepts and the concepts given in the context creates an integrated representation. At each step, the probability of the concept for the cue in the context, for the link between two concepts is evaluated by consulting prior probability and production probability to determine the activation of the concepts and their links, leading to the selection of a winning interpretation. The algorithm allows the conclusion that hearers can quickly converge on the most probable interpretation of the utterance in the context. Chapter 5 Mental Representation shows that the resulting interpretations can be taken as expressions in a logical representation language which captures the traditional notion of a mental representation. Chapter 2 Syntax makes the case that symbolic grammar allows linear time production of syntactically correct utterances from their semantic specification. This process should be refined by the prior probabilities employed in interpretation to ward off unwanted interpretations by the hearer by various optional and seemingly meaningless marking devices. Chapter 3 Self-Monitoring makes the case for optional marking as an explanation of various formal features (word order, optional morphology, optional particles) of utterances. An integration of monitoring in the algorithm of Chapter 2 Syntax would involve the integration of the features for which monitoring is defined in the semantic representation and a continuous check of their values under prior
186
chapter six
probability. Negative results will then force lexicalisation of the features or change the prominence weights for word order. A proper definition will be left to future research. It is clear however that the integration of monitoring does not need to destroy the quasi-deterministic production algorithm. If the hearer computes the most probable interpretation of the utterance and the speaker is able to ensure that that interpretation normally is the interpretation the speaker intended, coordination is a normal event in verbal communication. Coordination requires similarity in the data that speaker and hearer have at their disposal. For priming data and matching values one has to rely on rough equivalences between speakers. Priming is a stable phenomenon between subjects and the resources which influence it, i.e., frequencies, do not vary much between speakers since speakers have similar experiences and concepts. Concepts, too, can be assumed to be largely shared between speakers and that includes preferences for arguments and antecedents that give matching values. Likewise, simulated production is more or less shared between speakers who share a language. But it is clear that for monitoring and interpretation, the stochastic maxima need to be robust maxima in the sense that winners must win by a margin that undoes the effect of variation between the stochastic models of the speaker and the hearer. This means that the formalisation of Bayesian interpretation needs uncertain probabilities and that speakers and hearers can only use inequalities between probabilities in their decisions if these are beyond uncertainty. 6.2. Computational Linguistics The author entered computational linguistics as a semanticist in the hopeful days when it was still thought possible that one could build large-scale systems which would extract information in a logical format from naturally given texts (e.g. newspapers) using a grammar model in the tradition of Montague grammar. One can have certain reservations about the methods but the goal itself is still far from ridiculous. A logical format comes with a precise semantics. That enables the development of provably correct interfaces to all kinds of applications that can be implemented if they are described precisely. If one could succeed in this enterprise, many of the science-fiction goals of computational linguistics would be within our reach. The year 2001 passed quietly and without any version of Hal trying to take over our spaceships.
final remarks
187
But as far as technologies are concerned, computational linguistics became fascinated—some would even say obsessed—with stochastic methods for parsing and translation. Part of this book was devoted to explaining why this is a rational fascination and why it meant progress. It was stochastic parsing that shifted the attention from semantic composition to the ambiguity problem. This book hopes to contribute to the realisation that also in semantics and pragmatics, disambiguation should be in the centre of attention, with lexical semantics and pragmatics having front-line status. Composition can be solved in many ways, but the importance of an account of composition is a function of the use to which one wants to put it, as argued later on in section 6.4. In the conception of this book, natural language interpretation is like other kinds of perception and therefore essentially stochastic. It needs to be so because its task is to find the most probable interpretation within a vast space of possibilities. And finding the most probable interpretation is the essence of the hearer strategy, which in turn is a key element of an explanation of why coordination is the normal case in communication. The book also argued against the idea that traditional phonology, syntax, semantics, and pragmatics should be seen as technologically irrelevant because stochastic models seem to consistently win when it comes to parsing and interpretation. Like the emphasis on composition in semantics and pragmatics, this view in computational linguistics is an obstacle to progress. There is nothing problematic with a linguistics that aims at a general predictive theory of how meanings are coded in language and speech signals in a way that still allows the hearer to recover the intended meaning from an utterance. This is also a necessary ingredient of a cognitively plausible stochastic model of interpretation of verbal signals: it is the model of production probability. And one cannot stress enough that the symbolic linguistic methods that have been employed in natural language generation are quite successful. For the purpose of parsing and interpretation, however, it is a problem that the set of utterances they output are not yet sufficiently complete, i.e. there are naturally occurring utterances that could not be produced by “syntactic realisation”. At this point, it is not clear to what extent stochastic aspects play a role in the linguistic programme of explaining how meaning is coded in sequences of words and in speech. The process where stochastic processes clearly do play a crucial role is self-monitoring as discussed in Chapter 3 SelfMonitoring. There, stochastic preferences in interpretation crucially shape the form of the utterance (and influence the historical development of the relevant languages). What is not known, is to what extent stochastic models play a role in pure syntax and morphology. An interesting and still quite
188
chapter six
defensible hypothesis is that they play no role at all. Under this hypothesis, production without monitoring is just a matter of symbolic rules. It follows from the success of stochastic approaches that one particular computationally inspired enterprise should be regarded as flawed in its motivation: the attempt to do grammatical description in formalisms that allow fast symbolic parsers. These formalisms were developed in response to the emerging insight that efficient symbolic parsing was not possible for the transformational grammars that emerged in the 1950s and 1960s, with Peters and Ritchie (1973) showing that unrestricted transformations lead to untractable parsing. The grammar formalisms developed in response were formalisms such as definite clause grammar, functional unification grammar, lexical functional grammar, generalised phrase structure grammar, unification categorial grammar, combinatory categorial grammar, tree adjoining grammar, and head-driven phrase structure grammar. The enterprise of securing efficient parsing should however be regarded as unsuccessful because the only way to provide a linear parsing algorithm for these formalism is by going stochastic and using an n-best algorithm or—but that seems empirically problematic—a version of Marcus-style deterministic parsing. And yet, linear parsing is essential because without it, one loses the advantage of the superior computational properties one was after in the first place since non-linear parsing algorithms do not explain human performance on the parsing task. Section 6.5 however shows how some of this work can be reinterpreted in the current setup—and this type of reinterpretation is likely be extendable to the other cases. At the same time, it would be hard to overestimate the importance of these formalisms for achieving proper formal description of languages and their importance in achieving computationally testable descriptions. The enterprise may be flawed in its primary motivation, but it is a great success otherwise. Much work in linguistics still suffers from the absence of proper formal testing through computation, i.e. from low standards of formalisation. The goal of obtaining the most probable interpretation is not achievable with Aristotelian grammars. Chapter 1 Introduction, however, also questioned the feasibility of a stochastic pipeline model of interpretation and if that argument is correct, stochastic parsing may be less important than one may think at this moment. As Montague put it, syntax is only interesting as a preliminary step towards interpretation and given that the bulk of the interpretational ambiguities are not syntactic, it is not likely that stochastic parsing would be a central step in the right direction. But stochastic parsing clearly was a methodological step forwards and has enormously increased the insight in what is possible in linguistics with stochastic methods.
final remarks
189
Pragmatics is probably the area where underdetermination of meaning by form is most severe. This can be explained by historical linguistics. The standard assumption in historical linguistics—and one well-confirmed by case studies—is that the earliest words are names for kinds of things and activities (lexical words) and deictical elements (such as this and that). The first two types of words (the lexical words) can be defined ostensively, and that is not possible for the other kind of words (the functional words). Clearly, there is nothing in nature called every or therefore that one could point at in an attempt to state the meaning of a new word. Such words come into being by a grammaticalisation process by which lexical words acquire additional pragmatic meanings. On the part of the source word, this requires properties that allow for the process to start but it also needs a pull: the new function of the word has to be useful. Even so, it is unlikely that complete marking systems for speech acts, or discourse relations could have emerged this way. They would have been the last to emerge and the process whereby they formed loses its pull as soon as it reaches a point where the result is good enough. And, despite the underdetermination of form by meaning, one needs to assume that all human languages are pragmatically adequate, since without the assumption of standard communicative coordination, they would start to disappear. The pull for full pragmatic systems therefore would disappear before they can be achieved. This pragmatic underdetermination has led to a whole tradition which studies pragmatics as an inferential process based on a prior computation of literal meaning. The line of research was started by Grice (1975) but it is equally prominent in relevance theory (Sperber and Wilson, 1995) and Structured Discourse Representation Theory (Asher and Lascarides, 2003). The present book instead proposes an integration of pragmatic and semantic interpretation as an alternative to the pipeline approach where a generalised grammatical model (such as the free categorial grammar of Chapter 4 Interpretation) takes care of everything. It would be wrong, though, to conclude that this is not an inference process because the ‘grammar’ uses probabilistic reasoning. Probability is just an extension of classical reasoning to deal with uncertainty. The crucial difference is that sets of full interpretations are compared by prior and production probabilities. Because of this, ambiguities from any level are resolved in just one single round of comparison. This is not a new idea. The abduction approach of Hobbs et al. (1990) has the same property and Kehler (2002)’s holistic approach to the integration of discourse structure with pronominal and temporal resolution is a clear application of the idea. In this book, the idea is merely radicalised and
190
chapter six
applied also to other ambiguities. As such, it offers an inference model that is simpler than generalised abduction. Goals So in what way could computational linguistics reach the goal of finding precise representations of the intended content of utterances? The considerations of the preceding chapters suggest the following programme: 1. Syntax Classical syntax and morphology should be developed so as to provide a complete mapping from semantic representations to syntactic forms in a quasi-deterministic way. Syntax and morphology should become complete in the sense of the production function being surjective with respect to representative corpora of language use. An interesting—and still open—theoretical question is whether such models should incorporate stochastic rules to achieve completeness or to predict stochastic features of these representative corpora. Complete production grammars currently do not exist but they would be essential for the determination of production probabilities in interpretation. 2. Monitoring Monitoring is more important for proper utterance generation than it is for interpretation. It should be noted that overmarking by ‘pessimistic’ monitoring does not disturb the interpretation process while undermarking by overly ‘optimistic’ monitoring will lead to misunderstanding. Monitoring is not directly relevant to the goal of extracting logically interpretable information from linguistic text but it plays a crucial role in achieving the further goal of developing conversational agents. 3. Interpretation Interpretation presents computational linguists with three tasks: the development of an account of concepts, cues, and prior probabilities. a. Concepts If concepts are—as this book claims—the backbone of interpretation, one needs to get hold of them. In this kind of research, semantics plays a central role. The concepts expressed by functional concepts cannot be learnt. They need to be hand-written, and that includes the silently expressed concepts needed for turns, utterances, and interjections. Systematic lexical semantics, which in conjunction with automatic learning would be needed to deal with the
final remarks
191
overwhelming majority of lexical words in human languages, is still a largely unexplored area. And then there is also a need for better computational representations of these concepts. If these areas of research develop in the right direction, disciplinary distinctions between semantics, computational linguistics, ontology extraction, semantic web and word-sense disambiguation will largely disappear. What one needs is structure, hierarchy, and prediction from concepts to their arguments, and from contexts to the concepts that words will express in those contexts. b. Cues Within an enterprise that deals with concepts, it would be hard to avoid dealing with cues as well. Words often serve as cues for many concepts. Word-sense disambiguation techniques predict which concept is most probable in a context and the profile of arguments can be reliably estimated from corpora. c. Prior probabilities The task of building a stochastic model of the world and its inhabitants seems overwhelming. On the other hand, some powerful components are already either available or close at hand. ci. Theorem proving Theorem provers are good at finding zeros and ones in prior probabilities and can be a way of boosting interpretation as shown, e.g., by the work of Bos (2003) which checks the Stalnakerian assertion conditions. These conditions interact directly with ambiguities at each level of linguistic analysis. cii. Stochastic selection restrictions Stochastic selection restrictions can be estimated from annotated corpora. They form a powerful filter on argument binding and pronoun resolutions. Automatic learning of selection restrictions is within the state of the art (Brockmann and Lapata, 2003). Type logic can be used for selection restrictions as shown by Asher (2011). ciii. Stochastic communication theory would be an estimate of the most likely move in a conversational situation. Developing such a theory seems straightforward. civ. Conceptual knowledge as embodied in Wordnet, ontologies and other resources is directly relevant for both theorem proving and selection restrictions. It will be a hard problem to configure these different resources to produce a single prior probability.
None of the enterprises listed above seems, however, beyond what can be done. The conclusion should be that implementation of the theory outlined in the book would be a lot of work but is basically achievable.

This book can also be read as an explanation of why stochastic parsing works. Like other kinds of perception, natural language interpretation is an essentially stochastic process, and estimates of the relevant stochastic parameters would help to reach the intended interpretations or to compute the syntactic reflex of these interpretations, the parse trees. Monitoring explains why this is so successful: speakers try to make sure that the most probable interpretation matches their intended interpretation.

This may help explain the following paradox. Many stochastic parsers deliver two numbers which have the same external interpretation. One is the success rate of the overall system, let us say 81% accuracy. Internally, the probability of a particular parse is also estimated: let us say the most probable parse has a probability of 0.012. These are rather typical numbers, with the accuracy being fairly high and the winning probabilities very low. Both numbers can be seen as answers to the question of how probable it is that the most probable parse is the right one, p(Right Tree|Input), and yet they give very different answers. An important factor is surely that while stochastic parsers are not very good at estimating the probabilities of different parses, they still tend to be right about the winner: they systematically underestimate the degree to which it is a winner. But—I would suggest—monitoring, too, may be an important factor in explaining the discrepancy. The speaker has made sure the most probable parse is the right one, and the parser is sufficiently sensitive to prior probability to see which one it is.

Given that most of the techniques for estimating prior probabilities only work on proper semantic representations and would have only relatively weak reflexes on syntactic structure, it is hard for the proposals in this book to have a direct impact on stochastic parsing. The use of a realistic estimate of production probabilities in parsing, however, is a different story. It is this element that may potentially allow a better estimate of the syntactic prior probabilities and lead to better stochastic models of parsing.
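To fix ideas, here is a schematic sketch of the role such production probabilities could play: candidate analyses are reranked by the product of a production probability and a prior, a Bayesian score in the sense used throughout this book. The candidates and numbers are invented for illustration.

```python
# Schematic Bayesian reranking: each candidate interpretation is scored by
# p(utterance | interpretation), an estimate of how likely the speaker was
# to produce the observed form for it, times a prior p(interpretation).
candidates = [
    # (interpretation, p(utterance | interpretation), prior p(interpretation))
    ("reading A", 0.20, 0.30),
    ("reading B", 0.60, 0.01),
    ("reading C", 0.05, 0.05),
]

def rerank(cands):
    scored = [(prod * prior, name) for name, prod, prior in cands]
    z = sum(score for score, _ in scored) or 1.0
    # Normalised posterior over the candidate set, best first.
    return sorted(((score / z, name) for score, name in scored), reverse=True)

for posterior, name in rerank(candidates):
    print(f"{name}: {posterior:.3f}")
```

With these invented numbers, reading A wins despite reading B having the highest production probability, which is the sense in which the prior, and not the form alone, settles the interpretation.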
6.3. Pragmatics

In Chapter 4 Interpretation and Chapter 5 Mental Representation, pragmatics is not treated separately, and the same holds for Chapter 3 Self-Monitoring. Instead, many different decisions regarding interpretation, which are taken there, are guided by pragmatic principles and thereby reduced to grammar—at least if the generalised free categorial grammar of Chapter 4 Interpretation is taken to be a grammar. The pragmatic principles used are still well described by the system of optimality theoretic pragmatics in Zeevat (2009b). This system is abstracted from the presupposition treatments of van der Sandt (1992) and Heim (1983b), closely following Jäger and Blutner (2000), the first treatment of presupposition in optimality theory. Zeevat (2009b) shows that the system given by (175) is in fact an improved version of Gricean pragmatics:

(175) faith > plausible > *new > relevance
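Read schematically, a ranking like (175) evaluates candidate interpretations by strict domination: candidates are compared constraint by constraint in the order given, and a single violation of a higher constraint outweighs any number of violations of lower ones. The sketch below shows this evaluation regime with invented candidates and violation profiles; the constraints themselves are glossed in the discussion that follows.

```python
# Minimal sketch of ranked-constraint evaluation in the spirit of (175).
# Candidates and their violation counts are invented for illustration.

RANKING = ["faith", "plausible", "*new", "relevance"]

candidates = {
    "binding reading": {"faith": 0, "plausible": 0, "*new": 0, "relevance": 1},
    "accommodation reading": {"faith": 0, "plausible": 1, "*new": 1, "relevance": 0},
}

def profile(violations):
    # Violation vector ordered by the ranking; tuple comparison is
    # lexicographic, which implements strict constraint domination.
    return tuple(violations.get(c, 0) for c in RANKING)

optimal = min(candidates, key=lambda name: profile(candidates[name]))
print(optimal)  # -> "binding reading"
```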
Zeevat (2010) demonstrates that rhetorical relations can be integrated into a system of this kind. The system obviously includes the handling of presuppositions. Zeevat (2009b) runs through a typical range of Gricean implicatures and shows how they can be recovered. Given that rhetorical relations and presupposition are not obviously in the scope of Gricean pragmatics, this makes the system an improvement with respect to Grice as a foundation for pragmatics.

The pragmatics is bidirectional due to the constraint faith, which requires that the chosen interpretation should have the given utterance as its optimal realisation. plausible is probability maximisation, which demands that there should not exist an alternative interpretation that is more probable or equally probable. *new enforces maximum integration, and finally, relevance is a principle enforcing that activated questions addressed by the utterance should count as settled by the answer provided by the utterance. Remko Scha (p.c.) observed that the first two constraints are close to a definition of Bayesian interpretation. They come back in the algorithm of Chapter 4 Interpretation as simulated production and the general preference for bindings favoured by prior probability. *new is the cue integration of vision. It is implemented by the categorial formalism: one first tries to bind and only then considers non-binding interpretations if these are allowed.1 relevance is not directly implemented. It could be captured by a general mechanism that would activate questions from earlier utterances as well as the current one and construct links to the future for their answers. This would effectively reduce the part of relevance that is not part of intention recognition to *new. In other words, it would provide better cue integration.

1 Pronouns, obligatory arguments, and particles (non-necessary ingredients of language, due to grammaticalisation) do not permit non-binding interpretations.

Questions can be activated explicitly by the speaker and the hearer but also by other mechanisms. A goal that is activated raises questions about the way it could be reached. Even the choice of a special construction such as John has three sheep instead of John has sheep in answer to the question Does John have animals? will activate a question How many sheep does John have?. A link to the future (resolved to the semantic answer: three) in turn yields a scalar implicature of the form: the precise number of sheep that John has is three.

The innovative aspect of the approach in Chapter 4 Interpretation is that the general description is not just pragmatics but also semantics. Meaningful formal properties of the input (word order, morphemes, and words) are checked by faith through simulated production. In the absence of decisions forced by these features, bindings are decided by probability, and optional bindings are enforced either by *new or—in terms of Chapter 4 Interpretation—by constructing the most probable links for concepts with a stochastic disadvantage for accommodation and other processes that can replace binding.

The reduction of pragmatics to a Bayesian interpretation process is part and parcel of Hobbs’ Interpretation by Abduction (Hobbs et al., 1990). Phenomena that Hobbs discusses in his paper but which have not been explicitly dealt with here could be treated in much the same way as Hobbs treats them. The reverse holds as well. The treatment of presupposition from which the optimality theoretic pragmatics introduced above was abstracted, namely the common core of the views of Heim and van der Sandt, is in fact simply a direct consequence of abduction. What needs to be assumed is that a presupposition trigger requires its presupposition. Any explanation of an utterance with a trigger presupposing P should assign a set of facts T to the utterance such that (176) holds.

(176) T ⊧ P
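Read procedurally, (176) puts a price on how the requirement is met. The toy sketch below anticipates the cost regime spelled out in the next paragraph: a presupposition already entailed by the context is free, while assuming it incurs a cost that grows with the markedness of the accommodation site. The predicates, costs, and context are invented for illustration.

```python
# Toy rendering of (176) with cost-based preferences: binding is free,
# accommodation costs more the more marked the site. All values invented.

CONTEXT = {"has(john, dog)"}          # facts T already available in the context

ACCOMMODATION_COST = {
    "global": 1.0,                     # assume the speaker believes P
    "attitude": 2.0,                   # assume the speaker believes John believes P
    "nonstandard_trigger": 4.0,        # read the trigger as shorthand for P and A
}

def explanation_cost(presupposition, site="global"):
    """Cost of satisfying T |= P: zero if P is already in T (binding),
    otherwise the cost of adding P to T at the given accommodation site."""
    if presupposition in CONTEXT:
        return 0.0
    return ACCOMMODATION_COST[site]

print(explanation_cost("has(john, dog)"))               # 0.0: bind
print(explanation_cost("has(john, cat)"))                # 1.0: accommodate globally
print(explanation_cost("has(john, cat)", "attitude"))    # 2.0: costlier site
```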
If the presupposition P is locally available in T, this comes at zero cost. If it needs to be assumed, that is, added to T, it comes at some cost. This results in the preference for binding over accommodation. The Heim-van der Sandt preference for global accommodation can be captured by the assumption that the cheapest fix is standardly the assumption that the speaker believes P, i.e. by adding P to T. This is a default assignment of costs that can be overridden by any factors that increase the cost of that particular assumption, such as conflicting content, conflicting presuppositions, common ground facts, common ground speaker beliefs, or implicatures of the sentence2—in fact the whole list of cancellation factors listed in Gazdar (1979). More costly assumptions would be, for a trigger in a complement reporting an attitude of John, that the speaker believes that John believes that P, or, in the consequent of an implication, that the speaker believes that P holds under the assumption of the antecedent of the implication. An even more costly assumption would be that the speaker is using the trigger in the nonstandard way as a shorthand for P ∧ A, where A is the normal trigger meaning.

The same pattern holds for the rest of the algorithm of Chapter 4 Interpretation as well. It can be easily implemented in an abductive system.3 What Chapter 4 Interpretation adds to this approach, however, is strict constraints on the nature of the abduction axioms and on their weighting, and on the conditions under which additional assumptions can be made and on the order in which things are to be checked.

2 Heim has only the conflicting common ground facts. The additional preferences for alternative accommodation sites of van der Sandt do not follow from abduction.
3 The statement that a word means one of a set of concepts is a set of abductive axioms, while the formalisation of a concept as a sequence of presuppositions for which links have to be constructed is another set of axioms. Production syntax fits naturally into abduction. A difference is that the treatment of weights must be made context-dependent.

An interesting conceptual point about the pragmatics resulting from Bayesian interpretation—both in an abduction formulation and in the formulation here—is that quite a substantial part, namely probability maximisation and cue integration, applies in equal degree to vision. In vision, as in natural language interpretation, cue integration would prefer interpretations that connect parts of the signal with earlier perceptions by transtemporal identities and causal connections. Probability maximisation would become most pragmatic when what is perceived is the behaviour of fellow humans, and especially where that behaviour is non-verbal communication directed at the perceiving subject. Intention recognition is central for the perception of any kind of intentional behaviour: seeing John trying to catch a bus is seeing John running in the direction of a bus stop and inferring his intention to catch a bus from the fact that one sees a bus approaching that
bus stop. Seeing John holding up his hands in despair, on the other hand, involves the need for a further inference of something that John wants the subject to do, believe, or feel.

6.4. Semantic Compositionality

As noted above, the book does not endorse a semantics that is separate from pragmatics, and it does not want to read off the semantics from syntactic trees by the rule-to-rule hypothesis. Does this amount to a rejection of the compositionality principle? That does not follow at all. Both syntactic trees and pragmatics-free semantics could be recovered from the interpretation that emerges. The recovery could be done in many different ways, so that it is possible to satisfy many preferences regarding the organisation of syntactic trees and/or the set-up of the logical forms. Clearly, one can add another desideratum pertaining to these emergent structures, namely, that they should systematically follow compositionality in the form of the rule-to-rule hypothesis. One could even try to reconstruct the semantic combination processes by spelling them out using lambda-abstraction. And perhaps—but that would be more difficult—one could even reduce the interaction with linguistic and non-linguistic context to an innovative idea about syntactic structure and work out the relation with the now more concrete logical forms in a compositional way.

The point is not whether this would be possible but whether it would help us gain a better insight into how language works. Personally, I am rather doubtful as to the fruitfulness of such enterprises. The correspondence between syntactic operations and semantic operations does not play a causal role in either production or interpretation. One would be hard put as well to assign it a causal role in the other two linguistic processes that need theoretical explanation: learning and language evolution.

From the perspective of the book, one can only partly agree with Frege’s formulation, where compositionality is the claim that the meaning of a complex expression is a function of the meaning of its components and of the way in which they are combined. In this formulation, two other determinants are missing: the prior probability in the context of the propositional meanings that may arise and the contents of the linguistic and non-linguistic contexts that may have to be integrated. What can be maintained is the combination of a radical version of Frege’s contextuality principle (177)
(177) A linguistic expression has a meaning only in the context of a successful verbal utterance.
and a compositional rider (178) on contextuality.

(178) The meaning of an expression is primitive, or given by the context, or a function of the meanings of its constituents. The question which primitive meaning or which element of the context that is, is partly constrained by the lexicon of the language. The question which function combines the meanings of the components is partly determined by the way in which the constituents are combined.
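A schematic rendering of (178) may help to see the three cases the rider distinguishes. The toy lexicon, context, and combination rule below are invented; the sketch is not a worked-out semantics, only an illustration of the division of labour.

```python
# Sketch of the rider in (178): a meaning is primitive, or supplied by the
# context, or some function of the constituents' meanings. All entries are
# invented placeholders.

LEXICON = {"sheep": "SHEEP", "three": "THREE"}   # primitive meanings
CONTEXT = {"he": "JOHN"}                          # contextually given meanings

def meaning(expr, context=CONTEXT):
    if isinstance(expr, str):
        if expr in LEXICON:
            return LEXICON[expr]      # case 1: primitive
        return context[expr]          # case 2: given by the context
    # Case 3: a complex expression; which function combines the parts is
    # partly determined by how they are combined (here: plain application).
    head, *args = expr
    return (meaning(head, context), *[meaning(a, context) for a in args])

print(meaning(("three", "sheep")))   # ('THREE', 'SHEEP')
print(meaning("he"))                 # 'JOHN'
```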
The view on natural language interpretation and production developed in this book does not emphasise constituents (they may be important, but it is an open question to what extent they are needed for an account of production) and similarly, it does not make syntactic combination rules central to explaining why certain semantic objects combine in particular ways: concepts define their interaction possibilities autonomously, and syntax is just one of the constraints that need to be considered. The account also does not support the claim that a combination of verbal meanings by itself leads to interpretation. Rather, it is emphasised that interpretation is systematically underdetermined by form and that the words by themselves do not suffice to determine the intention of the speaker.

If the account in this book is indeed a rebellion against compositionality, it is so by claiming that compositionality does not play an essential role in either production or interpretation, and by extension that its role is not essential for explaining in what way infinitely many meanings can be expressed by a language and inferred by an interpreter. It is not necessary either for explaining that temporally extended visual signals can be interpreted as an unlimited number of experienced events. The approach presented in this book stresses the importance of selecting between different meanings, thus implicitly going against the view that the central problem of semantics is combination. The choice of a particular combination of meanings is just another type of selection, one that may sometimes be determined by syntax.

6.5. LFG 3.0 and PrOT 2.0

The following remark could not have been written before this book was almost completed, but the proposals are solid and interesting ones that represent the conceptual progress that has been achieved since 1974 around formal models of natural language.
LFG 1.0 is the rather brilliant design of Ron Kaplan and Joan Bresnan which overcomes several problems with the versions of transformational grammar that were popular in the early 1970s. It is documented in Kaplan and Bresnan (1982). Chomsky’s arguments for transformations (Chomsky, 1957) are just taken on board. One and the same conceptual structure can be coded in quite different ways within the same language or across different languages. It is on the other hand hard to believe that conceptual structures created by different languages differ from each other to a very substantial extent. They vary only moderately, which is by now shown by extensive research (Boroditsky, 2003; Bowerman and Choi, 2003; Deutscher, 2010). If a concept plays a central role in the grammatical system of the language, this leads to superior application and recognition abilities with respect to that concept for speakers of that language. Apart from that, speakers of different languages can be assumed to entertain roughly the same thoughts, which they express using the conceptualisations their languages provide and which have a good deal in common. The transformational proposal makes good sense because one needs to explain how it is possible that one and the same thought gives rise to a wide variety of formulations of that one thought.

The main objection to the transformational approach met by LFG 1.0 is the problem it has in reversing the process from formulation to interpretation—and this problem is solved by LFG 1.0. Context-free grammars are annotated with feature structure equations which are solved by an equation solver. Most of the syntactic transformations can be captured by lexical rules which allow complex feature equations to be associated with lexical items produced by morphological processes. LFG 1.0 implements an inversion of a version of generative semantics where deep structure is understood as semantic representation (logical form). LFG 1.0 thus solves the problem of the complexity of the inverse processing noted by Peters and Ritchie (1973) while still meeting the explanatory goals behind Chomsky’s transformational proposals. A slight setback is the rather complicated account of generation (Wedekind, 1988), which presumably provides an example of a proposal for NL generation that would resist a linear reformulation. From the perspective of this book, LFG 1.0 adds cue evaluation to context-free parsing. Morphological and word-order cues are captured in an attractive computational format that allows efficient constraint solving.

LFG 2.0 is the optimality theoretic lexical functional grammar (OT-LFG) proposed by Bresnan (Bresnan, 2000). It reverses LFG, turning it again into a production grammar which derives surface forms from argument structures by a system of OT-style constraints. This approach allows a treatment along
the lines of Chapter 2 Syntax, though it is considerably harder to see each constraint as a separate procedure: some constraints have to be combined and others need to be reformulated.4 But assuming that it is possible to deal with OT-LFG in this way, OT-LFG brings linear generation to LFG.

4 Zeevat (2008) discusses this technical issue at more length.

LFG 3.0 is then what should replace LFG 1.0 and LFG 2.0. It is a new optimality-theoretic system consisting of constraint solving defined by the LFG notions of coherence and consistency (two undominated constraints), and the OT pragmatic constraints faith, plausible, *new, and relevance. As in LFG 1.0 and 2.0, coherence prohibits, within a larger semantic representation, semantic representations which are not licensed by concepts that link to them as their arguments or modifiers, as well as concepts that need an argument that is not present. consistency prohibits double bindings of argument places. coherence can be naturally extended to deal with pronouns and presupposition triggers. faith requires a semantic representation to have the input as an optimal realisation. plausible is violated by alternative semantic representations that meet the stronger constraints but have a higher prior probability given the context. *new is violated by alternative semantic representations which identify more material in the representation with material given in the context, and relevance by alternatives that answer more of the activated questions. Give or take some conceptual differences, this amounts to the linear and incremental constraint solver proposed in Chapter 4 Interpretation which works on the concepts cued by the words. The main conceptual differences lie in the labelling of the argument places and in the treatment of quantifiers and operators.

LFG 3.0 constitutes progress on version 1.0 because the cues are all part of faith and thereby can be checked by OT-LFG. Also, this more advanced model integrates pronouns, presupposition triggers, and other pragmatic phenomena and thus represents a principled way of finding the most probable interpretation by the Bayesian method. Proper cues for interpretation defined in context-free rules are replaced by lexical elements, which cue concept representations, the predicates. In other words, in terms of LFG 1.0 a flat structure is assumed throughout. (The trees may re-emerge in OT-LFG simulated production.)

Apart from highlighting that LFG is a kind of categorial grammar as well—and a rather good one—LFG 3.0 improves on the relation between
f-structure and semantic representation, which here becomes full identity. f-structure is now just an underspecification technique or an intermediate level that one may but need not assume. The main insight is the underdetermination of meaning by form, which is optimally captured in OT-LFG by mapping many semantic representations to the same surface structure. LFG 3.0 also improves on LFG 1.0 and 2.0 by incorporating full anaphora and presupposition resolution. This book can be read as giving a version of LFG 3.0 with fewer commitments. It is not principally committed to trees, feature structures, grammatical functions, thematic roles, or even to OT, though it is not opposed to any of these either.

PrOT 2.0

Based on the rat-rad problem, Hale and Reiss (1998) suggest that in phonology, one should stick to production OT, and this proposal naturally extends to full linguistics. Let’s call this production optimality theory 1.0, in short: PrOT1.0. It has advantages when it comes to descriptive work because it allows a direct comparison with observations. Unlike bidirectional OT, PrOT1.0 has the appeal of ‘what you see is what you get’. Not surprisingly, PrOT1.0 also has some disadvantages. It gives only a partial model of production without the bidirectional effects discussed in Chapter 3 Self-Monitoring. It also lacks the grammar-based processing account of interpretation which comes with the neural net interpretation of optimality theory as a special case of harmonic grammar.

This book can be seen as outlining a full model that should replace PrOT1.0, PrOT 2.0, by developing a stochastic interpretation model based on PrOT1.0, which is then also used to provide the necessary bidirectional improvement on the production model of PrOT1.0 and to incorporate disambiguation as well as other pragmatic phenomena, OT learning, and interspeaker coordination. This is as close to one of the goals that motivated this book as one can get. It was unsettling that there should be two independent versions of the form-meaning relation in optimality theory which are both complete characterisations of the form-meaning relation, namely, the relation between inputs and optimal candidates in PrOT1.0 and the independent notion given by interpretational optimisation. One could assume pre-established harmony, but such a step would need motivation that should be more convincing than the corresponding Cartesian doctrine. Moreover, its assumption tends to lead to difficulties such as the rat-rad problem or the emergence of block-
ing where intuitively there is none, as in the McCawley example (179), which in the standard version of BiOT can be interpreted but not produced (Blutner, 2000). (179) Black Bart caused the sheriff to die.
The plausibility of optimality theory in production is easy to motivate: many decisions need to be taken in producing a verbal utterance and optimality theory corresponds to a simple and efficient way of decision-making where the most important concerns take full priority over less important concerns. This is a characterisation of production OT that comes very close to the motivation for systemic networks developed for natural language generation. Moreover, it is more important that there be a system in the decisions (that makes better signals of the verbal utterances) than what particular decisions are made (evolution will prevent systematic bad decisions). But this characterisation does not carry over to interpretation. The hearer should be right in her construction of what the speaker wanted to say. As in other kinds of perception, the outcome matters and overcoming the underdetermination of meaning by form is the name of the game. A nearly mechanic decision mechanism is not the best tool to use for this task and it is unrealistic to assume that evolution would have preferred it over the Bayesian processes employed in other perceptual tasks at a time when these were already fully available. The assumption of Bayesian interpretation processes turns a productionoriented grammar into the core linguistic resource for interpretation and makes it clear that the stochastic models of the world and its inhabitants needed for perception can also do their work in natural language interpretation by helping to find the most probable interpretation. Moreover, finding the most probable interpretation amounts to finding a pragmatically enriched interpretation, the same enrichment that a properly weighted abductive system would come up with under a proper construction of the context and the concepts cued by verbal utterances. The introduction spelled out an argument to the effect that with production adapted to the hearer strategy of finding the most probable interpretation (implemented by automatic self-monitoring), standard coordination on the speaker’s intention becomes likely—though it cannot be guaranteed. In sum, PrOT 2.0 seems viable as a solution to the rat-rad problem and as an explanation of the gap between production and interpretation in adult language, which moreover avoids the problematic assumption of separate production and interpretation grammars. The answer to that problem is
202
chapter six
Bayesian interpretation with the optimality theoretic production grammar providing estimates of production probability. The gap between production and interpretation is predicted because the grammar serves as the upper limit on monitored production and a lower limit on understanding. Moreover, where the grammar does not assign meanings, linguistic cues will be integrated into maximally probable hypotheses. The learning model of Tesar and Smolensky (2000) does not need adaptation: constraint demotions need to follow data ⟨I, O⟩ if the learner would produce O’ instead of O. What is not needed is the initial dominance of markedness constraints over faithfulness constraints, robust parsing comes from elsewhere. There are some additional data to learn, namely, the cues. That, however, does not seem to present any special problem. It is normal association learning, where a cue is associated with what it cues in experience and the frequency with which it predicts what it cues is reflected in its strength. 6.6. Language Evolution If Gil (2005) is right, no grammaticalisation has occurred in Riau Island Indonesian. There are no word order constraints, no morphology, and no functional items. From the perspective of this book, this would mean that the only restrictions on form are those due to higher level generation (Chapter 2) and automatic self-monitoring (Chapter 3). All constraints that lead to factorial typology remain dormant: any attempt to rank them by learning has failed and will keep failing. Grammaticalisation can be seen as two separate processes, namely, the emergence of grammatical constraints and the emergence of functional items. The latter process is addressed in Zeevat (2006b), where it is explained by the possibility of marking an undermarked yet important semantic feature of the input by a lexical item that to some extent cues the feature. This marking strategy increases the probability of coordination and successful coordination will strengthen the activation of its coordination device for future decisions about new but similar coordination problems. If this innovation becomes standard practice, its association with the semantic feature will grow at the cost of its original meaning, which may as a result even completely disappear. This process is closely related to automatic self-monitoring. If automatic self-monitoring looks at such features, it will force employment of the new
final remarks
203
device whenever no better alternatives are present and the centrality of the features for which automatic self-monitoring can be assumed guarantees a high frequency of the new device as a result. It is in the nature of marking that the tendency to mark from automatic self-monitoring changes the probabilities so that the absence of a marker will become an ever stronger signal that the feature does not obtain. That increases the pressure to mark the feature, which in turn leads to more use and an even stronger signal value for the absence of the marker as an indication that the feature is not present. This can result in marking taking place whenever the feature holds in the input. It then becomes a max(f) constraint by automatisation in which simulated interpretation no longer plays any role. This constraint will survive in the way these constraints tend to. It will allow distortions of the original meaning of the marker, allow merger with other markers, and the marker may even disappear by losing more and more of its phonological effects. These events are impossible under monitoring and indicate a proper grammatical process rather than monitoring. The story about fronting constraints (f < x) is very similar. Freezing under monitoring often will force wh-phrases or subjects to come earlier. This may lead to an invariable decision that can be keyed directly from the semantic feature and automatised. Once again this would be the automatisation of a monitoring effect. A similar story holds for the special positions (one and extra) in the OT grammars described in Chapter 2 Syntax. A special position emerges from the rest of the grammar but its filler is fixed by automatised word-order freezing. These grammaticalisations round off what needs to be said about syntax in general. Syntactic constraints are automatisations of monitoring processes that resulted in uniform outcomes. Such monitoring processes have recruited lexical items to become functional, functional items to become even more functional and created morphology and syntactic rules. There may be many functional items and many rules. The system and its history can be very complex. Yet, although they play a role in preventing ambiguity, it would be wrong to place too much importance on these automatisations. Riau Island Indonesian can be used as an argument that communication is quite possible without them. The mechanism that leads to successful coordination on speaker’s intention would still be the combination of hearers going for the most likely interpretation and speakers making sure that the string of lexical items they produce has as its most probable interpretation the intention they had in producing it.
204
chapter six
In going for the most likely interpretation, hearers would have to rely on the frequencies with which intentions are formulated by putting lexical items in the order in which they come in the utterance. If the prominence order in regulating word order is natural, important stochastic generalisations will emerge even in absence of a proper grammar. 6.7. Conceptual Glue An often asked traditional question is how the constituents of a thought or a proposition are put together and why this happens. The answer going back to Frege is that concepts arise by abstraction and that also due to abstraction, concepts are unsaturated and inclined to combine with adjacent semantic objects. Mainstream semantics develops this idea by using the typed lambda calculus: a calculus of abstraction and function application. Presumably, the reason why the function applications happen is that unsaturated concepts are not themselves thoughts and application of a concept to another semantic entity is the only way to arrive at assertions or referential expressions. This view leads to the idea that lexical items can be classified by the type of the concept they express and that syntactic constructions correspond with the ways in which concepts expressed by the constituents to which the construction applies have to be combined, namely, a combinator of the typed lambda calculus (a λ-expression without constants) that can be seen as the meaning of the construction. In heavily grammaticalised languages such as English, there are natural phenomena that correspond to semantic glueing. Many lexical items have obligatory arguments, meaning that they cannot occur without their arguments also overtly occurring in the same clause. There is some promise in the idea of deriving obligatory arguments from the content of the concept since a concept is usually not informative enough without the obligatory argument. This is on the right track but grammaticalisation also needs to be assumed since obligatory arguments are not a linguistically universal feature of languages. A word having no useful meaning without a particular argument explains the high frequency of that argument co-occurring with the word that is a prerequisite for the argument becoming obligatory by grammaticalisation. Arguments as such are a universal feature, but in this case, one is talking about optional arguments. Another phenomenon of the same kind is obligatory anaphora: lexical items that need an antecedent to be meaningful. This is a property of anaphoric pronouns (with the exception of it and there, which can have zero meaning) and the subclass of presuppo-
final remarks
205
sition triggers which do not accommodate. Again, this can be related to the triviality of readings obtained by accommodation, and once again the phenomenon of obligatory antecedents is not universal. Demonstratives and names are the only good candidates for being universal lexical classes which would have the property of having obligatory antecedents. The antecedents do not, however, come from the linguistic context but—in the prototypical cases—from the visual context and the general common ground. So there is a linguistic problem for Frege’s solution to the glue problem. Optional arguments do not support the idea that lexical meanings are unsaturated or for the idea that functional application is necessary to arrive at propositional meaning. Propositional meaning seems already to be available with the word. And neither does optional anaphora. Instead, one has to assume that lexical meanings get integrated whenever this is possible or, better, that interpreters integrate them, whenever this is possible. A comparison with the situation in vision seems helpful. A combination of cued representations (i.e., concepts linked to parts of the signal) in vision integrates into a new representation which would make a stronger prediction about the signal and thus represent a step in reaching a single representation that covers more of the signal. In other words, it would increase both the validity of the perception and its information value. The validity increases because the prediction about the signal becomes stronger and rules out more integrated representations. The information increases since the integration is compatible with fewer possibilities than the isolated representations. This is as it should be: a natural visual signal is a signal caused by a natural scene. It can be collected in a single hierarchically organised representation, where each level of the hierarchy corresponds to a concept integrating other representations. An integration of representations by a concept should be consistent with the relevant aspects of the signal and where it is not, this should lower its activation in favour of other integrations. The prediction coming from the integration of a set of representations is always stronger than the prediction from the representations in the set. If the signal still meets the prediction from the integration, all of the representations integrated in it will gain activation. Cue integration is therefore a rational way of increasing the validity and informativity of vision, where both validity and informativity enhance the chance of survival of the viewer. It rests on the fact that a visual signal comes from a natural scene. In natural language interpretation, the natural scene is replaced by the speaker intention, again a hierarchical and complex integration of constituent representations. This makes cue integration as rational in natural language interpretation as it is in vision: it increases the
206
chapter six
validity of the hypothesis about the speaker’s intention by predicting the utterance and the informativity by excluding possibilities. The glueing problem connects to an important event5 in the evolution of language: the point where two words were first combined into one complex message. One word signals correspond ambiguously to complete messages. (180) lion There is/was a lion/Watch out for the lion, etc wood I/you/we go to/was in/were in the woods, etc lion wood There is/was a lion in the wood/ a/the lion came out of the wood, etc.
The fact that two words are produced together cues the idea that there is a single intention behind them. Combining the representation of wood and lion is standard in vision and one can assume that the hearer of the first two word sentence just integrated them. The utterance itself cued—as for one word utterances—a range of intention types in which the integration should be incorporated. The leap of imagination is therefore the speaker’s: more things can be said more reliably with two word utterances. In this model, the single speaker intention takes the place of the visual scene as a consistent target of cue integration. Vision would also be the model for the integration process: concepts come with slots for other representations, cue integration then fills in slots in concepts with other representations and identifies currently cued representations with old ones. The interpretation of multi-word utterances as integrated representations follows the lead of vision and is reinforced by the same factors that promote it in vision: increased validity and informativity. The principle which shows up in pragmatics as *new reflects the idea that a sequence of words presented together should be interpreted as a single intention in that context. It asks interpreters to identify cued representations with given representations whenever that can be done and does not lead to a dramatic loss of prior probability. It applies to complex concepts, such as grasping, which is itself a cue to a grasper and a grasped object. Identifying a grasper and a grasped object with old or co-presented representations is also an instance of compliance with *new. The emergence of conceptual items that force such identifications instead of merely preferring
5 The other innovation behind human language is the introduction of new words by invention and ostension, something that is likely to have preceded combination of words.
them was created by grammaticalisation. It should not serve in an explanation of why meanings have to be combined in a multi-word utterance, as in Frege’s picture.

This brings us to the mechanics of such combination. Chapter 4 Interpretation claims that presupposition is the basic notion that needs to be refined to accommodation-free presupposition, of which nominal reference is a special case. That is good enough, though the idea of cue integration is more general. Words (and combinations of visual cues) cue concepts, and concepts cue their natural parts: their internal object (which can be identified with that of an already given object), proper parts of the internal objects, their spatial and temporal location (if any), and the fillers of thematic roles that may come with the concept. And all of these are open to identification with given or co-presented objects.
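As a closing illustration, here is a minimal sketch of the slot-filling picture of cue integration just described. The concept, its slots, and the given representations are invented, and a realistic version would weigh candidate identifications by prior probability rather than taking the first available given representation.

```python
# Sketch of slot filling under *new: a cued concept brings slots for its
# natural parts, and integration prefers identifying them with material
# that is already given. All entries are invented placeholders.

CONCEPT_SLOTS = {"grasp": ["grasper", "grasped"]}

given = [{"id": 1, "type": "hunter"}, {"id": 2, "type": "spear"}]  # old representations

def integrate(concept, given_reps):
    """Fill the concept's slots, preferring identification with given material
    and falling back on a new, unidentified referent otherwise."""
    filled = {}
    available = list(given_reps)
    for slot in CONCEPT_SLOTS[concept]:
        filled[slot] = available.pop(0) if available else {"id": "new", "type": "unknown"}
    return {"concept": concept, "slots": filled}

print(integrate("grasp", given))
# e.g. {'concept': 'grasp', 'slots': {'grasper': {'id': 1, 'type': 'hunter'},
#                                     'grasped': {'id': 2, 'type': 'spear'}}}
```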
BIBLIOGRAPHY Aissen, J. (1999). Markedness and subject choice in Optimality Theory. Natural Language and Linguistic Theory, 17:673–711. Aissen, J. (2003a). Differential coding, partial blocking, and bidirectional OT. In Nowak, P. and Yoquelet, C., editors, Proceedings of the Berkeley Linguistic Society, volume 29, pages 1–16. Berkeley Linguistic Society, Berkeley (CA). Aissen, J. (2003b). Differential object marking: Iconicity vs. economy. Natural Language and Linguistic Theory, 21:435–448. Ajdukiewicz, K. (1935). Die syntaktische Konnexität. Studia Philosophica, 1:1–27. Appelt, D.E. (1992). Planning English Sentences. Cambridge University Press, Cambridge. Asher, N. (2011). Lexical Meaning in Context: A Web of Words. Cambridge University Press. Asher, N. and Lascarides, A. (2003). Logics of Conversation. Cambridge University Press, Cambridge. Asher, N. and Wada, H. (1988). A computational account of syntactic, semantic and discourse principles for anaphora resolution. Journal of Semantics, 6:309– 344. Bach, E. (1970). Problominalization. Linguistic Inquiry, 1:121–121. Barsalou, L. (1992). Frames, concepts, and conceptual fields. In Kittay, E. and Lehrer, A., editors, Frames, fields, and contrasts: New essays in semantic and lexical organization, pages 21–74. Lawrence Erlbaum, Hillsdale, New Jersey. Barsalou, L., Simmons, W., Barbey, A., and Wilson, C.D. (2003). Grounding conceptual knowledge in modality-specific systems. Trends in Cognitive Sciences, 7:84– 91. Beaver, D.I. (2004). The optimization of discourse anaphora. Linguistics and Philosophy, 27(1):3–56. Beaver, D. and Lee, H. (2003). Input-output mismatches in OT. In Blutner, R. and Zeevat, H., editors, Pragmatics and Optimality Theory, pages 112–153. Palgrave, Basingstoke. Beaver, D. and Zeevat, H. (2006). Accommodation. In Ramchand, G. and Reiss, C., editors, Oxford Handbook of Linguistic Interfaces, pages 503–538. OUP, Oxford. Blackmer, E.R. and Mitton, J.L. (1991). Theories of monitoring and the timing of repairs in spontaneous speech. Cognition, 39:173–194. Blutner, R. (2000). Some aspects of optimality in natural language interpretation. Journal of Semantics, 17:189–216. Boersma, P. (2001). Phonology-semantics interaction in OT, and its acquisition. In Kirchner, R., Wikeley, W., and Pater, J., editors, Papers in Experimental and Theoretical Linguistics, volume 6, pages 24–35. University of Alberta, Edmonton. Boersma, P. (2007). Some listener-oriented accounts of h-aspiré in French. Lingua, 117:19–89. Boersma, P. and Hayes, B. (2001). Empirical tests of the gradual learning algorithm. Linguistic Inquiry, 32:45–86.
Boroditsky, L. (2003). Linguistic relativity. In Nadel, L., editor, Encyclopedia of cognitive science, pages 917–922. Macmillan, New York. Bos, J. (2003). Implementing the binding and accommodation theory for anaphora resolution and presupposition projection. Computational Linguistics, 29:179–210. Bouma, G. (2008). Starting a Sentence in Dutch: A corpus study of subject and objectfronting. PhD thesis, University of Groningen. Bowerman, M. and Choi, S. (2003). Space under construction: Language-specific spatial categorization in first language acquisition. In Gentner, D. and GoldinMeadow, S., editors, Language in Mind, pages 387–427. MIT Press, Cambridge (Mass). Bresnan, J. (2000). Optimal syntax. In Dekkers, J., van der Leeuw, F., and van de Weijer, J., editors, Optimality Theory: Phonology, Syntax and Acquisition, pages 334– 385. Oxford University Press, Oxford. Bresnan, J. (2001). The emergence of the unmarked pronoun. In Legendre, J.G. and Vikner, S., editors, Optimality-Theoretic Syntax, pages 113–142. MIT press, Cambridge (MA). Bresnan, J., Dingare, S., and Manning, C.D. (2001). Soft constraints mirror hard constraints: Voice and person in English and Lummi. In Proceedings of the LFG ’01 Conference. CSLI, Stanford University. Brockmann, C. and Lapata, M. (2003). Evaluating and combining approaches to selectional preference acquisition. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03), pages 27–34, Budapest. Brouwer, L. (1929). Mathematik, Wissenschaft und Sprache. Monatshefte für Mathematik und Physik, 36:153–164. Carpenter, B. and Penn, G. (1996). Efficient parsing of compiled typed attribute value logic grammars. In Recent Advances in Parsing Technology, pages 145–168. Kluwer, Dordrecht. Casadio, C. (1988). Semantic categories and the development of categorial grammars. In Bach, E., Oehrle, R.T.H., and Wheeler, D., editors, Categorial Grammars and Natural Language Structures, Proceedings of the Connference on Categorial Grammars, Tucson (Arizona) 31-05–2-06 1985, pages 95–123. Reidel, Dordrecht. Choi, H.-W. (1999). Optimizing Structure in Context: Scrambling and Information Structure. CSLI Publications, Stanford University. Chomsky, N. (1957). Syntactic Structures. Mouton, The Hague. Christopherson, P. (1939). The Articles: a study of their theory and use in English. Munksgaard, Copenhagen. Clark, E. and Hecht, B.F. (1983). Comprehension, production, and language acquisition. Annual Review of Psychology, 34:325–349. Clark, H.H. (1996). Using Language. Cambridge University Press, Cambridge. Crain, S. and Steedman, M. (1985). On not being led up the garden path: The use of context by the psychological syntax processor. In Karttunen, L., Dowty, D., and Zwicky, A., editors, Natural Language Parsing: Psychological, Computational and Theoretical Perspectives, pages 320–358. Cambridge University Press, Cambridge. Deutscher, G. (2010). Through the Language Glass. Heinemann, London. Dik, S.C. (1989). The Theory of Functional Grammar. Walter de Gruyter, Berlin.
Dörre, J. and Dorna, M. (1993). CUF—a formalism for linguistic knowledge representation. In Deliverable R.1.2A, DYANA, pages 1–22. IMS, Stuttgart University. Dowty, D. (1990). Thematic Proto-Roles and Argument Selection. Language, 67:547–619. Eckardt, R. and Fraenkel, M. (2012). Particles, maximize presupposition and discourse management. Lingua, 122:1801–1818. Edelberg, W. (1992). Intentional identity and the attitudes. Linguistics and Philosophy, 15:561–596. Evans, G. (1977). Pronouns, quantifiers and relative clauses. The Canadian Journal of Philosophy, 7:467–536. Forsyth, D.A. and Ponce, J. (2009). Computer Vision. A modern approach. PHI Learning, New Delhi. Frank, R. and Satta, G. (1998). Optimality theory and the generative complexity of constraint violability. Computational Linguistics, 24:307–315. Franke, M. (2009). Signal to act: Game theory in pragmatics. PhD thesis, Institute for Logic, Language and Computation, University of Amsterdam. Frege, G. (1884). Die Grundlagen der Arithmetik: eine logisch-mathematische Untersuchung über den Begriff der Zahl. W. Koebner, Breslau. Frege, G. (1892). Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik, 100:25–50. Frey, W. (2005). Pragmatic properties of certain German and English left-peripheral constructions. Linguistics 43. 89–129. Gärtner, H.-M. (2003). On the OT status of unambiguous encoding. In Blutner, R. and Zeevat, H., editors, Pragmatics and Optimality Theory. Palgrave, Basingstoke. Gazdar, G. (1979). Pragmatics: Implicature, Presupposition and Logical Form. Academic Press, New York. Gazdar, G. (1988). Applicability of indexed grammars to natural languages. In Reyle, U. and Rohrer, C., editors, Natural Language Parsing and Linguistic Theories, pages 69–94. Reidel, Dordrecht. Gazdar, G., Klein, E., Pullum, G., and Sag, I. (1985). Generalised phrase structure grammar. Basil Blackwell, Oxford. Geach, P. (1962). Reference and Generality. Cornell University Press, Ithaca (NY). Geurts, B. (1999). Presuppositions and Pronouns. Elsevier, Amsterdam. Gil, D. (2005). Word order without syntactic categories: How Riau Indonesian does it. In Carnie, A., Harley, H., and Dooley, S.A., editors, Verb First. On the syntax of verb-initial languages, pages 243–263. John Benjamins, Amsterdam. Green, G. (1968). On too and either, and not just too and either, either. In B. Darden, Bailey, C., and Davison, A., editors, Papers from the 4th Regional Meeting, pages 22–39. CLS, Chicago. Grice, P. (1957). Meaning. Philosophical Review, 66:377–388. Grice, P. (1975). Logic and conversation. In Cole, P. and Morgan, J., editors, Syntax and Semantics 3: Speech Acts, pages 41–58. Academic Press, New York. Grimshaw, J. (1997). Projection, heads, and optimality. Linguistic Inquiry, 28:373–422. Grosz, B., Joshi, A., and Weinstein, S. (1995). Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203– 226. Gundel, J.K., Hedberg, N., and Zacharski, R. (1993). Cognitive status and the form of referring expressions in discourse. Language, 69:274–307.
Haider, H. (1991). Fakultativ kohärente Infinitivkonstruktionen im Deutschen. Technical Report 17, Wiss. Zentrum der IBM Deutschland, Stuttgart. Hale, M. and Reiss, C. (1998). Formal and empirical arguments concerning phonological acquisition. Linguistic Inquiry, 29:656–683. Hawkins, J. (1978). Definiteness and Indefiniteness. Croom Helm, London. Heim, I. (1982). On the Semantics of Definite and Indefinite Noun Phrases. PhD thesis, University of Massachusetts at Amherst. Heim, I. (1983a). File change semantics and the familiarity theory of definiteness. In Bäuerle, R., Schwarze, C., and von Stechow, A., editors, Meaning, use, and interpretation of language. De Gruyter, Berlin. Heim, I. (1983b). On the projection problem for presuppositions. In Barlow, M., Flickinger, D., and Westcoat, M., editors, Second Annual West Coast Conference on Formal Linguistics, pages 114–126. Stanford University. Heim, I. (2012). Definiteness and indefiniteness. In von Heusinger, K., Maienborn, C., and Portner, P., editors, Semantics: An International Handbook of Natural Language Meaning, volume 2, pages 996–1025. De Gruyter, Berlin. Hendriks, P. and Spenader, J. (2005/2006). When production precedes comprehension: An optimization approach to the acquisition of pronouns. Language Acquisition: A Journal of Developmental Linguistics, 13:319–348. Hobbs, J. (1985). On the coherence and structure of discourse. Technical Report CSLI-85–37, Center for the Study of Language and Information, Stanford University. Hobbs, J., Stickel, M., Appelt, D., and Martin, P. (1990). Interpretation as abduction. Technical Report 499, SRI International, Menlo Park, California. Hogeweg, L. (2009). Word in Process. On the interpretation, acquisition and production of words. PhD thesis, Radboud University Nijmegen. Jacobson, R. (1958/1984). Morphological observations on Slavic declension (the structure of Russian case forms). In Waugh, L.R. and Halle, M., editors, Roman Jakobson. Russian and Slavic grammar: Studies 1931–1981., pages 105–133. Mouton de Gruyter, Berlin. Jäger, G. and Blutner, R. (2000). Against lexical decomposition in syntax. In Blutner, R. and Jäger, G., editors, Studies in Optimality Theory, pages 5–29. University of Potsdam. Janssen, T. (1984). Foundations and Applications of Montague Grammar. PhD thesis, University of Amsterdam. Jasinskaja, K. and Zeevat, H. (2010). Explaining additive, adversative and contrast marking in Russian and English. Revue de Sémantique et Pragmatique, 24:65–91. Kamp, H. (1981). A theory of truth and semantic representation. In Groenendijk, J., Janssen, T., and Stokhof, M., editors, Formal Methods in the Study of Language, Part 1, volume 135, pages 277–322. Mathematical Centre Tracts, Amsterdam. Kamp, H. (1990). Prolegomena to a structural account of belief and other attitudes. In Anderson, C. and Owens, J., editors, Propositional attitudes: the role of content in logic, pages 27–90. CSLI, Stanford University. Kamp, H. and Reyle, U. (1993). From Discourse to Logic: Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer, Dordrecht. Kaplan, D. (1968). Quantifying in. Synthese, 19:178–214.
Kaplan, D. (1989). Demonstratives. In Almog, J., Perry, J., and Wettstein, H., editors, Themes from Kaplan, pages 481–566. Oxford University Press, New York. Kaplan, R. and Bresnan, J. (1982). Lexical-Functional Grammar: A formal system for grammatical representation. In Bresnan, J., editor, The Mental Representation of Grammatical Relations, pages 173–281. MIT Press, Cambridge (Mass.). Karttunen, L. (1989). Radical lexicalism. In Baltin, M.R. and Kroch, A.S., editors, Alternative Conceptions of Phrase Structure, pages 43–65. University of Chicago Press, Chicago. Karttunen, L. (1998). The proper treatment of optimality in computational phonology. In Proceedings of the International Workshop on Finite-State Methods in Natural Language Processing, pages 1–12. Bilkent University, Ankara. Kasper, W., Moens, M., and Zeevat, H. (1992). Anaphora resolution. In Bes, G. and Guillotin, T., editors, A Natural Language and Graphics Interface. Results and Perspectives from the ACORD Project, pages 65–93. Springer, Berlin. Katre, S.M., editor (1987). Astadhyayi of Panini. University of Texas Press, Austin (Texas). Kehler, A. (2002). Coherence, Reference, and the Theory of Grammar. CSLI Publications, Stanford University. Kilner, J.M., Friston, K.J., and Frith, C.D. (2007). Predictive coding: an account of the mirror neuron system. Cognitive processing, 8:159–166. Kiparsky, P. (1973). “Elsewhere” in phonology. In Anderson, S.R. and Kiparsky, P. editors, A Festschrift for Morris Halle, pages 93–107. Holt, New York. Kripke, S.A. (1979). A puzzle about belief. In Margalit, A., editor, Meaning and Use, pages 239–283. Reidel, Dordrecht. Kripke, S.A. (2009). Presupposition and anaphora: Remarks on the formulation of the projection problem. Linguistic Inquiry, 40:367–386. Lambek, J. (1958). The mathematics of sentence structure. American Mathematical Monthly, 65:154–170. Lee, H. (2001a). Markedness and word order freezing. In Sells, P., editor, Formal and Empirical Issues in Optimality-Theoretic Syntax, pages 63–128. CSLI Publications, Stanford University. Lee, H. (2001b). Optimization in Argument Expression and Interpretation: A Unified Approach. PhD thesis, Stanford University. Lenerz, J. (1977). Zur Abfolge nominaler Satzglieder im Deutschen. Narr, Tübingen. Levelt, W.J.M. (1983). Monitoring and self-repair in speech. Cognition, 14:41–104. Liberman, A., Cooper, F., Shankweiler, D., and Studdert-Kennedy, M. (1967). Perception of speech code. Psychological Review, 74:431–461. Löbner, S. (1985). Definites. Journal of Semantics, 4:279–326. Löbner, S. (2011). Concept types and determination. Journal of Semantics, 28:279– 333. Lotto, A., Hickok, G., and Holt, L. (2009). Reflections on mirror neurons and speech perception. Trends in Cognitive Science, 13:110–114. Lyons, C. (1999). Definiteness. Cambridge University Press, Cambridge. Mann, W.C. and Thompson, S. (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text, 8:243–281. Marcus, M. (1978). A Theory of Syntactic Recognition for Natural Language. PhD thesis, MIT, Cambridge (Mass).
Marriott, K. and Meyer, B. (1997). On the classification of visual languages by grammar hierarchies. Journal of Visual Languages and Computing, 8:375– 402. Martin-Löf, P. (1987). Truth of a proposition, evidence of a judgement, validity of a proof. Synthese, 73:407–420. Mattausch, J. (2001). On optimization in discourse generation. ILLC report MoL2001-04, MSc Thesis, University of Amsterdam. Mellish, C. (1988). Implementing systemic classification by unification. Computational Linguistics, 14:40–51. Minio-Paluello, L., editor (1963). Aristotle. Categoriae et Liber de Interpretatione. Clarendon Press, Oxford. Montague, R. (1974). The proper treatment of quantification in ordinary English. In Thomason, R., editor, Formal Philosophy: Selected Papers of Richard Montague, pages 221–242. Yale University Press, New Haven. Nunberg, G. (1995). Transfers of meaning. Journal of Semantics, 12:109–132. Perrier, P. (2005). Control and representations in speech production. In ZAS Papers in Linguistics 40, pages 109–132. ZAS, Berlin. Peters, S. and Ritchie, R. (1973). On the generative power of transformational grammars. Information Sciences, 6:49–83. Pickering, M. and Garrod, S. (2004). Towards a mechanistic psychology of dialogue. Brain and behavioral science, 27:169–190. Pickering, M.J. and Garrod, S. (2007). Do people use language production to make predictions during comprehension? Trends in Cognitive Sciences, 11:105–110. Pierrehumbert, J. and Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In Cohen, P.R., Morgan, J., and Pollack, M.E., editors, Intentions in Communication, pages 271–311. MIT Press, Cambridge (Mass). Polanyi, L. and Scha, R. (1984). A syntactic approach to discourse semantics. In Proceedings of the 10th International Conference on Computational Linguistics, pages 114–126. Stanford University. Pollard, C. and Sag, I.A. (1994). Head-driven Phrase Structure Grammar. University of Chicago Press, Chicago. Postma, A. (2000). Detection of errors during speech production: A review of speech monitoring models. Cognition, 77:97–131. Potts, C. (2003). The Logic of Conventional Implicatures. PhD thesis, University of California at Santa Cruz. Pullum, G.K. and Gazdar, G. (1982). Natural languages and context-free languages. Linguistics and Philosophy, 4(4):471–504. Quine, W.V. (1956). Quantifiers and propositional attitudes. Journal of Philosophy, 53:177–187. Reiter, E. and Dale, R. (2000). Building Natural-Language Generation Systems. Cambridge University Press, Cambridge. Reyle, U., Rossdeutscher, A., and Kamp, H. (2007). Ups and downs in the theory of temporal reference. Linguistics and Philosophy, 30:565–635. Rizzolatti, G. and Craighero, G. (2004). The mirror-neuron system. Annual Review of Neuroscience, 27:169–192. Rooth, M. (1992). A theory of focus interpretation. Natural language semantics, 1:75–116. Russell, B. (1905). On Denoting. Mind, 14:479–493.
Sackokia, M. (2002). New analytical perfects in Modern Georgian. In de Jongh, D., Zeevat, H., and Nilsenova, M., editors, Proceedings of the 3rd and 4th International Tbilisi Symposium on Language, Logic and Computation, pages 1–10. ILLC/ICLC, Amsterdam/Tbilisi.
Sæbø, K.J. (1996). Anaphoric presuppositions and zero anaphora. Linguistics and Philosophy, 19:187–209.
Scha, R. and Polanyi, L. (1988). An augmented context free grammar for discourse. In Proceedings of the 12th International Conference on Computational Linguistics, pages 573–577, Budapest.
Schiffer, S. (1992). Belief ascription. Journal of Philosophy, 89(10):499–521.
Schulte im Walde, S. (2009). The induction of verb frames and verb classes from corpora. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics: An International Handbook. Mouton de Gruyter, Berlin.
Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1):97–124.
Shieber, S.M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8:333–343.
Smolensky, P. (1991). Connectionism, constituency and the language of thought. In Loewer, B.M. and Rey, G., editors, Meaning in Mind: Fodor and His Critics, pages 286–306. Blackwell, Oxford.
Smolensky, P. (1996). On the comprehension/production dilemma in child language. Linguistic Inquiry, 27:720–731.
Solstad, T. (2007). Mehrdeutigkeit und Kontexteinfluss: die Spezifikation kausaler Relationen am Beispiel von ‘durch’. PhD thesis, University of Oslo.
Sperber, D. and Wilson, D. (1995). Relevance: Communication and Cognition (2nd ed.). Blackwell, Oxford.
Stalnaker, R. (1979). Assertion. In Cole, P., editor, Syntax and Semantics, volume 9, pages 315–332. Academic Press, London.
Steedman, M. and Baldridge, J. (2011). Combinatory Categorial Grammar. In Borsley, R. and Borjars, K., editors, Non-Transformational Syntax, pages 181–224. Blackwell, Oxford.
Strawson, P. (1950). On Referring. Mind, 59:320–344.
Strube, M. and Hahn, U. (1999). Functional centering: Grounding referential coherence in information structure. Computational Linguistics, 25(3):309–344.
Taboada, M. and Mann, W.C. (2006). Rhetorical structure theory: Looking back and moving ahead. Discourse Studies, 8:567–588.
Tesar, B. and Smolensky, P. (2000). Learnability in Optimality Theory. MIT Press, Cambridge (Mass.).
Tomasello, M. (1999). The Cultural Origins of Human Cognition. Harvard University Press, Cambridge (Mass.).
Twardowski, K. (1977). On the Content and Object of Presentations. A Psychological Investigation. Martinus Nijhoff, The Hague.
Uszkoreit, H. (1987). Word Order and Constituent Structure in German, volume 8 of CSLI Lecture Notes. CSLI, Stanford University.
Van der Sandt, R. (1992). Presupposition projection as anaphora resolution. Journal of Semantics, 9:333–377.
Veltman, F. (1996). Defaults in update semantics. Journal of Philosophical Logic, 25:221–261.
Webber, B., Stone, M., Joshi, A., and Knott, A. (2003). Anaphora and discourse structure. Computational Linguistics, 29:545–587.
Wedekind, J. (1988). Generation as structure driven derivation. In COLING 1988, pages 732–737. Association for Computational Linguistics.
Winterstein, G. (2010). La Dimension Probabiliste des Marqueurs de Discours. PhD thesis, UFR Linguistique, Université Paris Diderot-Paris 7.
Ytrestøl, G. (2011). Cuteforce—Deep deterministic HPSG parsing. In Proceedings of the 12th International Conference on Parsing Technologies, pages 186–197. Association for Computational Linguistics.
Zeevat, H. (1995). Idiomatic blocking and the Elsewhere Principle. In Everaert, M., editor, Idioms: Structural and Psychological Perspectives, pages 301–316. Psychology Press, London.
Zeevat, H. (1997). The mechanics of the counterpart relation. In Künne, W., editor, Direct Reference, Indexicality, and Propositional Attitudes, pages 155–184. CSLI, Stanford University.
Zeevat, H. (2000). The asymmetry of optimality theoretic syntax and semantics. Journal of Semantics, 17(3):243–262.
Zeevat, H. (2006a). Freezing and marking. Linguistics, 44(5):1097–1111.
Zeevat, H. (2006b). Grammaticalisation and evolution. In Cangelosi, A., Smith, A., and Smith, K., editors, The Evolution of Language, pages 372–379. World Scientific, Singapore.
Zeevat, H. (2007). Simulating recruitment in evolution. In Bouma, G., Krämer, I., and Zwarts, J., editors, Cognitive Foundations of Interpretation, pages 175–194. KNAW, Amsterdam.
Zeevat, H. (2008). Constructive optimality theoretic syntax. In Villadsen, J. and Christiansen, H., editors, Constraints and Language Processing, pages 76–88. ESSLLI 2008, Hamburg.
Zeevat, H. (2009a). “Only” as a mirative particle. Sprache und Datenverarbeitung, 33:179–196.
Zeevat, H. (2009b). Optimal interpretation as an alternative to Gricean pragmatics. In Behrens, B. and Fabricius-Hansen, C., editors, Structuring Information in Discourse: The Explicit/Implicit Dimension, pages 191–216. Oslo Studies in Language. OSLA, Oslo.
Zeevat, H. (2010). Optimal interpretation for rhetorical relations. In Kühnlein, P., Benz, A., and Sidner, C.L., editors, Constraints in Discourse 2, pages 35–60. John Benjamins, Amsterdam.
Zeevat, H. (2011). Rhetorical relations. In Maienborn, C., von Heusinger, K., and Portner, P., editors, Semantics: An International Handbook of Natural Language Meaning, volume 2, pages 946–972. De Gruyter, Berlin.
Zeevat, H. and Jäger, G. (2002). A reinterpretation of syntactic alignment. In de Jongh, D.H.J., Nilsenova, M., and Zeevat, H., editors, Proceedings of the 4th Tbilisi Symposium on Logic, Language and Linguistics, pages 1–15. ILLC/ICLC, Amsterdam/Tbilisi.
INDEX

abduction, x, 2, 136, 189, 190, 194, 195, 195n2, 195n3
abductive pragmatics, x
acceptability judgements, 3, 3n3, 25, 101
accommodation, 101, 110, 111, 113, 120, 123, 151, 166, 169, 173, 178, 180, 194, 195n2, 205
adversative marking, 83
agreement features, 63n15, 64
ALE, 43n3
anaphora resolution, 20, 21, 34, 200
Aristotelian competence grammar (ACG), 1–5, 8–11, 27, 28
Aristotelian grammar, 2, 4, 5, 21, 136, 138, 188
asymmetry, 30, 107, 164
Bach-Peters sentence, 179
Bayes’ theorem, 1, 19
belief attribution, 141, 159, 161
bidirectional optimality theory, 4, 86, 102
bidirectional phonology, 6
bidirectionality, ix, x, 28
bridging, 92, 93, 111, 123, 134, 166
case morphology, 19, 54, 85, 86
categorial grammar, 4, 10, 12, 16, 34, 35, 135, 138, 188, 189, 193, 199
centering theory, 13
character, 61, 110, 111, 113, 116, 181
coercion, 170, 173, 176
cognitive science, xiii, xiv
Comanche, 93
common ground, 21n13, 61, 62, 67, 110n4, 111, 112, 117, 118, 132, 141, 161, 172, 175, 176, 195, 195n2, 205
communicative intention, 37, 78
competence grammar, 1–3, 3n2, 11, 12, 39
complete contexts, 144, 145, 179
compositional semantics, 14
compositionality, 136, 196, 197
computational linguistics, xiii, 3, 36, 186, 187, 190, 191
computer vision, xiii, xv, 1, 5, 20, 22, 120
confirmative marking, 83
context integration, 23
context restriction, 111, 139
contrastive topics, 47, 52, 59, 60, 64n17, 102
constraint resolution
conventional implicature, 123n8
counterpart, 16, 146, 149, 159, 162–165
cue integration, 23, 28, 193–195, 205–207
CUF, 43n3
definite descriptions, 13, 14, 92, 166–168, 168n7, 169, 170
definite representation, 147, 155, 167, 168, 168n7, 169–172, 176
definiteness, 35, 36, 63, 64, 94, 165–168, 171, 173, 175, 175n10, 176, 182
definition theory, 176
demonstrative, 13, 80, 92, 160, 166, 169, 173, 205
derived weight, 60
deterministic, ix, 41, 43, 44, 72, 75, 188
differential case marking, 33, 34, 51, 77, 93
differential subject and object marking, 76, 94
discourse grammar, 122
discourse representation theory (DRS, DRT), x, 35, 85, 86, 109, 110, 141, 142, 148, 149, 151n3, 177, 179–181, 183
distal perception, 26, 30
dynamic semantics, 154, 177
economy constraints, 56, 76, 102
elision, 98
ellipsis, 90–92, 106, 174
Elsewhere Principle, 2, 13, 14, 16, 17
enchainement, 99
English, 7, 8n6, 14, 15, 17, 20, 38, 38n1, 45, 50, 52, 52n10, 53, 54, 59, 60, 65, 65n18, 67, 70, 80–82, 93, 95, 101, 111n5, 118, 124, 139, 151, 168n6, 173, 204
epenthesis, 39
epistemic downtoner, 83
epistemic may, 150
ergative, 94, 95
exhaustive answer, 84, 133
external object, 153, 155, 156, 158, 159, 165, 168
external property, 152, 153, 156
extraposition, 52
Fact, 150
FAITH, 39, 41, 86–89, 193, 194, 199
faithfulness, 6, 86, 92, 202
familiarity theory, 36, 166, 176
field theory, 14
file cards, 165
finite state transducer, 22, 43
Finnish, 136
Foot Feature Principle (FFP), 62, 63, 63n16, 65
free participial clauses, 123
free word order, 45, 51, 85
Frege’s problem, 159
French silent h, 76, 97
French, 14, 17, 34, 81, 95, 97, 97n14, 99, 99n15, 101, 174
functional application, 34, 135, 205
functional grammar, 4, 11, 16, 44, 75
functional theory, 36, 171
garden path effect, 85
Generalised Phrase Structure Grammar (GPSG), 4, 11, 11n8, 12, 14, 16, 38n1, 45, 65, 188
generalised presupposition, x
generative semantics, 16, 44, 75, 198
German, 6, 8, 14, 16–17, 38, 45, 47, 51–55, 58–59, 64–65, 71–73, 80, 82, 85, 86n2, 88
government and binding, 16, 32, 38n1
grammar, ix, xiii, xv, 2, 2n2, 3, 4, 4n5, 5, 7–11, 11n8, 12, 13, 15–17, 21, 25, 27, 32, 33, 38, 39, 42, 44, 45, 51, 54, 57, 72, 91, 93, 103, 135, 136, 139, 183, 185, 186, 188–190, 193, 198, 200–204
grammaticalisation, 19, 54, 55, 60, 95, 100, 101, 110n4, 189, 193n1, 202–204, 207
Gricean implicatures, 193
Head Feature Principle (HFP), 62, 63, 66
head marking, 55, 85, 86
Head-Driven Phrase Structure Grammar (HPSG), 4, 11, 12, 16, 45, 107n1, 188
higher-level production, 32
Hindi, 85
historical linguistics, xiii, xiv, 189
incremental parser, xv
incremental syntax checking, 67, 72
incrementality, 29, 105, 107
indefinite descriptions, 13, 82, 90, 92
information state, 110n3, 125, 141, 147, 151, 183
inherence, 65
interjections, 123, 123n8, 128, 135, 190
internal object, 148, 151, 153, 156, 167, 168, 176, 207
interpretation by abduction, x, 2, 136
interpretational blocking, ix
intersubjective agreement, 141
intonation, 63, 126, 129–131, 135
inversion, 52, 53, 198
Jacobson’s Principle, 136, 137
Japanese, 80, 85, 93
joint completion, 25
Korean, 51, 85, 138
language evolution, 17, 32, 33, 196, 202
language interpretation, 2, 5, 9, 17, 23, 24, 28, 44, 71, 106, 121, 135, 187, 192, 195, 197, 201, 205
language production, xiii, xv, 3, 17, 26, 31, 32, 39, 72, 100
Latin, 17, 51, 80, 81, 85, 95, 136
Lexical Functional Grammar (LFG), 4, 11, 11n8, 12, 16, 35n20, 38n1, 188, 198–200
liaison, 99
linear grammar, 8, 9
linear time complexity, 2, 8, 8n6, 28, 43, 56
linguistic generalisations, x, 2, 4, 12, 13
linguistics, xiii, 2, 3, 11, 14, 25, 39, 100, 187, 188, 200
linked concept, 106, 108, 112–119, 121–124, 126, 138, 142, 147
links to the future, 114, 133, 134, 194
logical omniscience, 159, 160
long distance movement, 47
mental representation, x, xv, 3n2, 32, 35, 36, 45, 64, 66, 69, 72, 106, 108, 110n3, 113, 134, 141, 152–158, 161, 165, 167, 176, 177, 181, 182, 185, 193
metonymy, 20
middle field, 51, 52
minimalist framework, 16
mirative marking, 83
mirror neurons, 2, 24, 26, 27
mirroring, 25
monitoring effect, ixn1, 34, 76, 203
Montague Grammar, 4, 7, 85, 86, 179, 183, 186
most probable interpretation, x, xiv, xv, 1, 5, 9, 18–20, 23, 24, 27, 28, 37, 75, 89, 103, 105, 106, 185–188, 192, 199, 201, 203
motor theory of speech perception, 2, 30, 96
n-best algorithm, 24, 28, 106, 188
natural language generation, ix, 11, 12, 37, 61, 90, 93, 187, 201
natural language semantics, xiii, 69, 108, 136, 141, 182, 183
natural language understanding, 9
neurolinguistics, 3
*NEW, 193, 194, 199, 206
non-idiomatic alternative, ix
non-natural meaning, 2, 29
non-restrictive modifier, 123
non-verbal signals, 23, 121, 128
noun phrase selection, 17
NP-selection, 11, 33, 75–77, 90, 93
obligatory passivisation, 95
optimality theoretic pragmatics, ix, 193, 194
optimality theory, ix, 16, 22, 32–34, 38, 38n1, 39–41, 43, 43n4, 44, 45, 46n5, 55, 56, 64, 68, 69, 72, 73, 73n20, 96, 97, 99, 100, 102, 119, 138, 193, 198–201, 203
optimality-theoretic syntax, 4, 11, 38, 38n1, 42–45, 68, 71, 72, 79, 138
optional discourse markers, 77, 89
OT learning, 23, 200
OT-LFG, 11n8, 198–200
Panini, 11, 13, 14, 16, 17
parity, xiv, 29–31
particle, xiii, 11, 18, 33, 34, 76, 77, 79, 80, 82, 83, 83n7, 84, 94, 110n4, 111, 128, 129, 138, 175, 185, 193n1
partitive construction, 172
perception, x, 5, 26–28, 35
personal pronoun, 13, 14, 64, 93, 158, 166
philosophy of language, xiii
phonological self-repair, 100
pipeline architecture, 23, 31
plausible, 193, 199
Polish, 51, 85, 162
politeness, 90, 91
possessive NP, 173, 174
posterior probability, 20
pragmatic inference, 22
pragmatics, ix, x, xiii, xiv, 2, 28, 36, 69, 106, 120, 123, 124, 152, 167n5, 182, 183, 187, 189, 193–196, 206
presentational construction, 53
presupposition, 20, 34, 35, 106, 109, 110, 110n4, 111–118, 120, 122, 123, 130, 142, 147, 150, 151, 157, 158, 166, 174, 180, 193–195, 195n3, 199, 200, 207
prior probability, xv, 1, 5, 20–24, 26, 28, 31, 32, 34, 79, 109, 114, 116, 119, 120, 122, 156, 178, 182, 183, 185, 190–193, 196, 199, 206
production blocking, ix, 16
production OT, 32–34, 77, 93, 200, 201
production phonology, 16, 41, 97
production probability, 17, 20–24, 26, 34, 35, 37, 72, 73, 73n20, 139, 185, 187, 189, 190, 192, 202
production-comprehension gap, 2
pronoun resolution, ix, x, xiv, 13, 15, 23, 191
pronoun, ix, 13, 18, 33, 34, 51, 90, 92, 111, 111n5, 112, 123, 135, 158, 166, 169, 170, 174, 179, 193n1, 199, 204
proper names, 13, 162
propositional attitude, 110n3, 132, 177
protagonist, 92
psychology of language, xiii
quantification, 69–71, 122, 149, 181
rat-rad cases, ix
referential hierarchy, 13, 90, 91, 93, 169
referring expression, 13, 108
Relative, 151, 168, 170
relevance implicatures, 134, 135
relevance, 193, 199
reverse optimisation, 18, 19
rhetorical relations, ix, 77, 78, 78n6, 79–82, 84, 122, 124, 128, 193
Riau Indonesian, 72
Right Frontier Constraint, 78n6, 126n10
robot soccer, 27
Russian, 33, 54, 80, 81, 85, 86n9, 95, 101
Sanskrit, 11, 95
scalar implicatures, 133, 134, 194
schwa-drop, 97, 99
schwa, 97
SDRT, 122, 189
self-monitoring, xv, 3, 10, 12, 18, 19, 19n11, 28, 32–34, 37–39, 42, 54, 67, 71, 75–77, 79–85, 87, 89, 91–97, 99–103, 106, 107, 117, 169, 185, 187, 193, 200–203
semantic representation, xv, 4, 11, 44, 45, 56, 59, 72, 75, 93, 100, 177, 180, 185, 190, 192, 198–200
signal processing, 17, 20
Silverstein’s generalisation, 93
simulated production, x, 2, 24–27, 29, 44, 70, 105–107, 114, 118, 135, 138, 182, 186, 193, 194, 199
Sioux, 95
sortal nouns, 171, 172
source-experiencer verbs, 54, 55
speech act, 35, 83n7, 110n3, 122, 124, 189
speech perception, 20, 26
stereotypical binding, 111
stochastic free categorial grammar, xv, 72
stochastic selection restrictions, 191
subsectional anaphora, 171
substitutive marking, 83
symbolic parsing, 9, 188
syncretism, 63, 85, 86
syntactic parsing, x, 44, 123, 183
syntactic realisation, 11, 37, 120, 187
synthetic speech, 31n19
systemic grammar, 4, 11, 12, 16, 44, 75
tableau, 41, 48–50
theorem proving, 20, 191
topicalisation, 52
transformational grammar, 4, 11, 16, 44, 188, 198
trivialisation, 110n4, 111, 129
typology, xiii, xiv, 35, 42, 94, 95, 202
underdetermination of meaning by form, xiv, 1, 189, 200, 201
unification grammar, 4, 56, 188
update semantics, 141, 151n3, 154
verb-second, 64n17, 88
visual grammars, 9
visual languages, 9, 35
weak presupposition, 110n4, 129
word-order freezing, 33, 34, 76, 77, 87–89, 93, 95, 203
Wordnet, 191