JANUA LINGUARUM
STUDIA MEMORIAE NICOLAI VAN WIJK DEDICATA
edenda curat
C. H. VAN SCHOONEVELD
Indiana University

Series Minor, 185
APPLICATIONS OF THE MATHEMATICAL THEORY OF LINGUISTICS
by RICHARD TIMON DALY
1974
MOUTON THE HAGUE • PARIS
© Copyright 1974 in The Netherlands. Mouton & Co. N.V., Publishers, The Hague. No part of this book may be translated or reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publishers.
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 73-77387
Printed in Belgium, by NICI, Ghent.
ACKNOWLEDGMENTS
A version of this study was written as a doctoral dissertation under the supervision of Professor Joseph S. Ullian of the Department of Philosophy at Washington University, St. Louis, Missouri. I have benefited greatly from his encouragement and many helpful suggestions throughout the writing of the study. Thanks are due also to Donald Sievert for his careful reading of the manuscript in its later stages, to Mrs. Ruth Jackson who steered me through the many little complications of producing the final typescript, and to Mrs. Helga Levy who typed it. And finally special thanks are due to Robert B. Barrett, Jr., who first invited me to the study of linguistic philosophy and has helped and encouraged me ever since.
TABLE OF CONTENTS

Acknowledgments . . . 5

1. Introduction . . . 9
   1.1. Application in Empirical Linguistics . . . 10
   1.2. Applications in Psychology . . . 13

2. The Mathematical Theory of Languages . . . 16
   2.1. Preliminaries . . . 16
   2.2. The Theory of Grammar . . . 18
   2.3. Automata Theory . . . 22
   2.4. Functions on Sets of Words . . . 31

3. Arguments Against the Adequacy of Finite State Grammar . . . 33
   3.1. Introduction . . . 33
   3.2. Chomsky's Argument . . . 35
   3.3. An Argument by Bar-Hillel and Shamir . . . 44
   3.4. On the Justification of Empirical Premisses . . . 47

4. Arguments Against the Adequacy of Context-Free Grammar . . . 56
   4.1. An Argument by Bar-Hillel and Shamir . . . 57
   4.2. Postal's Argument . . . 61

5. Performance Models . . . 71
   5.1. Chomskian Linguistic Performance Models . . . 71
   5.2. Behaviorist Performance Models . . . 76
   5.3. The Concept of Behavioral Repertory . . . 83

6. Arguments Against the Adequacy of Behaviorist Performance Models . . . 87
   6.1. The Arguments . . . 87
   6.2. Analysis . . . 98

7. Conclusion . . . 111

Bibliography . . . 115

Index . . . 117
1. INTRODUCTION
In the past several decades there has grown up a new field in mathematics, usually called mathematical linguistics. The beginnings of the theory arose out of an interest in the grammatical structure of languages, both natural languages and those used in computer programming, but in more recent years the theory has become a fully developed area of mathematical interest. Questions are posed, definitions introduced, and theorems proved as much for their intrinsic mathematical interest as for their possible application. One of the more important contributors to the theory has been Noam Chomsky, a linguist whose interest in making more precise the notion of a rule of grammar has resulted in a number of contributions to the theory of grammar.¹ Largely under his influence the theory is still applied to problems of an empirical nature. Although a cursory look at the literature may give rise to the illusion that such areas of application include empirical linguistics, linguistic performance models, and language acquisition, further study will reveal that only in the first two of these has any serious, systematic, and sustained attempt been made to apply the mathematical theory. In the present essay I examine in detail those applications of mathematical theory which have been attempted. Because there has been no serious attempt to apply the theory to problems of language acquisition, I confine my discussion to applications to empirical linguistics and linguistic performance models. I argue that none of the applications has been successful.

1 Cf., for example, Noam Chomsky, "On Certain Formal Properties of Grammars", Information and Control, 2 (1959), 137-167.
1.1. APPLICATION IN EMPIRICAL LINGUISTICS
One of Chomsky's most important contributions to linguistics has been his careful sharpening of our conception of a rule of grammar, and of our concept of grammar itself. Older grammars present us with an inventory of examples of the normal constructions in the language, together with a list of exceptions and special cases. The user must rely on his own intuitive knowledge of linguistic structure in applying the information to new cases. Chomsky's idea is that a grammar should contain rules which will enable a user to construct all the sentences in the language, without relying on anything other than the rules of grammar. In other words, a grammar should be a perfectly explicit and formal set of instructions for constructing all the sentences of the language. Such a grammar Chomsky calls a generative grammar. He imposes certain other restrictions on an adequate grammar, but I defer discussion of them until later.

In his early work on grammar Chomsky formulated a type of grammar which he thought met the above condition. A phrase structure grammar, as he called it, was defined as follows. The grammar contains two sets of symbols, the variables and the terminal symbols. The terminal symbols are the vocabulary of the language in question, while the variables are thought of as representing the various grammatical types of the language. For example, a phrase structure grammar of English might have man and hits in the terminal alphabet, and Sentence, Noun Phrase, and Verb Phrase in the variable alphabet. The rules of the grammar are ordered pairs (u, v), where u is a string of symbols from the variable alphabet, and v may be a string from both alphabets. (That is, v may consist of a mixture of symbols from both alphabets.) One of the variables is set aside as the start symbol. A string is generated by the grammar if it is composed entirely of symbols from the terminal alphabet, and can be produced by starting with the start symbol and rewriting successive strings using the rules of the grammar. A rule (u, v) is interpreted as allowing us to rewrite the string u, in whatever context it occurs, as the string v.
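To make the rewriting idea concrete, here is a toy derivation carried out mechanically. The little grammar is invented for illustration only, in the spirit of the man/hits example just given; each step rewrites one occurrence of the string u according to a rule (u, v):

```python
# A toy phrase structure derivation: each rule (u, v) licenses rewriting
# the string u as v in whatever context it occurs; one rule per step.
rules = [
    ('S',  'NP VP'),        # Sentence -> Noun Phrase + Verb Phrase
    ('NP', 'the man'),
    ('VP', 'V NP'),
    ('V',  'hits'),
    ('NP', 'the ball'),
]
string = 'S'
for u, v in rules:
    string = string.replace(u, v, 1)    # rewrite the leftmost occurrence
    print(string)
# NP VP / the man VP / the man V NP / the man hits NP / the man hits the ball
```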
The set of all strings generable in this way is called the language generated by the grammar. The precise mathematical definition of phrase structure grammar, and of the language it generates, is given in Chapter 2.

Chomsky holds that a grammar should not only generate the language for which it is the grammar, but should assign to every string a description of its grammatical structure. There are various plausible ways of doing this with a phrase structure grammar, the most straightforward being to take the sequence of strings produced in the generation process as the structural description. There are other more elegant proposals, but the details are unimportant for our purposes.

On the basis of these two ideas Chomsky formulates two criteria of adequacy for particular grammars. He considers a language to be a set of strings, the sentences of the language, plus the set of structural descriptions for these strings. A grammar is weakly adequate for a given language just in case it generates exactly the sentences of the language. It is strongly adequate if it is weakly adequate and, further, associates with each string its correct structural description. Strong adequacy requires us to assume that the notion of structural description is tolerably clear prior to any discussion of grammar. This is a controversial hypothesis, but fortunately I shall not have much occasion to discuss strong adequacy in the present paper. We shall be confining our attention exclusively to the problem of weak adequacy.

Within the family of phrase structure grammars, Chomsky distinguishes two empirically important subfamilies, the finite state grammars and the context-free grammars. The context-free grammars, he thought, correctly formalized the practice of contemporary linguists of bracketing sentences into nested segments, called immediate constituents. Context-free grammars serve as the generative grammar counterpart to the grammar of immediate constituents commonly employed by the traditional linguists. The finite state grammars, a subfamily of the context-free grammars, seemed to correctly formalize the approach of statistical linguists, who study the probability of the occurrence of a given word in a
sentence, given the occurrence of the words preceding it in the sentence. Both these approaches to the structure of language have been widely thought to be important, and have been much studied by linguists. Chomsky showed that they give rise to two families of grammars which are both of the phrase structure type. Both these types of grammar have been extensively studied mathematically, and a great deal is known about their formal structure. That is, we know something about the sets which such grammars can generate, and can characterize these sets in ways independent of the grammars which generate them.

As is well known, Chomsky does not now think that either of these kinds of grammar is adequate for natural languages, and has proposed his own kind of grammar, called transformational grammar. A transformational grammar is a generative grammar, but in addition to phrase structure rules it contains a special kind of rule, a transformation, which is not a phrase structure rule. The exact description of this kind of grammar need not detain us, since we shall be concerned with it only indirectly. Chomsky's ideas about transformational grammar have attracted a number of supporters among working linguists, enough so that we can now say that there is a definite school of transformational linguistics. These authors maintain, along with Chomsky, that phrase structure grammars, and, in particular, context-free and finite state grammars, are inadequate as theories of natural language.

Chomsky himself has argued that the theories of finite state and context-free grammars are inadequate, by arguing that there is no weakly adequate finite state grammar for English, and that there is no strongly adequate context-free grammar for English. In Chapter 3 I consider Chomsky's argument against the weak adequacy of finite state grammar in detail, as well as another argument to the same effect. The strategies of the two arguments are similar. In each case, the authors define a property on sets of strings, and then show by mathematical arguments, drawing on the mathematical theory of language, that no language with that property can have a weakly adequate finite state grammar. The remainder of each argument consists in showing that English has
the defined property. I shall defend the position that neither of these arguments succeeds in establishing its conclusion. I will not consider Chomsky's argument against the strong adequacy of context-free grammar, because Postal has proposed an argument against the weak adequacy of this kind of grammar. Postal's argument proceeds in a fashion similar to Chomsky's, first isolating a mathematical property such that no language having that property can have a weakly adequate context-free grammar. He then argues that Mohawk, a North American Indian language, has the defined property. I consider Postal's argument, and another by Bar-Hillel to the same effect, in Chapter 4, and show there that they also fail to establish their conclusion.
1.2. APPLICATION IN PSYCHOLOGY
Chomsky has extended his ideas about the nature of language and grammar into the domain of psychology and the philosophy of mind by identifying learning to use a language with learning, or as he calls it internalizing, a grammar for the language. The task of a person learning to speak a language, he thus holds, is comparable to that of a linguist engaged in constructing a grammar for the language. If he is right in his claim that the grammars of natural languages are transformational, then what a person learns in learning a natural language is a transformational grammar.

Unfortunately, the raw linguistic data do not seem to conform to this hypothesis. It is immediately obvious that a record of normal speech does not consist of a sequence of grammatical utterances. There are many changes of plan in mid-sentence, false starts, and so forth. To preserve his thesis, Chomsky introduces a distinction between linguistic competence and linguistic performance. Competence, he says, is the speaker's knowledge of his language, while performance consists of the observed linguistic behavior. This distinction gives rise, for Chomsky and the transformationalists, to the problem of performance, which is to explain how a speaker puts his competence to use in his actual
speech behavior, i.e. to explain the relation between the internalized transformational grammar and the less than grammatical strings we find in a record of normal speech. At the present time, although there is a large amount of data on linguistic performance, there is as yet little theory to account for the data. One way of phrasing what is desired in this regard is to say that we are lacking an adequate linguistic performance model. A performance model, to a first approximation, is a description of an organism which would enable us to calculate the behavioral output of the organism, given its current input.

We should distinguish between a performance model and an acquisition model. The latter is a device which models the learning behavior of the organism. Ideally, a linguistic acquisition model would, upon exposure to the primary linguistic data, produce as output a linguistic performance model. That is, it would 'learn' how to use the language.

The transformational linguists, as a result of their belief that a speaker must internalize a transformational grammar, have felt constrained to defend a non-behavioristic position in psychology. For example, Chomsky, in 1959, published a review of Skinner's book Verbal Behavior, in which he attacks Skinner's naivete in dealing with certain facts of linguistic behavior, especially with regard to problems of reference.² Since then it has been common for transformational linguists to attack behaviorism in linguistics, and behavioristic psychology in general. A number of authors have argued that behaviorist learning theory is too weak to account for the complexity of the learning task solved by normal human speakers, but as none of their arguments turn on applications of the mathematical theory, I won't discuss them in the present paper. Other authors, however, have tried to show that the principles of behaviorism do not permit the construction of adequate performance models for linguistic behavior, and their arguments do make use of the mathematical theory. In Chapters 5 and 6 I examine this latter contention in more detail. In Chapter 5 I try to sort out the difference between a
2 Noam Chomsky, "A Review of Skinner's Verbal Behavior", Language, 35, 1 (1959), 26-58.
Chomskian performance model and a behaviorist performance model. With respect to the latter I propose a formal explication of behaviorist performance models, and illustrate the surprising descriptive power of these models. In Chapter 6 I examine three arguments brought by transformational linguists against the adequacy of the behaviorist performance models to account for the facts of linguistic behavior. I show that they are unsuccessful in establishing their contention, and that their arguments are based on a misapplication and misunderstanding of the mathematical theory of languages.

I examine in detail in the present essay all the arguments in empirical linguistics and psychology which purport to make some serious application of results in the mathematical theory of linguistics. None of them have been successful. I conclude that there has yet to be a significant application of the mathematical theory to an empirical problem.
2. THE MATHEMATICAL THEORY OF LANGUAGES
In this chapter I present, and briefly discuss, some of the main definitions and theorems of what has come to be called the mathematical theory of languages. There is nothing essentially new about the material I cover, and the reader who is familiar with the theory can go on to subsequent chapters without reading this one.¹ I am primarily concerned to treat the subject in this chapter from a purely mathematical perspective, since the purpose of the rest of the dissertation is to discuss the application of the mathematical results to various empirical and philosophical questions, and to assess the merits and demerits of the various applications which have been attempted. I do, however, try to supply the intuitive significance behind some of the more important definitions.
2.1. PRELIMINARIES
I assume in the following that the reader is familiar with the more elementary concepts of set theory. For purposes of mathematical description an alphabet is simply a finite set of objects. Very little need be assumed about the nature of these objects. It has become customary to represent an alphabet by the Greek letter Σ, and I will continue in this tradition. A word over an alphabet is any finite sequence of elements from the alphabet. It is assumed that repetitions are possible, so that there is no upper bound on the length of such sequences.

1 For a more complete exposition of the theory see Seymour Ginsburg, The Mathematical Theory of Context-Free Languages (McGraw-Hill, New York, 1966).
In the chapters which follow, no distinction will be drawn between what counts as a word and what counts as a sentence. The terms have come to be used interchangeably in the development of the theory. Unfortunately, some of the authors we shall discuss sometimes use 'word' in the sense of 'element of the alphabet'. Although this terminology may appear confusing, it should never cause any difficulties in what follows. The basic idea is always that there is some set of elements, sometimes called the alphabet, sometimes called the vocabulary, and it is out of these elements that sequences, sometimes called words, sometimes called sentences, are built up.

Let x = x₁x₂…xₙ and y = y₁y₂…yₙ be words over some alphabet Σ, where the xᵢ and yᵢ are elements of Σ for all 1 ≤ i ≤ n. The length of any word x is the number of places in the sequence. We shall denote the length of x by |x|, where in this case |x| = n. For any two words x and y of the same length, x = y if and only if xᵢ = yᵢ for all 1 ≤ i ≤ n. We shall write xy to represent the concatenation of two words. For two sets of words X and Y, we define XY = {xy | x in X and y in Y}.

It is convenient for theoretical reasons to assume that there is a word of length zero. We shall denote this null word by Λ. It is important to note that while the null word has length zero, it is still to be considered as an object; that is, a set containing just the null word is not the empty set. The important property of the null word is that for any word x, xΛ = Λx = x.

An important function on a set of words is the star operator. Let A be any set of words. We define the star of A, written A*, as follows:

    A⁰ = {Λ}
    Aⁿ⁺¹ = AⁿA
    A* = ⋃ₙ≥₀ Aⁿ

A* is the set of all words that can be generated by concatenating various words in A. When the set is the alphabet Σ, then Σ* is the set of all words writable over that alphabet.
2.2. THE THEORY OF GRAMMAR
The important definition for the theory of grammar is that of a phrase structure grammar.

DEFINITION OF PHRASE STRUCTURE GRAMMAR: a phrase structure grammar is a 4-tuple G = (V, Σ, P, σ), where V is a finite set of symbols (the total vocabulary), Σ ⊂ V is the set of terminal symbols, P is a finite set of ordered pairs (u, v), the productions, where u is a word in V* containing at least one symbol of V − Σ and v is a word in V*, and σ in V − Σ is the start symbol.

DEFINITION OF IMMEDIATE DERIVABILITY: for any x and y in V*, we say that x ⇒ y if and only if there are words z₁, z₂, u, and v such that x = z₁uz₂, y = z₁vz₂, and (u, v) is in P.

This definition is extended in the following way:

DEFINITION OF DERIVABILITY: for any x and y in V*, we say that x ⇒* y if and only if either x = y or there is a sequence of words w₀, …, wₙ such that w₀ = x, wₙ = y, and wᵢ ⇒ wᵢ₊₁ for each 0 ≤ i < n.
With the aid of this definition the concept of a phrase structure language can be defined.

DEFINITION OF PHRASE STRUCTURE LANGUAGE: let G = (V, Σ, P, σ) be a phrase structure grammar. Then the set L(G) = {x in Σ* | σ ⇒* x} is called the phrase structure language generated by the grammar G.

When we say that G generates the language L(G), there is no suggestion, of course, that we have specified or described, in any way, a device which is capable of producing the language L(G). It is a purely mathematical sense of 'generate' appropriate to set-theoretic definitions of sets. The idea behind the definition is quite simple. A string is a member of the set L(G) just in case there is a sequence of words, starting with the start symbol of the grammar, such that each can be derived immediately from the preceding word, and ending with the string in question. It is important to note that it is required by the definition that all the words of the language be strings from Σ*.

As a simple example of a phrase structure grammar and language consider the following. Let G₁ = (V₁, Σ, P₁, A) where V₁ = {A, a, b}, Σ = {a, b}, and the set of productions P₁ is listed below:

    A → aAb
    A → ab

The phrase structure language generated by this grammar is²

    L₁ = L(G₁) = {aⁿbⁿ | n ≥ 1}

A derivation of the word a⁴b⁴ is shown below:

    A ⇒ aAb ⇒ aaAbb ⇒ aaaAbbb ⇒ aaaabbbb

where the first production rule is used to generate the first three strings (after the initial A) and the second production rule is used to generate the last.

2 The notation aⁿ is used to represent the result of concatenating the symbol a with itself n times.
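The generation process for G₁ can be imitated mechanically. The following sketch is illustrative only: it applies the two productions blindly to every sentential form for a bounded number of steps and keeps the terminal words (upper case letters serve as variables in this encoding):

```python
# Enumerate the words of L(G1) derivable within a given number of steps.
def derive(productions, start, steps):
    forms = {start}
    for _ in range(steps):
        new = set()
        for w in forms:
            for u, v in productions:
                i = w.find(u)
                while i != -1:                       # rewrite u in any context
                    new.add(w[:i] + v + w[i + len(u):])
                    i = w.find(u, i + 1)
        forms |= new
    return {w for w in forms if w.islower()}         # terminal words only

P1 = [('A', 'aAb'), ('A', 'ab')]
print(sorted(derive(P1, 'A', 4), key=len))
# ['ab', 'aabb', 'aaabbb', 'aaaabbbb']
```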
It is known that the family of phrase structure languages is equivalent to the family of recursively enumerable sets. This very broad class is thought to be too wide for the useful modeling of grammars for natural languages, so a series of restrictions has been introduced and studied.

DEFINITION OF CONTEXT-SENSITIVE GRAMMAR: a phrase structure grammar is said to be context-sensitive if every production in P has the form uAv → uwv, where A is in V − Σ, u and v are in (V − Σ)*, and w is in V* − {Λ}.

A context-sensitive language is a phrase structure language with at least one context-sensitive grammar which generates it. It is known that the context-sensitive languages are recursive sets.

Chomsky has shown that part of the power of context-sensitive grammars over other types of grammars (which we shall define below) lies in their ability to achieve permutations of strings. That is, while a rule of the form AB → BA is not a context-sensitive rule, we can achieve the same effect with four rules which are:

    AB → AC
    AC → DC
    DC → BC
    BC → BA

Thus in writing grammars, the appearance of a permutation rule does not mean that the language generated is not context-sensitive. For example, consider the language generated by the following grammar G₃ = (V₃, Σ, P₃, σ), where V₃ = {σ, D, A, B, a, b}, Σ = {a, b}, and P₃ contains eighteen productions, arranged in five groups. They include the rules
    σ → aσA    σ → bσB    D → Aa    D → Bb
    aA → aa    bA → ba    aB → ab    bB → bb

together with the four rules of Group III:

    Aa → aA    Ab → bA    Ba → aB    Bb → bB
The rules in Group III are not of the form required for a context-sensitive grammar. However, by exploiting the strategy above, we could replace each of these rules with four context-sensitive rules without changing the generative capacity of the old grammar. The result would be a context-sensitive grammar with thirty rules instead of the eighteen above. The language generated by the above grammar, as the reader can verify for himself, is³

    L₃ = L(G₃) = {xx in Σ* | x in ΣΣ*}

3 In the condition defining the set L₃ the notation ΣΣ* is used rather than just Σ* to ensure that |x| ≥ 1.

The following definition limits the class of phrase structure languages even more severely:

DEFINITION OF CONTEXT-FREE GRAMMAR: a grammar G = (V, Σ, P, σ) is said to be context-free if and only if every production in P is of the form A → w, where A is in V − Σ and w is in V*.

A language is said to be a context-free language if and only if there is at least one context-free grammar which generates it. It is known that L₃ above is not a context-free language. That is, there is no context-free grammar which generates L₃. A simple language which is context-free is

    L₂ = {xxᴿ | x in ΣΣ*}

where xᴿ is the reverse of x. A grammar for L₂ is G₂ = (V₂, Σ, P₂, σ), where V₂ = {σ, a, b} and P₂ contains the productions

    σ → aσa    σ → bσb    σ → aa    σ → bb

The idea behind these restrictions is this: in the case of the context-sensitive rule uAv → uwv, the variable A can be rewritten with w, but only in the context u — v. Context-free rules do not
allow contexts to occur on the left side of the arrow, and hence they allow the variable to be rewritten in any context. A still more severe restriction, our last, is the following.

DEFINITION OF RIGHT LINEAR GRAMMAR: a grammar G = (V, Σ, P, σ) is said to be a right linear grammar if and only if each of its productions has the form A → w or A → wB, where A and B are in V − Σ and w is in Σ*.

A language is said to be a right linear language if and only if there is at least one right linear grammar which generates it. None of the languages defined so far, i.e. L₁, L₂, and L₃, are right linear languages. Right linear grammars are obviously context-free grammars. Hence the right linear languages form a proper subset of the context-free languages.
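For illustration, consider a further example, not one of the languages above: the right linear grammar with V = {σ, T, a, b}, Σ = {a, b}, and productions

    σ → aσ    σ → T    T → bT    T → Λ

generates the set of all words consisting of a block of a's followed by a block of b's. Each production rewrites a variable either as a terminal word alone or as a terminal word followed by a single variable on the right, as the definition requires.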
2.3. AUTOMATA THEORY
An important development in the theory of languages has been the linking together of the theory of phrase structure grammar and automata theory. From the mathematical point of view, an automaton can be thought of as a special way of characterizing sets of words. It is only indirectly that a mathematical automaton and a physical device, say a digital computer, are in any way connected. In this section, I will define two important types of mathematical automata, the finite state automata and the push-down automata, and state the important equivalence theorems linking them to the theory of grammar.

DEFINITION OF A FINITE STATE AUTOMATON: a finite state automaton is a 5-tuple A = (K, Σ, δ, p₀, F) where
(1) K is a finite nonempty set (of states)
(2) Σ is an alphabet (the set of inputs)
(3) δ is a mapping of K × Σ into K (the next state function)
(4) p₀ is a distinguished element of K (the start state)
(5) F is a subset of K (the set of final states).
Intuitively, the idea behind the finite state automaton is as follows. We imagine that the automaton has a finite number of states, and an input tape on which is written an input word over the alphabet Σ. The automaton scans the input word, a symbol at a time, going from state to state according to the next state function. That is, we imagine that the automaton starts in the start state p₀ and scans the first symbol of the input word. It then moves to a new state, which is determined by applying the next state function to p₀ and the first symbol on the tape. From this new state, it reads the next symbol on the input tape, and passes to a third state, and so on, until the end of the input is reached. We say that the automaton accepts the word on the tape if and only if the last state the automaton reaches is in the set F of final states. Otherwise we say that the automaton rejects the input word. A precise mathematical counterpart of the foregoing discussion is the following.

EXTENSION OF THE NEXT STATE FUNCTION: the next state function δ can be extended to map K × Σ* into K as follows. For any p in K, a in Σ, and w in Σ*, let δ(p, Λ) = p, and δ(p, aw) = δ(δ(p, a), w).
We say that the automaton accepts the word on the input tape just in case the result of reading the entire word puts the automaton into one of its final states. Otherwise we say that the automaton rejects the input word.

DEFINITION OF ACCEPTANCE: a word w is accepted by a finite state automaton A = (K, Σ, δ, p₀, F) just in case δ(p₀, w) is in F. The set of words accepted by A is T(A) = {w in Σ* | δ(p₀, w) in F}.
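The extended next state function is easily mirrored in a short program. The following sketch is purely illustrative; the automaton and its state names are invented here, not taken from the text, and accept just the words over {a, b} containing an even number of a's:

```python
# Minimal simulation of a finite state automaton: delta is a table for
# the next state function, extended over whole words symbol by symbol.
def accepts(delta, start, finals, word):
    state = start
    for symbol in word:               # delta(p, aw) = delta(delta(p, a), w)
        state = delta[(state, symbol)]
    return state in finals            # accept just in case delta(p0, w) is in F

delta = {('even', 'a'): 'odd',  ('even', 'b'): 'even',
         ('odd',  'a'): 'even', ('odd',  'b'): 'odd'}

print(accepts(delta, 'even', {'even'}, 'abab'))   # True: two a's
print(accepts(delta, 'even', {'even'}, 'ab'))     # False: one a
```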
The family of finite state automata determines a family of associated sets of words, called regular sets, which has become particularly important to the theory of languages.

DEFINITION OF REGULAR SETS: a set U ⊆ Σ* is called a regular set if and only if there is a finite state automaton A = (K, Σ, δ, p₀, F) such that U = T(A).
The central theorem which links finite automata to the theory of grammar is the following:

(2.1) A set of words is regular if and only if it is a right linear language.⁴

4 For a proof of this cf. Ginsburg, 52.

Thus the family of sets picked out by finite state automata exactly coincides with the family picked out by right linear grammars. The set accepted by a finite state automaton is sometimes called a finite state language. From the above theorem it follows that the terms 'finite state language', 'regular set', and 'right linear language' are all equivalent. In subsequent chapters we will sometimes speak of a finite state grammar, and by this mean a right linear grammar.

In the informal description of a finite state automaton which I gave earlier I made use of the notion of an input tape. I spoke of the automaton as scanning the input tape, reading symbols, moving from one state to the next, and such like. It is important to see that none of this sort of talk is involved in the actual definitions. The essential feature of finite state automata, for language theory, is their usefulness in defining the extended next state function, i.e. in defining a mapping from K × Σ* into K. If finite state automata are going to serve as mathematical models of physical devices which actually do perform certain actions, then more definitions will have to be provided to explain the relation between the various elements of the mathematical automaton and elements of the physical device. It is certainly possible to do this, but it must be realized that it has not been done by the definitions provided within the theory of automata.

There is a generalization of finite state automata which will be useful in a later chapter. It is defined as follows.

DEFINITION OF NON-DETERMINISTIC FINITE STATE AUTOMATON: a non-deterministic finite state automaton is a 5-tuple A = (K, Σ, δ, S₀, F), where
(1) K is a finite nonempty set (of states)
(2) Σ is an alphabet (the set of inputs)
(3) δ is a mapping from K × Σ into P(K) (where P(K) is the power set of K)
(4) S₀ is a subset of K (the set of start states)
(5) F is a subset of K (the set of final states).

The non-deterministic automaton differs from the one defined earlier in having a set of start states, instead of a single one, and in having a next state function which maps pairs from K × Σ onto sets of states instead of a single state. The 'action' of the automaton is usually explained in the following way. The automaton 'chooses' one of the states in the set of start states. It then reads the first input symbol. The next state function provides it with a set of possible next states. It 'selects' one of these as its next state, and reads another input symbol, calculates the set of possible next states using the next state function, 'selects' one of these, and so on to the end of the input string. If it finds itself in a final state, the input word is accepted. If not, it starts all over, making different selections of states. If every possible choice of states leads to a non-final state, the word is rejected. If at least one choice of next states leads to a final state, the word is accepted. This intuitive description has as its formal counterpart the following definition.

DEFINITION OF ACCEPTANCE: let A = (K, Σ, δ, S₀, F) be a non-deterministic finite state automaton. A word w is accepted by A if and only if w = Λ and S₀ ∩ F ≠ ∅, or w = x₁x₂…xₙ, where xᵢ is in Σ for 1 ≤ i ≤ n, and there exists a sequence s₀, s₁, …, sₙ of states in K such that
(1) s₀ is in S₀
(2) sᵢ is in δ(sᵢ₋₁, xᵢ) for 1 ≤ i ≤ n
(3) sₙ is in F
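Non-deterministic acceptance, too, can be pictured in a few lines of code: instead of backtracking over 'choices', one may simply carry along the set of all states reachable so far. The automaton below is again invented for illustration; it accepts just the words over {a, b} ending in ab:

```python
# Acceptance for a non-deterministic finite state automaton: track the
# set of states reachable after each symbol; accept if some reachable
# final state remains when the input is expended.
def nfa_accepts(delta, starts, finals, word):
    states = set(starts)
    for symbol in word:
        states = {q for p in states
                    for q in delta.get((p, symbol), set())}
    return bool(states & finals)

delta = {('s', 'a'): {'s', 'mid'}, ('s', 'b'): {'s'},
         ('mid', 'b'): {'end'}}

print(nfa_accepts(delta, {'s'}, {'end'}, 'bbab'))  # True
print(nfa_accepts(delta, {'s'}, {'end'}, 'ba'))    # False
```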
determined by finite state automata and the family of sets determined by the non-deterministic finite state automata coincide. Thus the generalization does not actually succeed in broadening the class of sets accepted.

A further generalization does, however, finally succeed in producing an automaton whose acceptance capacity is more general than the finite state automata. Intuitively, we add to a non-deterministic finite state automaton an extra tape, called a push-down tape, on which the automaton can read and write symbols from a special alphabet. This tape is called a push-down tape because, informally, we imagine it to be arranged vertically in such a way that whenever a symbol is written on it, all the other symbols are pushed down one space, and whenever a symbol is read from the tape it is always the topmost symbol. In order to read a symbol further down, the automaton must first strip off, so to speak, all the symbols above it. Thus, the first symbol put in will be the last symbol out. The push-down tape is thought of as a specially organized sort of internal memory. A particularly crucial assumption about the push-down tape is that it has unbounded length. The next state function has to be altered to incorporate the use by the automaton of its new push-down tape. The formal definition is as follows.

DEFINITION OF PUSH-DOWN AUTOMATON: a push-down automaton is a 7-tuple M = (K, Σ, Γ, δ, Z₀, p₀, F) where
(1) K is a finite nonempty set (of states)
(2) Σ is a finite alphabet (of inputs)
(3) Γ is a finite alphabet (of push-down symbols)
(4) δ is a function from K × (Σ ∪ {Λ}) × Γ into the finite subsets of K × Γ*
(5) Z₀ is in Γ (the start push-down symbol)
(6) p₀ is in K (the start state)
(7) F is a subset of K (the set of final states).
The idea behind a PDA computation is as follows. At a particular stage in the computation, the automaton is in some state p, has some symbol x to be read next on the input tape, and some symbol
Z topmost on the push-down tape. On the basis of these three items, the next state function yields a set of possible next moves. A move consists of changing to a new state, writing a word from Γ* on the top of the push-down tape, and expending the current input symbol. We suppose that the push-down symbol Z is erased in the process. We imagine that the automaton selects one of the moves made available by the next state function, executes it, and repeats the whole process until the input is completely expended. It is sometimes said, since the next state function does not fix precisely the move to be made, that push-down automata are non-deterministic. As usual, one must be extremely careful in drawing any non-mathematical conclusions from usage of this sort. That the automata theorist is supposing any breach of causal law, for example, is a wholly unwarranted conclusion. The intuitive description of a move is formalized in the definition below.

DEFINITION OF PDA MOVE: for Z in Γ and x in Σ ∪ {Λ}, write (p, xw, yZ) ⊢ (q, w, yz), called a move, if and only if δ(p, x, Z) contains (q, z). The relation ⊢ is extended to cover chains of moves as follows.
(1) For every p, w, z, write (p, w, z) ⊢* (p, w, z).
(2) For z₁, z₂ in Γ* and xᵢ in Σ ∪ {Λ}, 1 ≤ i ≤ k, write (p, x₁…xₖw, z₁) ⊢* (q, w, z₂) if there exists a sequence of states p = p₁, p₂, …, pₖ₊₁ = q in K and push-down words z₁ = y₁, y₂, …, yₖ₊₁ = z₂ in Γ* such that (pᵢ, xᵢ…xₖw, yᵢ) ⊢ (pᵢ₊₁, xᵢ₊₁…xₖw, yᵢ₊₁) for 1 ≤ i ≤ k.
Note that in defining a move, we let x be a symbol in Σ ∪ {Λ}. This means that a possible input to a push-down automaton is the null word. Since the null word occurs between any two contiguous symbols, so to speak, the definition allows the push-down automaton to make a move without actually expending any input. The definition of the set of words accepted by a push-down automaton differs from that given for a finite state automaton, since the next state function does not fully determine each move. The formal definition is as follows:

DEFINITION OF ACCEPTANCE: a word w is accepted by a push-down automaton M = (K, Σ, Γ, δ, Z₀, p₀, F) if and only if
(p₀, w, Z₀) ⊢* (q, Λ, z) for some q in F and z in Γ*. The set of all words accepted by M is denoted by T(M).

The intuitive idea behind the definition is simply that a word is accepted if there is at least one sequence of moves which takes the push-down automaton from the start condition to an empty input tape and a final state.

Again, it is important to be clear about the nature of the object just described. In the formal definitions there is no mention of input tapes, push-down tapes, the moving or scanning of these, going to the next state, or writing on the push-down tape. It is simply a matter of defining a certain relation ⊢*, which holds between ordered triples in K × Σ* × Γ*. This relation is in turn used to define the set of words associated with the 7-tuple of objects called a push-down automaton. What this 7-tuple has to do with actual physical computing devices, organic or otherwise, is far from clear. The informal description, of course, suggests ways of making a push-down automaton serve as a model for a physical device, but a precise explication of how this is to be understood has not yet been given. Serious obstacles lie in the way of such an attempt. For example, in the definition of ⊢, it is tacitly assumed that there is no upper bound on the length of the word on the push-down tape. For a physical correlate of a push-down tape, this is an unrealistic assumption. Hence, for some inputs, it may be impossible for the physical device to actually carry out the computation called for by the definition of ⊢. What effect this has on the acceptance capacities of the device we discuss in more detail in Chapter 6.

As an example of a push-down automaton, consider M = (K, Σ, Γ, δ, Z₀, p₀, F) where K = {p₀, p₁, p₂, p₃}, Σ = {a, b}, Γ = {a, b, A, B, Z₀}, and F = {p₂}. The next state function is shown below (the top of the push-down word is written at the right):

    δ(p₀, a, Z₀) = {(p₀, Z₀A), (p₁, Z₀A)}
    δ(p₀, b, Z₀) = {(p₀, Z₀B), (p₁, Z₀B)}
    δ(p₀, a, A) = {(p₀, Aa), (p₁, Aa)}
    δ(p₀, b, A) = {(p₀, Ab), (p₁, Ab)}
    δ(p₀, a, B) = {(p₀, Ba), (p₁, Ba)}
    δ(p₀, b, B) = {(p₀, Bb), (p₁, Bb)}
    δ(p₀, a, a) = {(p₀, aa), (p₁, aa)}
    δ(p₀, b, a) = {(p₀, ab), (p₁, ab)}
    δ(p₀, a, b) = {(p₀, ba), (p₁, ba)}
    δ(p₀, b, b) = {(p₀, bb), (p₁, bb)}
    δ(p₁, a, a) = {(p₁, Λ)}
    δ(p₁, b, b) = {(p₁, Λ)}
    δ(p₁, a, A) = {(p₂, Λ)}
    δ(p₁, b, B) = {(p₂, Λ)}
    δ(p₂, a, Z₀) = {(p₃, Λ)}
    δ(p₂, b, Z₀) = {(p₃, Λ)}
For all other input triples the next state function yields the empty set. The strategy behind this particular next state function is the following. The automaton reads input symbols and copies them down on the push-down tape, the first symbol in the input string being copied in upper case to mark that fact. While doing this the automaton remains in state p₀. At some point it 'guesses' that the middle of the input string has been reached. It then goes into state p₁, and begins taking symbols off the push-down tape and comparing them, in the reverse order to which they were put on, with the remainder of the input. As long as there is a point by point match, the automaton will have a next move. When it gets to the upper case symbol on the push-down tape, and finds that it matches the input, it goes into the final state p₂. If there is no more input, the word is accepted. If there is more, the automaton goes to the non-final state p₃, from which all further moves are blocked. For a word to be accepted, then, it must have the form xxᴿ where x is a word in ΣΣ*. Hence T(M) = L₂, the context-free language defined in the preceding section. The identity found here is not just coincidence, but is an example of a general correspondence between the sets accepted by push-down automata and context-free languages. The following general theorem is known to be true:

(2.2) For any set of words L, L is a context-free language if and only if there is a push-down automaton M such that L = T(M).⁵

5 For a proof of this cf. Ginsburg, 63 ff.
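The acceptance relation for M can be explored mechanically. The following sketch is an illustrative encoding of the table above, with 'Z' standing for Z₀ and the top of the push-down word at the end of the string; since this particular δ makes no Λ-moves, the search need not consider them:

```python
# Non-deterministic PDA acceptance by exhaustive search over move
# sequences: (p, xw, yZ) |- (q, w, yz) whenever delta(p, x, Z) has (q, z).
def pda_accepts(delta, finals, state, word, stack):
    if not word:
        return state in finals        # input expended in a final state
    if not stack:
        return False
    for q, z in delta.get((state, word[0], stack[-1]), []):
        if pda_accepts(delta, finals, q, word[1:], stack[:-1] + z):
            return True
    return False

def make_delta():                     # the table for M, built by loops
    d = {}
    for x in 'ab':
        d[('p0', x, 'Z')] = [('p0', 'Z' + x.upper()), ('p1', 'Z' + x.upper())]
        for top in 'abAB':
            d[('p0', x, top)] = [('p0', top + x), ('p1', top + x)]
        d[('p1', x, x)] = [('p1', '')]            # pop a matching symbol
        d[('p1', x, x.upper())] = [('p2', '')]    # pop the marked first symbol
        d[('p2', x, 'Z')] = [('p3', '')]          # excess input blocks
    return d

delta = make_delta()
print(pda_accepts(delta, {'p2'}, 'p0', 'abba', 'Z'))   # True:  abba = x x^R, x = ab
print(pda_accepts(delta, {'p2'}, 'p0', 'abab', 'Z'))   # False: abab is not in L2
```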
In the informal description of the strategy behind the push-down automaton described above, I spoke at one point about the automaton 'guessing' that it had reached the middle of the input. There is nothing psychological behind this way of talking. Remember that the point of defining push-down automata is not to describe a computation process, but to facilitate the definition of certain sets of words. Given the way the next state function is arranged in the example, for a particular input, the definition of the move chain relation ⊢* will determine a rather large set of possible sequences of moves, all of which are available. Since we are not interested in what the automaton 'does' (it doesn't, as a matter of fact, 'do' anything) but only in whether or not a sequence of moves to the final state is available, the existence of these other possibilities is of no particular interest. That is, there is no problem as to how the automaton makes its 'guess', because we are not interested in automata as models of computing processes. All talk of guesses, therefore, is merely picturesque language. I stress this point because there is some tendency to speak of these automata as recognition devices, that is, as devices capable of accepting or rejecting actual sentence tokens of some language. It may, for all we know, be possible to use mathematical descriptions of the two kinds of automata we have discussed as descriptions of the computing processes of actual computers. But if this is so, it is not at the moment, at least, clear how we should apply these descriptions to the modeling task. I shall discuss some attempts to do just this, and the attendant problems, in a later chapter.

There have been other types of automata defined and studied. Myhill, for example, has defined a type of automaton called a linear bounded automaton, and has shown that the family of sets accepted by these automata coincides with the family of context-sensitive languages.⁶ And of course, it has long been known that the Turing machine, the prototype automaton in this sort of enterprise, accepts just the recursively enumerable sets, and hence the very general family of phrase structure languages.

6 J. Myhill, "Linear Bounded Automata", Wright Air Development Division, Tech. Note 60-165, 1960. Cf. also Ginsburg, 73, for a definition.
In later chapters we will need automata no more complex than the two I have introduced here, so I shall not discuss these others.

2.4. FUNCTIONS ON SETS OF WORDS
In the chapters that follow we will need some elementary facts about certain operations on sets of words. The class of regular sets, for example, is known to be closed under the Boolean operations of union, intersection, and complement, and, in addition, is also closed under the operations of concatenation and star. That is:

(2.3) If U and V are regular sets, then U ∪ V, U ∩ V, U′, UV, and U* are regular sets also.
The family of context-free languages is not so well behaved. It is closed under union, though not under intersection. However, a somewhat weaker intersection theorem is provable, namely:

(2.4) If L is a context-free language and M is a regular set, then L ∩ M is a context-free language.⁷

7 For a proof of this cf. Ginsburg, 88.

A more interesting operation on sets of words is what is called a generalized sequential machine mapping, or gsm mapping for short. A generalized sequential machine is also an automaton, very similar to a finite state automaton. Instead of having a set of final states, however, it has an extra alphabet Δ, and a function λ which maps pairs in K × Σ into Δ*. The formal definition is:

DEFINITION OF GENERALIZED SEQUENTIAL MACHINE: a generalized sequential machine (a gsm) is a 6-tuple S = (K, Σ, Δ, δ, λ, p₀), where
(1) K is a finite nonempty set (of states)
(2) Σ is an alphabet (of inputs)
(3) Δ is an alphabet (of outputs)
(4) δ is a function from K × Σ into K (the next state function)
(5) λ is a function from K × Σ into Δ* (the output function)
(6) p₀ is a member of K (the start state).

EXTENSION OF NEXT STATE AND OUTPUT FUNCTIONS: for any p in K, a in Σ, and w in Σ*, let δ(p, Λ) = p and λ(p, Λ) = Λ, and let δ(p, aw) = δ(δ(p, a), w) and λ(p, aw) = λ(p, a)λ(δ(p, a), w).

The intuitive idea behind a gsm is the following. We imagine that a gsm has an input tape from which it reads symbols from Σ one at a time, and an output tape on which it can write words from Δ*. The gsm begins in the start state p₀, reads the first input symbol, moves to the next state and writes some word on the output tape. From its new state it reads the next input symbol, moves to a new state and writes another word on the output tape. It continues in this fashion till the input is expended, at which time it has a sequence of symbols from Δ on the output tape. This sequence is the output for that particular input. The foregoing description is formalized in the definition above, and in the following one.

DEFINITION OF GSM MAPPING: let S = (K, Σ, Δ, δ, λ, p₀) be a gsm. The function defined by S(x) = λ(p₀, x) for every x in Σ* is called a gsm mapping.

The mapping, so defined on words, is extended in the usual set-theoretic way to sets of words. An important and useful result involving gsm mappings is the fact that such mappings preserve regular sets and context-free languages. That is:

(2.5) Let L be a context-free language, M a regular set, and S a gsm. Then S(L) is a context-free language, and S(M) is a regular set.⁸

8 For a proof of this cf. Ginsburg, 94.

Theorems (2.3), (2.4), and (2.5) are useful later when trying to formulate strategies for proving that certain sets are not context-free languages.
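The behavior of a gsm mapping is easy to exhibit in code. The one-state machine below is invented for illustration: on the input alphabet {a, b} it doubles every a and erases every b, so that, for example, S(abab) = aaaa:

```python
# A gsm mapping computed symbol by symbol:
# lambda(p, aw) = lambda(p, a) lambda(delta(p, a), w).
def gsm_map(delta, out, start, word):
    state, output = start, ''
    for a in word:
        output += out[(state, a)]     # write a word over the output alphabet
        state = delta[(state, a)]     # move to the next state
    return output

delta = {('q', 'a'): 'q', ('q', 'b'): 'q'}
out   = {('q', 'a'): 'aa', ('q', 'b'): ''}
print(gsm_map(delta, out, 'q', 'abab'))   # aaaa
```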
3. ARGUMENTS AGAINST THE ADEQUACY OF FINITE STATE GRAMMAR
3.1. INTRODUCTION
In the arguments which we consider below, a language is defined as the set of all the grammatical strings which can be formed from the basic vocabulary of the language, together with a structural description of each string. The grammarian's problem is to produce a finite set of rules which explicitly generates the language. In Syntactic Structures, Chomsky distinguishes two different constraints on the set of grammatical rules.¹ First, an adequate grammar must succeed in generating just the right set of strings. A grammar which does this Chomsky calls a weakly adequate grammar. There may, of course, be more than one such grammar for any given language. English, for example, may have a number of different grammars, all of which are weakly adequate. The second constraint is that a given grammar not only generate just the right set of strings, but that it assign to each string its correct structural description. A grammar which does this, Chomsky calls a strongly adequate grammar. Just how one determines what the correct structural description of a sentence is, or whether or not a given grammar provides such a description for all strings it generates, need not detain us here. Both of the arguments which I consider below concern weak adequacy only. When I say that a language is a finite state language, I mean that it has a weakly adequate finite state grammar.

1 Chomsky, Syntactic Structures (Mouton & Co., The Hague, 1957).

In Chapter 2 we discussed a type of grammar called a right linear grammar. There are infinitely many different grammars
belonging to the family of right linear grammars, and each one determines an associated language. We also discussed an equivalent formulation of the family of regular languages in terms of finite state automata. In the arguments below, we shall be concerned with the question of whether or not English has a grammar which is both weakly adequate and belongs to the family of right linear grammars. Equivalently, we shall want to know if there is a finite state automaton which accepts all and only the grammatical strings of English, or, as the authors whom I discuss sometimes put it, if there is a finite state grammar for English.

Both the arguments which I discuss below purport to show that there is no weakly adequate finite state grammar for English. This is a very strong claim, and has the following important theoretical consequence. Linguists would like to characterize in some precise way the notion of a possible human language. For example, an interesting hypothesis would be: every human language is a finite state language. If English has no weakly adequate finite state grammar, however, we will have a straightforward counterexample to this hypothesis. Hence the family of regular languages will not do as a characterization of the family of possible human languages. Thus, if these arguments can be made to go through, an important generalization or theoretical hypothesis of linguistics can be discarded as not fitting the facts.

Both arguments which I discuss follow a common strategy. First a property is defined mathematically which no language generated by a regular grammar can have. This part of the arguments is purely mathematical, in the sense that no appeal is made to specific facts about English or any other natural language. With a few minor exceptions, which I shall mention, I think that the authors handle this part of their arguments correctly. In the second stage in their common strategy, however, serious difficulties arise. The authors attempt to show, more or less rigorously, that English has the offending property, and hence does not belong to the family of finite state languages. I argue that in neither case do the authors establish this fact. At times they make simple factual errors about what is, and is not, a grammatical sentence
of English, and further, they make unwarranted assumptions about specific rules which English must have.
3.2. CHOMSKY'S ARGUMENT

The first argument that I consider was put forward by Noam Chomsky.² The fundamental mathematical idea behind Chomsky's argument is that of m-dependency, which is a three-place relation holding between a sentence S, a non-negative integer m, and a language L. Chomsky's definition is as follows. Suppose that A is the alphabet of language L, that a₁, …, aₘ and b₁, …, bₘ are symbols of this alphabet, and that

    S = x₁a₁x₂a₂…xₘaₘzb₁y₁b₂y₂…bₘyₘ

is a sentence of L. We say that S has an m-dependency with respect to L if and only if there is a unique permutation σ of (1, …, m) meeting this condition: there are c₁, …, c₂ₘ in A such that for each subsequence (i₁, …, iₚ) of (1, …, m), S₁ is not a sentence of L and S₂ is a sentence of L, where S₁ is formed by substituting cⱼ for aⱼ in S for each j in (i₁, …, iₚ), and S₂ is formed by substituting cⱼ for aⱼ and c_{m+σ(j)} for b_{σ(j)} in S for each such j.

1, this would not show that Mohawk could not have a context-free grammar. It could only show that Mohawk was not a finite state language. Secondly, Postal has made no attempt to show that Mohawk contains (as members) the requisite strings with m-dependencies with respect to Mohawk. This construal of the 'dependency' strategy is not going to work either.

Nevertheless, we might try to construct a plausible strategy along lines close to the above. There are three things that must be done. First we must define, in clear and precise terms, the notion of a dependency of type L₃. The definition should be like Chomsky's definition of an m-dependency with respect to a language L. Chomsky defined what it was for a single sentence to have an m-dependency with respect to a given language. We could try an analogous definition of what it is for a sentence S to have an L₃ type dependency of length n with respect to a language L. I don't know specifically how this should be done, nor is it my intention to pursue the exact form of such a definition here. It does seem possible, however, that a useful line to pursue would be to exploit Chomsky's definition itself in the following way. In applying the definition of m-dependency to the language L₃, it is found that
for each m, the permutation of (1, …, m) that works is the identity permutation. This seems to be what puts L₃ beyond the range of context-free grammar. Intuitively, for a device to be able to recognize a language which contains sentences with m-dependencies of this type it must be able to remember items from the input string in the order in which they appeared in the input. But a push-down store must remember items in reverse order. Hence, if the device is to keep track of the n items in the first part of the string, it must do so by means of its state structure, much like a finite state automaton. But since it can have only a finite number of states, there is bound to be a limit on the length of the dependency strings it can discriminate. If there is no limit on these, the language cannot be accepted by a push-down automaton. This is only meant to make plausible the intuition behind the notion of an L₃ type dependency.

If we had a precise definition in hand, a completion of this approach to Postal's program could be the following. We will say that a language L has a dependency set of type L₃ if and only if for every n ≥ 1, there is a sentence of L with a type L₃ dependency of length n with respect to L. This appears to avoid the trivially false route of (4.4) as an explication of the notion of a dependency set. Using this definition we can formulate the following theorem:

(4.17) If a language contains a dependency set of type L₃, then it is not a context-free language.
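The copy language L₃ itself illustrates the intended notion: for each n ≥ 1 the sentence xx with |x| = n depends, position by position, on the matching of the i-th symbol of the first half with the i-th symbol of the second half, and the permutation involved is the identity. Whatever the exact form of the definition, it ought at least to certify these sentences.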
The proof of this theorem may or may not be easy; I do not know. At any rate such a proof is required as the second step in this approach to Postal's problem. The third and final step in this strategy is to show that Mohawk contains a type L₃ dependency set. If the richness of Mohawk is comparable to that of English, then a great deal of care must be exercised in establishing this fact. The considerations which I raise concerning Chomsky's argument against finite state grammar in Chapter 3 will apply equally here. A proof that Mohawk is not a context-free language which uses the notion of an L₃ dependency must involve
then, these three steps. It is easy to see that neither Postal nor Chomsky has even begun to carry them out. I do not mean to claim that it couldn't be done. I merely mean to point out that it hasn't been done as yet.
5. PERFORMANCE MODELS
In the preceding chapters we have been discussing the various arguments put forward by Chomsky and others in an attempt to show that natural languages cannot have certain kinds of grammar. In particular, we examined arguments purporting to show that English could not have a weakly adequate finite state grammar, and that Mohawk could not have a weakly adequate context-free grammar. In Chapter 6 we examine some applications of the mathematical theory of languages to a different, though related, kind of problem. We examine several arguments brought by the transformational linguists following Chomsky against what they call behaviorist, associationist, or S-R linguistic performance models. Before we turn to their arguments, however, we should try to state in some detail just what a performance model is. The rest of the present chapter will be devoted to just that. In section 1, we review Chomsky's views on linguistic performance models. In section 2, I present a brief discussion of behaviorism and an account of what I take behaviorist performance models to be. And in the final section, I discuss briefly the concept of a behavioral repertory.
5.1. CHOMSKIAN LINGUISTIC PERFORMANCE MODELS
Chomsky makes a sharp distinction between what he calls linguistic performance and what he calls linguistic competence. Competence is the "speaker-hearer's knowledge of his language". Performance is the "actual use of language in concrete situations".¹

1 N. Chomsky, Aspects of the Theory of Syntax (M.I.T. Press, 1965), 4.
By knowledge of the language, Chomsky means little more than knowledge of the grammatical rules of the language. Obviously, a real speaker's knowledge of the language includes a great deal more than this. He knows a large number of words, knows their meaning, whatever this might come to in detail, and knows what is an appropriate utterance in a wide variety of circumstances. On the other hand, it is certainly not true that most speakers could describe the grammatical structure of their native toungue, nor could they formulate even rudimentary rules of grammar. Chomsky has a very special sense of 'know' in mind, and likewise a very restricted sense of 'language'. He sometimes refers to our ability to recognize and produce grammatical utterances as 'tacit knowledge' which is beyond the range of consciousness. Obviously, every speaker of a language has mastered and internalized a generative grammar that expresses his knowledge of the language. This is not to say that he is aware of them, or that his statements about his intuitive knowledge are necessarily accurate. Any interesting generative grammar will be dealing, for the most part, with mental processes that are far beyond the level of actual or even potential consciousness.2 1 think that it is best not to take such phrases as 'tacit knowledge', 'intuitive knowledge' and 'internalized generative grammar' as referring to any ordinary kind of knowledge. Presumably, what Chomsky has in mind here is something analogous to a computer program, with some part of the program corresponding to the internalized rules of the grammar. In order to avoid confusion here it should be pointed out that the rules of the grammar, even though it is called a generative grammar, are not to be thought of as a psychological model. The rules are supposed to generate the language only in the sense that they provide a convenient and concise way of characterizing the set of grammatical strings constituting the language. The grammar is not meant as a description of how a speaker of the language might actually go about producing a specific utterance. A description of this psychological process would be part of what Choms2
2 Chomsky, Aspects (1965), 8.
means by a linguistic performance model, and would take into account facts of actual linguistic performances, facts of a quite different nature from those of interest to grammarians. A notorious example, often cited by Chomsky and others, is that a record of normal speech does not consist of an uninterrupted sequence of grammatical sentences. There are numerous mistakes, false starts, changes of plan in mid-sentence, meaningless sounds like ah and er interspersed, and so forth. Another is that certain grammatical constructions are more difficult to understand than others. A performance model, Chomsky thinks, should account for facts like these.
Another distinction mentioned by Chomsky which it will be useful to have is that between language performance and language acquisition. The distinction is, roughly, between being able to use the language and learning the language. A performance model is supposed to explain how we put the 'internalized' grammatical rules to use in producing utterances. The acquisition model is supposed to explain how we 'learned' those rules in the first place. Chomsky tries to spell out in more detail just what is involved in his conception of an acquisition model. A child who is capable of language learning must have

(i) a technique for representing input signals
(ii) a way of representing structural information about these signals
(iii) some initial delimitation of a class of possible hypotheses about language structure
(iv) a method for determining what each such hypothesis implies with respect to each sentence
(v) a method for selecting one of the (presumably, infinitely many) hypotheses that are allowed by (iii) and are compatible with the given primary linguistic data.3

The acquisition model would provide explicit accounts of how each of these tasks is to be performed. The input to such a device, the so-called primary linguistic data, is to "consist of signals classified as sentences and non-sentences, and a partial and tentative pairing
3 Chomsky, Aspects (1965), 30.
of signals with structural descriptions".4 The output of the device is supposed to be the most likely grammar for the language of which the input is a sample. This grammar Chomsky sometimes refers to as a theory of the presented language.

...the device has now constructed a theory of the language of which the primary linguistic data are a sample. The theory that the device has now selected and internally represented specifies its tacit competence.5

It sounds as if a child who learns a language has carried out an extensive program of scientific research. Indeed, Chomsky often speaks as if the task of a child learning to speak and the task of a linguist studying the language are one and the same.

This account of language learning can, obviously, be paraphrased directly as a description of how the linguist whose work is guided by a linguistic theory meeting conditions (i)-(v) would justify a grammar that he constructs for a language on the basis of given primary linguistic data.6

It is not clear just how we are to think of the relation between the acquisition model and the performance model. From Chomsky's identification of the acquisition model with the activities of the professional linguist it seems clear that we cannot take such a model as a serious proposal about how actual children learn language. There is no reason to suppose that children learn language in the way that a linguist does, that is, by forming hypotheses and testing these against data, and so forth. For example, as an account of what children learn when they learn a language, it is certainly wrong, or at least woefully incomplete. Chomsky leaves the impression that learning a language is just learning the grammar of the language. There is no suggestion that the child must also learn how to use the language, and certainly no description or mention of what this might involve. We might expect that a language acquisition device would, in the process of modeling the learning of a language, become, in the successful case, a performance model.
4 Chomsky, Aspects, 32.
5 Aspects, 32.
6 Aspects, 33.
Otherwise it is hard to take seriously the claim that it is really a model of how children learn language. I think that we can best understand what is on Chomsky's mind in the following way. It is true that an adult speaker of a language can, when specifically asked, often tell us that a given utterance is grammatical. We can say, without stretching the words too much, that he knows that the utterance is grammatical. Does it follow, since he knows that the utterance is grammatical, that he must therefore know the grammatical rules of the language? He must, it seems; how else could he know that the given utterance was grammatical? Since it is patently obvious that most people cannot tell you what the rules of grammar of their language are, we must suppose that their knowledge of these rules is unconscious or tacit or intuitive. What is wrong with this little deduction? First, it does not follow that he must know what the rules are, in any sense, stretched or otherwise, in order to know that the utterance is grammatical. He may have been told by someone else, a reliable informant we may suppose, that it was grammatical, and remember this. It may, for example, be a piece of primary linguistic data. Second, we should think more carefully about how we usually answer questions like "How does (can) he know this?" We don't ordinarily respond to such inquiries with psychological theories. We respond by explaining how the person is in a position to know what he knows. For example, we may say that he is a native speaker, or that he is a fluent speaker, or that he is a linguist specializing in that language, or that he just ran across that sentence in a book on grammar, or answers like these. Thus, the way in which Chomsky is proceeding here is not like the way we ordinarily go about answering these sorts of questions. This does not mean that he is wrong, or confused, but it does indicate that he wants a very special kind of answer to his 'how' questions, and that in his usage words like 'tacit' and 'intuitive knowledge' do not have anything like their ordinary meaning. Further, in the interesting case, we find that people have a general ability to make judgments of grammaticalness. That is, a given subject can, for a wide variety of utterances, judge whether or not the property is present. Thus certain answers
to the how question are ruled out. It becomes implausible to suppose that the subject has seen or heard all these utterances in the past, was told by a reliable informant about the grammaticality of each one, and is now remembering all that. We must supply some other kind of account. Chomsky thinks that to account for this systematic performance we must suppose that the subject is 'using' some kind of criterion; again 'using' cannot be taken in its literal sense, since the subject is clearly not aware of doing any such thing. The whole process takes on the aspect of an automatic mechanism, even though the language of conscious activity is used to describe it. It seems clear that, even with an extremely wide definition of what constitutes a grammar for a language, Chomsky's interests in language acquisition are very narrow, as they seem to be limited to learning the grammatical rules and little else. He does not mean by language acquisition how a person comes to be able to use the language. Use of the language is supposed to be explained by a performance model, but only one aspect of this device is to be acquired by an acquisition model, namely, the grammar of the language.

5.2. BEHAVIORIST PERFORMANCE MODELS
Behaviorists, with their methodological roots in early empiricist philosophy, have been more interested in how behavioral patterns are learned than in how they are produced by the organism. Theories about learning have been dominated by the idea of a conditioned response, first explored experimentally by the Russian psychologist Pavlov, extended in this country by Watson, and later developed impressively by Skinner in his conditioning experiments with rats and pigeons. What went on inside the animal, while not completely ignored, tended to be slighted in favor of what happened outside the animal, mainly the record of impinging stimuli and emitted behavior. The concept of behavior is, like most basic concepts, difficult to define explicitly. Skinner says of it: "By behavior, then, I mean
simply the movement of an organism or of its parts.... It is convenient to speak of this as the action of the organism upon the outside world."7 The term 'action' obviously does not add much to our understanding, nor do terms like 'response' or 'bodily motion'. The latter has a nice, hard-nosed scientific sound to it, but even so staunch a behaviorist as Skinner takes descriptions like pulled a string in order to obtain a marble as perfectly acceptable. The concept of behavior is perhaps best delimited in the negative, as not including, for example, thoughts, wishes, pains, and other such 'mentalistic' entities. The idea that the activity be detectable, or, as we should say, publicly observable, is perhaps the more important part of the notion of behavior. The early founders of behaviorism wanted to exclude from the science of psychology those so-called private events, for whose study we must depend on the introspective reports of trained subjects. This original requirement of behaviorism, that behavior be publicly observable, was not meant, however, to confine behaviorists to study of the gross motions of the body. Changes wholly within the body, so-called covert responses like heart rate and blood pressure, as well as changes which can only be observed with sophisticated instruments, like electroencephalograms, were also included under the rubric of behavior.
In discussing performance models, it will be convenient to have several elementary distinctions. First is the distinction between behavioral types and behavioral tokens. The behavioral token is an individual concrete physical event. For example, the first bar press that an animal makes in an operant conditioning experiment is a behavioral token. The token is locatable in space and time. The behavioral type, on the other hand, is the set of all behavioral tokens which meet some criterion of relevant similarity. For example, the animal does not always press the bar with the same force, or with the same paw, or at the same spot on the bar. There are many differences between one bar pressing event and the next. Yet they must be similar in some respect; otherwise, I suppose, we would
7 B. F. Skinner, Behavior of Organisms (Appleton-Century-Crofts, New York, 1938), 6.
not call them all bar pressing events. Given the criterion of relevant similarity, we shall refer to the set of all events (i.e. tokens) which meet the criterion as a behavioral type. Obviously, it will make some difference when we come to speak of the number of things that an animal can do, of its behavioral repertory, whether we are referring to behavioral types or to behavioral tokens. If it is to tokens, then we are perhaps remarking on no more than the creature's physical stamina, or the length of its life. If we are counting behavioral types, then we are interested in the kinds of things that the animal can do. In discussions of learning, it is the behavioral type that is the important concept. There is no sense in the idea of learning a behavioral token. Learning a behavior involves being able to repeat the behavior, and one cannot repeat a dated event. What an animal learns, then, is a behavioral type. The second distinction that it will be convenient to have is between atomic and molecular behavior. I shall assume that atomic behaviors are behavioral types but not that all behavioral types are atomic. Molecular behavior is to be, as one might suspect, an ordered sequence of atomic behaviors, temporal order being the most common, but not the only, way of ordering. By the term 'atomic' I do not mean to suggest that behavior is composed of some kind of ultimate particle. What counts as an atomic behavior is determined by the vocabulary available for its description. The choice of vocabulary is determined by a number of factors, not the least of which is inventive genius. It may be that for some purposes one choice of vocabulary is better than another. For example, in studying the maturation process in infants, it may be more interesting to choose a vocabulary which allows us to describe the motions of arms and legs in terms of angles at the joints, the state of the various muscles, and the like. Walking will be a molecular behavior relative to this vocabulary. If, however, we are interested in exploratory behavior, walking may be taken as an atomic behavior. The choice is guided by the usual considerations of smooth-running theory, enlightening generalizations, and simplicity. Now, after a brief discussion of the process of conditioning, we will be ready for a discussion of behaviorist performance models.
Behaviorists divide behavior into two broad categories. A respondent is a purely reflexive response elicited by a particular stimulus to which it is linked genetically. The salivary response of a hungry dog to food, Pavlov's original study, is an example of this type of response. In classical conditioning, studied by Pavlov, the respondent is associated with a new stimulus by the conditioning process. In Pavlov's experiment the dog was conditioned to salivate upon the ringing of a bell instead of upon the presentation of food. The other type of behavior is called an operant, a term introduced by Skinner. An operant is an emitted response for which there is no obvious eliciting stimulus. In Skinner's experiments, a rat is placed in a cage in which there is a bar that the rat can press with its paws. Sooner or later the rat comes to press on the bar. The rate at which the rat does this is called the base rate, and is referred to as the initial response strength. If the bar press is made the condition upon which the rat is to receive food, Skinner found that the rate at which the rat pressed the bar increased impressively. The presentation of food is called, in Skinner's terminology, the reinforcing event, the food is called the reinforcing stimulus, and the rate at which the rat emits the operant after the last reinforcing event minus the base rate is called the response strength of the conditioned operant. This phenomenon is referred to as simple operant conditioning, and an important feature of it is that there is no eliciting stimulus. The only stimulus involved is the reinforcing stimulus, which occurs after the response. An important contribution of Skinner's to the theory of conditioned learning is that of operant shaping. By careful training to a series of successive approximations the animal can be taught to chain together a rather long series of responses. Skinner describes one of his experiments as follows:

As a sort of tour de force I have trained a rat to execute an elaborate series of responses suggested by recent work on anthropoid apes. The behavior consists of pulling a string to obtain a marble from a rack, picking the marble up with the forepaws, carrying it to a tube projecting two inches above the floor of the cage, lifting it to the top of the tube, and dropping it inside. Every step had to be worked out through a
series of approximations, since the component responses were not in the original repertoire of the rat.8

That such complex behavior can be built up by chaining together simpler atomic responses by a process of conditioning was an important discovery for behaviorist psychology. It is worth noting that simple operant conditioning, of which this is an example, requires the formation of response-response bonds to produce the chained sequence. A different kind of conditioning has also been extensively studied. Suppose the occurrence of the reinforcing stimulus, instead of being presented on every occurrence of the operant, is made conditional upon the prior occurrence of some other stimulus. It is found that after a sufficient number of trials, the rat will come to press the bar only on those occasions when the special stimulus, called a discriminated stimulus, has occurred. For example, suppose that the rat is rewarded for a bar press only after a light has flashed. In time, the rat will come to press the bar only after the light flash. This type of conditioning, called stimulus discrimination, can be represented by a stimulus-response table. In the case just described, the table has the following trivial form:

(5.1)   Stimulus      Response
        Light flash   Bar press

The table represents the desired outcome of the conditioning process. It does not represent the process itself, which is quite complicated. To say that a particular response is conditioned to a given discriminated stimulus still has a certain slack, since this could mean that the response invariably occurs following a presentation of the stimulus, or only that it is more likely to occur. A measure of response strength can be worked out for this kind of conditioning, but for simplicity of exposition, I will assume that we mean that presentation of the stimulus guarantees the response. Obviously, it is possible to devise stimulus-response tables with
8 Skinner, Behavior of Organisms (1938), 339.
more entries than one. For example, consider a rat running a multiple-T maze. We might try to condition the rat to turn right if a white card appears above a choice point in the maze, and turn left if a black card appears. For this experiment, the stimulus-response table appears below.

(5.2)   Stimulus     Response
        White card   Turn right
        Black card   Turn left
It is worth noting that an animal which had mastered this stimulus-response table would be capable of running a maze of any length (barring, of course, the untimely death of the creature), provided that the maze is suitably marked with black and white cards at the choice points. We are now in a position to take up directly the question of behaviorist performance models. I will take a behaviorist performance model to be a 3-tuple M = (S, R, C) where S is a finite set of stimuli, R is a finite set of atomic behaviors, and C, the conditioning function, is a mapping from S into R. There is nothing very surprising about this definition, which, as a matter of fact, amounts to a formalization of the stimulus-response table. The surprising thing is the descriptive power which this allows the behaviorist. To illustrate this I consider two examples. As a first example, let us reconsider simple operant conditioning, and especially the formation of response-response chains. What we want to do here is bring that kind of conditioning under the rubric of a performance model as defined above. Suppose that R = {r1, r2, ..., rn} is a set of atomic behaviors, and that an organism has been trained by simple operant conditioning to emit these behaviors in a single sequence r1 r2 ... rn upon the occurrence of some uncontrolled stimulus s0. The details of the conditioning process do not concern us, because we are only interested in representing the outcome of conditioning by means of some behaviorist performance model. What is required is that we make a special assumption about the organism. We assume that for each response ri
there is a unique stimulus si, and that the emission of the response ri automatically guarantees the occurrence of the stimulus si. This assumption amounts to a behaviorist way of saying that the organism can remember its previous response. We are, in effect, describing its memory as a stimulus. The assumption places a minimal burden on the organism's memory, since it is only required to remember its immediately preceding response. We let the set of stimuli for the model be S = {si | 0 ≤ i ≤ n-1}. The conditioning function C is defined as follows: for 0 ≤ i ≤ n-1, let C(si) = ri+1. Given the special assumption, an organism trained to this conditioning function will respond to the stimulus s0 with the entire sequence r1 r2 ... rn. Hence we have represented a response-response chain by means of a performance model. As a second, and perhaps more interesting, example, I show how any finite state automaton can be represented by a behaviorist performance model. Let A = (K, Σ, δ, p0, F) be a finite state automaton. The behaviorist performance model M = (S, K, C) represents A, where the set of responses of the performance model is just the set of states of the automaton. We define S and C as follows. We make the same minimal assumption about remembering responses which we made in the preceding example. That is, let K# = {p# | p in K} be a set of stimuli such that the occurrence of the state p guarantees the occurrence of the stimulus p#. I call K# a set of partial stimuli for the following reason. We let the set of stimuli for the performance model be S = K# × Σ. That is, we think of the total stimulus to the performance model as consisting of pairs (p#, a) where p# is a memory stimulus, and a is an external stimulus. The conditioning function C is defined as: for all (p#, a) in S, C(p#, a) = q if and only if δ(p, a) = q. What this example shows, I think, is that the behaviorists can count finite state automata as among their possible performance models. Actually, more than this can be claimed, since it is immediately evident that any function from a finite set into a finite set can be represented as a finite state automaton. Hence we can conclude that the class of behaviorist performance models is equivalent to the class of finite state automata.
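It may help to see these two constructions in executable form. The sketch below (in Python, and of course no part of the literature under discussion) implements the conditioning function C as a finite lookup table; the function names and the sample automaton, which accepts strings over {a, b} containing an even number of a's, are illustrative choices of my own.

# A behaviorist performance model M = (S, R, C): S a finite set of
# stimuli, R a finite set of atomic behaviors, C a mapping from S into R.
# Here C is simply a Python dict.

def run_chain(C, s0, steps, memory_stimulus):
    """Emit a response chain: by the special assumption in the text,
    each response produces a memory stimulus eliciting the next response."""
    responses = []
    s = s0
    for _ in range(steps):
        if s not in C:
            break
        r = C[s]
        responses.append(r)
        s = memory_stimulus(r)   # the organism 'remembers' its last response
    return responses

# First example: the chain r1 r2 ... rn triggered by the stimulus s0.
n = 4
C_chain = {f"s{i}": f"r{i + 1}" for i in range(n)}         # C(si) = r(i+1)
print(run_chain(C_chain, "s0", n, lambda r: "s" + r[1:]))  # ['r1', 'r2', 'r3', 'r4']

# Second example: a finite state automaton A = (K, Sigma, delta, p0, F) as a
# performance model whose total stimulus is a pair (p#, a): a memory
# stimulus plus an external stimulus.
def fsa_as_performance_model(delta, p0, F, word):
    state = p0
    for a in word:
        state = delta[(state, a)]   # C(p#, a) = q  iff  delta(p, a) = q
    return state in F

delta = {("even", "a"): "odd", ("odd", "a"): "even",
         ("even", "b"): "even", ("odd", "b"): "odd"}
print(fsa_as_performance_model(delta, "even", {"even"}, "abba"))  # True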
In closing this section, a few remarks are in order on the relation of these models to learning theory and to empirical studies. First, the performance model is not a model of how the organism learns. It is possible to construct mathematical models of behaviorist learning theory, and in fact an enormous amount of effort has been devoted to just that. What we have provided here, however, is a way of modeling the organism's behavioral capacities as a result of learning.9 Secondly, I do not claim that for every performance model there is some physical organism which has learned, or even can learn, that particular conditioning function. I merely wish to indicate how broad a class of descriptions the behaviorist can lay claim to as behaviorist descriptions. And finally, I do not know whether or not there is a behaviorist performance model adequate to the linguistic behavior of a human speaker. All the nasty little problems that arise in trying to apply mathematical theories to scientific problems will arise in an attempt to answer this question. But we now know, and for our purposes in this essay this is sufficient, what the behaviorist's account of verbal behavior will look like.
A concept which will figure in two arguments which I will discuss in Chapter 6 is that of behavioral repertory. It is usually characterized as the set of all those things that the animal can do. Fodor, in his book Psychological Explanation, says by way of clarification that "the potential behavior of the organism defines a space of which its actual behavior provides only a sample".10 There he is discussing the possibility of simulating a living organism by an artificial device, and trying to say why a life-long tape recording of a human speaker's utterances would not count as a simulation
9 For a thorough and formal treatment of the relation between behaviorist learning models and behaviorist performance models see P. Suppes, "Stimulus-Response Theory of Finite Automata", Journal of Mathematical Psychology 6 (1969), 327-355.
10 J. Fodor, Psychological Explanation (Random House, New York, 1968), 131.
of the speaker. The reason, he thinks, is that while it is true that the tape recorder can 'say' everything that the speaker actually said, it cannot 'say' everything that the human speaker could have said. The emphasis is on the modal could. There are many things that a human speaker can say, even though in his entire lifetime he never actually utters them. Though the notion of a behavioral repertory seems initially clear, it is in fact far from that. I do not intend to do more in this section than point out some of the difficulties that arise when one tries to count the items in this 'behavioral space'. Since the items to be counted are not all supposed to be actually exhibited by the organism, but are for the most part merely possible or potential behaviors, they are not as easy to count as we should like. We might try counting in terms of behavioral tokens, but I think we could dismiss this proposal as lacking any interest. To speak of a behavioral repertory in this way would amount to little more than a reference to the life span of the organism, or perhaps its physical stamina. The behavioral repertory, if it is to be an interesting set at all, must contain behavioral types as elements. Even with the assumption that the repertory contains just types of things that the animal can do, it is still far from clear how we are to go about counting them. Consider the case of a rat learning to run a multiple-T maze. A rat placed in such a maze does not ordinarily run through the whole maze the first time without any errors. But could it? Sure, why not. There is no reason to suppose that a particularly lucky rat might not do just that. Thus, in one sense of the term, this is something that a rat could do. Further, even if our experimental animal does not do this, nor any other experimental animal either, it still makes sense to say that it could. Yet it is unlikely that Fodor would want to count such a potential performance as part of the animal's behavioral repertory. This points up the fact that the sense of could intended is not that of simple possibility, logical or physical. It won't do to say that the sense intended is that of capacity, since it is well known that rats have a capacity for running mazes, even if they have never been trained. Similarly for ability. We might try something like
the rat knows how to run the maze, but we are now no better off
than when we started. In short, it is not at all clear just what we are to count as among the things that a given rat (or anyone else) knows how to do. As an illustration of the further difficulties which lurk in this notion, consider the following example. Suppose that we have trained a rat to discriminate between black and white cards according to the stimulus-response table (5.2). We can suppose that the rat has been conditioned to avoid a shock in the entry arm of the maze by running into one of the choice arms, and that it has been conditioned to avoid shock in these arms by choosing the right arm if a white card appears and choosing the left arm if a black card appears. It is usually thought that we have now increased the animal's behavioral repertory. The question is, by how much? Under one description we can say that it has learned one thing, namely how to avoid electric shock in T mazes. Or we can say that it has learned to turn right at white cards, and left at black ones, and hence has learned two new things. Alternatively, let us call a maze which has n choice points, each suitably marked by a black or white card, a marked maze of length n. We might find that a rat that has been trained to the stimulus-response table above could run any marked maze we set it to. This would give us grounds for saying that it could run any such marked maze, no matter how long. Let us suppose that there are m such mazes. Does this mean that the rat has a behavioral repertory of at least size m? Further, suppose there is a marked maze of length n where n is so large that the rat would die before it could reach the end. Can the rat run this maze? As we shall see in the next chapter, what is usually invoked to justify an affirmative answer in such cases is an in principle clause. That is, the rat could not actually run such a long maze, but given that it has learned the stimulus-response table, it could do so in principle were it not for certain irrelevant physical limitations. Which of the above methods of counting is correct I will not try to settle here. Suffice it to say at this point that the size of the behavioral repertory, if this is a sensible notion at all, depends upon
(1) what is meant by can in the usual characterization of the notion, (2) how we describe the behavior learned, and (3) the articulation and acceptance of some criterion of irrelevant physical limitation.
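To make the marked-maze discussion concrete, here is a minimal sketch of an organism conditioned to table (5.2) running an arbitrary marked maze. The encoding of a maze as a list of card colors, and the max_moves parameter standing in for the creature's life span, are simplifying assumptions of my own; nothing in the experimental literature is being reported here.

# Running a marked maze of length n, encoded as a list of n card colors.

TABLE_5_2 = {"white": "turn right", "black": "turn left"}

def run_marked_maze(maze, max_moves=None):
    """Apply the conditioned stimulus-response table at each choice point,
    stopping early if the organism's 'life span' (max_moves) runs out."""
    moves = []
    for i, card in enumerate(maze):
        if max_moves is not None and i >= max_moves:
            return moves, "died before finishing"
        moves.append(TABLE_5_2[card])
    return moves, "finished"

print(run_marked_maze(["white", "black", "white"]))
# (['turn right', 'turn left', 'turn right'], 'finished')
print(run_marked_maze(["white"] * 10, max_moves=3))
# (['turn right', 'turn right', 'turn right'], 'died before finishing')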
6.
ARGUMENTS AGAINST THE ADEQUACY OF BEHAVIORIST PERFORMANCE MODELS
In this chapter I present, and examine in some detail, three arguments which have been put forward against behaviorist performance models. The attack has been led by linguists, psycholinguists, and philosophers associated with the transformational grammatical theories of Noam Chomsky. None of the arguments that I consider succeeds in discrediting the behaviorist approach. In analyzing these unsound and sometimes invalid arguments I will not necessarily be taking the other side. While behaviorist theory is a very powerful theory whose limits of explanation have not yet been reached, my remarks in this chapter are not meant to be an argument for its correctness. I merely wish to show that certain attacks upon behaviorist theory are unfounded.
6.1. THE ARGUMENTS
The first argument that I discuss is due to Bever, Fodor, and Garrett.1 Their argument is meant to be a very powerful attack on behaviorist theory. They claim that no explanation framed in accordance with behaviorist principles could ever be adequate to account for certain special kinds of behavior, in particular, for verbal behavior. Their argument has essentially two premisses. One concerns the limitations of behaviorist explanations, and the other concerns the requirements which explanations of verbal behavior must satisfy. The two are thought to be incompatible,
1 T. Bever, J. Fodor, and M. Garrett, "A Formal Limitation of Associationism", in T. R. Dixon and D. L. Horton, eds., Verbal Behavior and General Behavior Theory (Prentice-Hall, 1968), 582-585.
hence behaviorist theory is inadequate. Unfortunately, the premisses which they give do not even remotely begin to establish such a conclusion. In fact we shall be hard pressed to see how the two premisses are even related to each other or to the conclusion. Their whole argument seems to be simply a massive confusion, and its sole virtue is that in trying (vainly) to make sense of it we uncover some very general problems in using facts from mathematical linguistics in psychological discussions. I shall first present their argument, in pretty much their own words, without trying to make much sense of it. Later I will come back to it and discuss in some detail various ways of trying to make it into a coherent argument. They begin by articulating what they take to be a necessary condition on any set of principles which can be called 'associative'. They say that no theory can be called a behaviorist theory unless it satisfies the following condition:

The Terminal Meta-Postulate: Associative principles are rules defined over the "terminal" vocabulary of the theory, i.e. over the vocabulary in which behavior is described. Any description of an n-tuple of elements between which an association can hold must be a possible description of the actual behavior.
They then go on to add that:

The postulate requires only that the vocabulary chosen for psychological descriptions of the output states must also be the vocabulary over which the associative rules are defined.2
Just what they mean by this is, perhaps, less than clear. It is clear, however, that they are referring to associations formed between behavioral elements, that is, the response-response associations which are the basic element in conditioning behavioral chains. They evidently mean this kind of association and not those which are formed in discrimination conditioning, where a stimulus becomes linked to a response. What is not clear is what they mean by an associative "rule", and what it means for a rule to be "defined over" a vocabulary. An especially puzzling locution is "possible
2 Bever, Fodor, and Garrett, "A Formal Limitation of Associationism" (1968), 583.
description of the actual behavior". We return to these matters a little later. It is perhaps worth noting that they do not seem to mean by "associative principles" an account of the conditions under which associations are likely to be formed. For example, they do not mean by "associative rule" a claim like the following: associations are strengthened between atomic behaviors which occur in temporal sequence if every such paired occurrence is reinforced. They are talking simply about the kinds of descriptions of behavior which they think a behaviorist can use. As an example of the burdensome limitations which the Terminal Meta-Postulate is supposed to impose upon its adherents, the authors bid us consider the following example.

Consider what a subject does when he learns to recognize mirror-image symmetry in figures without explicitly marked contours. The infinite set of strings belonging to the mirror-image language is paradigmatic of such symmetrical figures...someone who learns this language has acquired the ability to accept as well formed in the language all and only the strings consisting of a sequence of a's and b's followed immediately by the reverse of that sequence.3

The language referred to here was first studied by Chomsky, and is familiar to us as L2, which we discussed in Chapters 2 and 3. Chomsky has shown that this language is not a finite state language, though it is a context-free language. The formal description which we gave earlier is:
(6.1)   L2 = {x in Σ* | x = yy^R for some y in Σ*}.
where Σ = {a, b} and y^R is y spelled backwards. The fact that this language is not a finite state language means, of course, that no finite state automaton is capable of accepting this language, where "accept" has the precise mathematical sense of Chapter 2. It is with respect to this language that the authors pose what they take to be the crucial question:

The question is whether or not an organism whose behavior is determined by associative principles can select just the set of sequences that satisfy this criterion.4
3 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 583.
4 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 584.
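It is worth pausing to note how trivial membership in L2 is when the whole string is given at once: x belongs to L2 just in case it uses only a's and b's, has even length, and reads the same backwards. The sketch below is mine, not the authors', and a full-string test of this kind deliberately sidesteps the automata-theoretic notion of acceptance; the difference between the two will matter at the end of this chapter.

# Membership in the mirror-image language (6.1): x = y y^R for some y.

def in_L2(x):
    return (set(x) <= {"a", "b"}
            and len(x) % 2 == 0
            and x == x[::-1])      # even-length strings of the form y y^R

print(in_L2("abba"))   # True:  y = "ab", and "abba" = y + reversed(y)
print(in_L2("abab"))   # False: "abab" is not its own reversal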
The answer, they think, is provably no. They do not, as far as I can tell, actually provide a proof. What they do say is:

...the weakest system of rules which allows for the construction of a mirror-image language is precisely one which violates the terminal postulate. That is, it is one which allows rules defined over elements that are precluded from appearing in the terminal vocabulary.5

The set of rules to which they refer contains just the usual phrase structure rules given below:
(6.2)   X -> aXa
        X -> bXb
        X -> I
where X is the initial symbol of the grammar, Σ = {a, b} is the terminal vocabulary, and I is the null word or erasure. I am sure that what they mean by "weakest" in this context is that there is no set of right linear productions which will generate the language. Hence there is no right linear grammar for L2, a well-known fact. Just what conclusion we should draw from the above facts may not be obvious to the reader, but the authors are perfectly explicit.

...the X in these rules explicitly violates the terminal postulate. Thus an organism that has learned the mirror-image language has learned a concept that cannot rest on the formation of associations between behavioral elements.

They go on to add that:

In general, behavioral abilities which involve recursion over abstract elements violate the terminal meta-postulate, since there are usually elements in the description which do not appear in the behavior.6

We might hope for more elucidation of such vague notions as "behavioral abilities which involve recursion" but, alas, there isn't any more said. This appears to be it for their argument. The authors conclude straightaway that: "This appears to us to provide a conclusive proof of the inadequacy of associationism for these
5 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 584.
6 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 584.
natural kinds of behaviors".7 It is perhaps of some note that the authors, in this particular article, agree with our conclusions of Chapter 5, namely that behaviorists can legitimately make use of the finite state automaton as a performance model. They allow that: "... there are indefinitely many behavioral repertoires which can be described by associationistic principles (e.g. all the finite state languages in the sense of Chomsky)".8 There are so many things wrong with this alleged 'argument' that it is difficult to know where to begin. Not the least difficult, I am aware, is figuring out just what the premisses are supposed to be, ignoring for the moment how they get us to the conclusion. Before we consider various proposals in this regard, I want to lay out two more arguments in a similar vein. In a paper entitled "Psychological Theories and Linguistic Constructs", Fodor and Garrett argue to the same conclusion as before, but from different premisses.9 They begin with a short exposition of transformational linguistics. A transformational grammar, they say, is a finite set of rules which generate an infinite set of possible sentences of a natural language. The term 'generate' is used here in its usual mathematical sense, and is not meant to refer to any kind of psychological mechanism or process. For our purposes, we need not worry about the precise nature of the transformational grammar. It is sufficient that we note that it determines, in the typical case, an infinite set of strings which count as the grammatical sentences of the language. The grammatical rules are 'internalized' by the speaker and they constitute his linguistic competence. Their argument does not turn in any way on the kind of grammatical rules appropriate to the spoken language, but only on the fact that they characterize an infinite set of sentences. They go on to say that, "Given such competence, the problem for psycholinguistic theory is to explain how the
7 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 584.
8 Bever, Fodor, and Garrett, "A Formal Limitation" (1968), 585.
9 J. Fodor and M. Garrett, "Psychological Theories and Linguistic Constructs", in T. R. Dixon and D. L. Horton, eds., Verbal Behavior and General Behavior Theory (Prentice-Hall, 1968), 451-477.
speaker utilizes his competence in his linguistic performance".10 They are not, of course, actually going to solve this problem. They do not even intend to try. What they wish to do is merely discredit a certain approach to the problem, namely, the behaviorist or S-R approach. A major premiss in their argument, then, is the following claim:

...it is essential to the competence model that the organism's behavioral repertoire be inherently infinite. That is to say, that the finite set of linguistic rules a speaker internalizes are, in principle, sufficient to provide him with a repertoire of infinitely many distinct responses.11

It is clear from the above passage that the authors identify the set of sentences generated by the grammar with the behavioral repertory of the speaker. On the basis of this identification, and the fact that the set of possible sentences is infinite, they conclude that the behavioral repertory of a native speaker is infinite. This does not mean, of course, that they think that anyone could actually say an infinite number of things, just that there are an infinite number of things that he could say, given that he has 'internalized' the appropriate grammar. This, I suppose, is the point of such phrases as "inherently infinite" and "in principle". Their first premiss is, then, that language users, and hence adequate performance models of them, must have infinite behavioral repertories. The second part of their argument concerns the limitations of the behaviorists' position. They attribute the following views to the S-R theorist.

The S-R theorist views the organism as possessing a finite repertoire of behaviors, any one of which may, on occasion, be selected by specified stimulus parameters.

They go on to add that:

The set of such responses at the organism's disposal is thus inherently finite, although it may provide for appropriate behavior in indefinitely
Fodor and Garrett, "Psychological Theories and Linguistics Constructs" (1968), 454. Fodor and Garrett, "Psychological Theories" (1968), 455.
11
many input situations so long as these situations are similar in relevant respects.12

Although their use of the word "indefinitely" suggests that the organism might have infinitely many inputs, this is not actually what they mean. There is an implicit distinction made here between what we might call stimulus types and stimulus tokens. The indefinitely many input situations spoken of above are the stimulus tokens. They are the individual concrete stimulus events, all of which share a common characteristic, or set of common characteristics, which allows us to group them into a single stimulus type. For example, we spoke of a black card as a stimulus in an earlier example. In the course of training, or even in the course of running a single maze, the experimental animal may encounter any number of black cards at choice points in the maze. None of these cards will be exactly the same, some being larger than others, or perhaps slightly bent, or a slightly different shade of black, and so forth. Nevertheless, they all count as a single stimulus type because they are similar in the relevant respect. The authors summarize their characterization of the S-R position in the following way:

We assume, then, that the S-R model, if it makes any claim at all, must at least require (a) that the behavioral repertoire is finite, and (b) that the stimulus parameters that select behaviors are characterizable by reference to some finite set of parameters common to all stimuli capable of eliciting the same response.13

Their second premiss is, then, that the behaviorist is committed to models with a finite number of inputs and a finite number of outputs. The authors provide little in the way of clarification of what they mean by "behavioral repertoire", and not much justification for the assumption that it must be finite. About all they say is that:

The rejection of S-R performance models by psychologists who adhere to the generative competence model is required by the fact that the
12 Fodor and Garrett (1968), 455.
13 Fodor and Garrett (1968), 455.
former proposes a view of the organism as consisting of no more than a response library and a look-up. ...if the mathematical character of the organism is to be expressed by a finite set of pairings of input and output states (by a finite set of associations between stimuli and responses) an impoverished view of the organism's behavioral repertoire is entailed.14

It is clear from these passages that they have made certain assumptions about the nature of the behavioral repertory. In particular they do not count behavioral sequences as members of the behavioral repertory. Otherwise, we could simply claim that there is no limit on the length of input sequences, hence no limit on the length of output sequences, and from this it would follow that the organism could produce an infinite number of such sequences. Since they claim that the behavioral repertory is finite, they must assume that these sequences are not part of the repertory. Their argument, then, has essentially these two premisses. First, they assume that the infinite set of sentences generated by a transformational grammar constitutes an infinite behavioral repertory for speakers of a language. And secondly, they assume that behaviorist performance models must have a finite behavioral repertory. From these two very strong assumptions they conclude, validly, I think, that:

...if the linguistic competence model discussed above is correct, no performance model having these essential characteristics of S-R theories could be adequate for verbal behavior.15

I say that they conclude validly because I think that this conclusion follows from their premisses. I do not, however, think that both of their premisses are true. Below I argue that at least one of these premisses is false, but first let us consider a third, and final, argument against behaviorist performance models. The preceding two arguments have been premissed on very general considerations concerning the nature of behavioral repertories. The first pointed to limitations in models based on response-
14 Fodor and Garrett (1968), 455-456.
15 Fodor and Garrett (1968), 456.
response associations, while the second emphasized limitations in models based on stimulus-response associations. The third argument, also due to Fodor and Garrett, appeals to more specific linguistic facts, and emphasizes difficulties on the stimulus side of the model. They argue that there are certain data associated with the perception of speech sounds which are inexplicable on the basis of behaviorist principles. They hope to produce a very strong argument against S-R theories by showing that phonetics, the most 'behavioristic' part of linguistics, gives rise to insoluble problems for behaviorist theory. The general problem to which the authors refer is that of speech recognition. It is an obvious fact that the same sentence can be uttered by different speakers. It is equally obvious that while they are utterances of the same sentence, they do not have the same physical characteristics. The sentence Where are you? uttered by a forty-year-old man will have markedly different physical characteristics from the same sentence uttered by a four-year-old girl. What we perceive in either case, however, is the same sentence. The problem of speech recognition is to explain how we get from the acoustic input to the perception of the message. A possible approach to the speech recognition problem is through the linguist's system for describing spoken utterances. Linguists have long recognized that it is possible to divide a spoken utterance into a linear sequence of discrete units. Most writing alphabets are based, to some extent at least, on this insight. The forty or so linguistic atoms now in use are variously called allophones, sound segments, or phonemes, and with these atoms linguists are able to describe an utterance in any of the known human languages. There have been various methods used in defining or describing individual phonemes. The most common is by example: the phoneme /s/ is defined (somewhat vaguely) as the set of all those sounds which are like the last sound in bus. Other methods have been proposed, however. For example, we can try to characterize the phoneme in terms of the articulatory configuration of the vocal tract required to produce the phoneme, but these details need not detain us here.
The phoneme system, however the individual phonemes are described, is now widely used by linguists. Its use has led to a number of insights, and has made possible the statement of linguistic rules in a simple and elegant fashion. For example, the rule for forming plurals of regular English nouns can be stated as: if the noun ends in /s/, /z/, /š/, /ž/, /č/, or /ǰ/ (e.g. bus, cause, bush, garage, beach, badge) add /əz/ at the end; if the noun ends in /p/, /t/, /k/, /θ/, or /f/ (e.g. cap, cat, cake, fourth, cuff) add /s/ at the end; for all other cases add /z/ at the end. Any other method proposed for stating this rule has been unduly cumbersome, inelegant, or incorrect. Although the assumption that speech is composed of discrete entities is now widely accepted among linguists, there are certain difficulties with the notion. For example, if one examines an utterance as a physical event, one finds that there are no acoustical properties which mark the boundaries of the phonemes which the linguist says compose the utterance. We would like to find some physical event correlated one-to-one with the transition from one supposed phoneme to the next. Nothing has as yet been found which appears plausible. It is, of course, possible to attack the segmentation problem backwards and somewhat arbitrarily. By using tape recordings, and physically cutting and splicing the tape, it is possible for linguists to map their judgments of phonetic juncture backwards onto the physical sound as preserved on the tape. Once this is done it becomes possible to examine the physical structure of the acoustical signal associated with a particular phoneme. In other words, it is possible to bypass the segmentation problem by recording the utterance on recording tape, and then marking the tape at those points at which a trained linguist says that a juncture between phonemes occurs. The actual procedure is, to be sure, more involved than I have described it, but this is in essence what they do. A second problem studied by instrumental phoneticians is that of developing a concise and revealing description of the
physical structure of the acoustic signal associated with each phoneme by the above method. They would like to be able to characterize each phoneme in terms of a small number of physical characteristics, say power spectrum, duration, formant behavior, and the like. The problem is to come up with a set of characteristics which are common and jointly peculiar to, say, /s/ in every acoustic environment in which it occurs, and for every speaker. This problem also has defied solution. Let us refer to this problem as the physical phoneme problem. Research into the physical phoneme problem has given rise to data which Fodor and Garrett claim are relevant to, and troublesome for, behaviorist performance models. The behaviorist can construe the problem as a stimulus discrimination phenomenon, with the physical phoneme as stimulus, and the 'perceptual' phoneme, as recognized by a trained listener, as the response. The problem is to set up the stimulus-response table, and of course, the difficulty lies in developing the stimulus side. Fodor and Garrett argue that the problem is, in fact, insoluble. They say that:

There appears to be no clear structure in the acoustic signal which can be the basis for defining perceptual response to particular constructs. Rather, the basis for discrimination must be sought again in a structure which the perceiver imposes on the acoustic signal.16

What they mean by "constructs" here is the set of characteristics which is used to describe the segment of the acoustic signal associated with a particular phoneme, in other words the physical phoneme. The "perceptual response" is just the particular phoneme, which they are identifying with a behavioral type. What they mean by "a structure which the perceiver imposes" we defer to a later part of the discussion. The authors cite a result from instrumental phonetics which bears on the present question. They claim that: "...the same physical event, the same acoustic stimulus, gives rise to systematically different percepts...."17 In a 1952 study by Cooper, Borst, and
Fodor and Garrett, "Psychological Theories" (1968), 460. Fodor and Garrett (1968), 458.
Liberman, it was found that the same burst of energy at a particular point in the sound spectrum is identified as a different sound, depending on the vowel it is adjacent to. For example, a 1440 cps burst is identified as /p/ when adjacent to /i/ but as /k/ when adjacent to /o/. The 1440 cps burst is duplicated on recording tape, and physically spliced into two tapes, one containing the /i/ and the other containing the /o/. This is evidently what Fodor and Garrett mean by "the same acoustic stimulus gives rise to systematically different percepts". The problem for the behaviorist is the following. No matter how we characterize the 1440 cps burst (the simplest way being as just a 1440 cps burst) we are not going to be able to use it as an element on the stimulus side of the stimulus-response table. We may be able to find a set of characteristics which are common to every acoustic segment associated with /p/. But we will never find a set which is jointly peculiar to it. The trouble is that whatever set of characteristics we devise, they will apply equally well to some acoustic segments associated with the phoneme /k/. Hence there isn't any set of characteristics common and jointly peculiar to every acoustic segment associated with the phoneme /p/. It will therefore be impossible to correlate each phoneme with a set of acoustic segments. Hence, it will not be possible to construct the appropriate stimulus-response table necessary for the behaviorist performance model. The behaviorist cannot, they conclude, explain how we identify phonemes.
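By way of contrast, the plural rule quoted earlier in this chapter is easy to mechanize once a phonemic transcription is given; the difficulty just rehearsed lies in getting from the acoustic signal to such a transcription, not in the rule itself. In the sketch below the ASCII phoneme symbols (S for the final sound of bush, T for that of fourth, @z for the plural syllable, and so on) are stand-ins of my own, not the text's notation.

# The regular-plural rule stated over phonemic transcriptions.

SIBILANTS = {"s", "z", "S", "Z", "C", "J"}   # bus, cause, bush, garage, beach, badge
VOICELESS = {"p", "t", "k", "T", "f"}        # cap, cat, cake, fourth, cuff

def plural(phonemes):
    last = phonemes[-1]
    if last in SIBILANTS:
        return phonemes + ["@z"]   # add the syllable schwa + z
    if last in VOICELESS:
        return phonemes + ["s"]
    return phonemes + ["z"]        # all other cases

print(plural(["b", "V", "s"]))   # bus -> 'buses': ends in '@z'
print(plural(["k", "a", "t"]))   # cat -> 'cats':  ends in 's'
print(plural(["d", "o", "g"]))   # dog -> 'dogs':  ends in 'z'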
6.2. ANALYSIS
In the preceding section we have presented three arguments purporting to show that behaviorist performance models are inadequate to account for a large class of natural behavior. In this section we discuss these in more detail, and raise a number of new issues concerning the relationship between linguistics and psychology. The first argument we presented is the most difficult to handle
because it seems almost impossible to understand how it is supposed to work. We can begin with the terminal meta-postulate, so-called. The major difficulty with it lies in the notion of an associative rule. What are associative rules? From the fact that the production rules of the grammar for the mirror-image language L2 violate the terminal meta-postulate we can assume that the authors believe that a production rule is an associative rule. This is further supported by the fact that they mention terminal vocabulary in the statement of the postulate and the X in the production rules is not in the terminal vocabulary of the grammar for L2. It should be remembered that terminal vocabulary has a perfectly precise meaning when used in connection with grammars, and the authors do not provide for us an equally precise meaning when it is used in connection with behavior. We can only surmise from their usage that they are somehow identifying the two, or perhaps making some kind of close analogy. If it is correct to say that the production rules violate the meta-postulate, then we can ask what it is that these rules describe. If anything, it is the set L2. It does not seem likely, however, that they think that L2 is a possible behavior. It is more likely that they think that L2 is a behavioral repertory. Associative rules would then be rules which are used to describe the behavioral repertory of the organism, and hence they determine what the possible behaviors of the organism are. It is still not clear, however, how we can take L2 as a behavioral repertory. I suppose we can think of the organism as one which produces inscriptions, say, of strings from L2. We must be somewhat loose here, and construe the members of L2 both as descriptions of behavior, and as behaviors themselves. The terminal meta-postulate requires, under this interpretation, that in describing a behavioral repertory we never use any vocabulary that cannot be used to describe behavior directly. This requirement seems to reduce the behaviorist to the expedient of simply listing all the behaviors in the behavioral repertory. This limitation is patently absurd. There is just no reason to suppose that the behaviorist is limited to describing any behavioral repertory in this silly way. Nothing in behaviorist principle
Nothing in behaviorist principle prevents him from describing behavior, or sets of behaviors, in any way he likes. It is not how something is described, but what is described, that the behaviorist is concerned about. He doesn't want to be describing 'mental' states, to be sure, but unless the authors think the X in the rules refers to (denotes, describes, or whatever) a mental state, there is no good reason for a behaviorist to object to it. If I am right in this interpretation of the terminal meta-postulate, then I don't see any reason to think that it is true. As a second step in their argument the authors ask us to consider what a subject does when he learns to recognize the mirror-image language. Let us take this request seriously. I would suppose that the subject can scan an inscription of a member of L2 and, after a certain amount of time, emit some behavior which is appropriate to the inscription presented. If we were to set this up as a stimulus-response table, we would not put strings from L2 on the response side of the table. I am not sure what we would put there, but it looks as though we would not need, as the authors seem to think, an infinite number of entries. There is no reason at all for taking the strings from L2 as the behavior, if it really is a matter of learning to recognize these strings. The behavior involved will be something else entirely. Further, it may not be possible for a physical organism to recognize the mirror-image language, if by recognize is meant the same as accept L2 in the precise mathematical sense of Chapter 2. It is well known that L2 is not a regular set and that there is no finite state automaton which accepts it. It is equally well known that L2 is context-free, and that there is a push down automaton which does accept it. Could there be a physical device which was a push down automaton accepting this language? This question hides some extremely difficult problems. First of all, as I stressed in Chapter 2, it is important to distinguish sharply between the formal definition of a push down automaton as a certain 7-tuple, and the informal description which is usually given along with the formal definition. The informal description, in terms of tapes which the automaton reads and writes upon, obscures the fact that accept as formally defined does not refer to any sort of activity at all.
There is, for example, no formal sense to a locution like the automaton is now in the process of accepting w. Let me add, however, that I do not think it would be impossible to provide a formal and precise definition of this usage; my point is that it has not been done by the formal definition of accepts. In order to explore the problems here a little further, let us use the informal description of a push down automaton as a guide, or quasi-formal definition. With this understanding of what a push down automaton can do, is it possible for a physical device actually to carry out its activity? The most plausible answer, I believe, is no. (I say "plausible" because there is no possibility of carrying out a rigorous argument in the present circumstances.) In the computation of a push down automaton which accepts L2, it is necessary to write on the push down tape. In the automaton I described in Chapter 2, exactly half of the input word must be written on the push down tape in the course of the computation. It seems plausible to suppose that a physical device serving as a physical example of that automaton would have a push down tape with some fixed length. We do not have to suppose, of course, that by length here we mean 'so-and-so many yards'. Roughly speaking, length is a measure of how much information a memory device can store, and this will always be some finite number, however information is defined exactly. Let us suppose this length bound is n. Then for an input word of length 2n + 2, the device will be unable to get the entire first half of the input onto the push down tape. Obviously, then, it would be unable to tell whether or not the input word was to be accepted. It is no help to expand the push down tape's capacity, because whatever choice of length we make, there are always input words too long for it to handle. It seems likely that this difficulty, in one form or other, would confront any physical device trying to perform the same task: some input word would exceed the capacity of its internal memory.
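The point is easily put in the form of a small simulation (a sketch only; the stack discipline and the bound n stand in for whatever the physical limits actually are):

    def bounded_recognize(s, n):
        # Try to recognize the mirror-image language with a push down
        # tape of capacity n. Returns True or False when a verdict is
        # possible, None when the first half will not fit on the tape.
        if not s or len(s) % 2 != 0:
            return False
        half = len(s) // 2
        if half > n:
            return None               # tape overflow: no verdict at all
        stack = list(s[:half])        # write the first half on the tape
        for ch in s[half:]:
            if not stack or stack.pop() != ch:
                return False
        return True

    print(bounded_recognize("abba", 10))        # True
    w = "a" * 11                                # input of length 2n + 2
    print(bounded_recognize(w + w[::-1], 10))   # None

No choice of n avoids the third outcome; it is merely pushed further out.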
What follows from this is that recognize does not mean the same thing as accept. It is surely true that a human subject, say, can learn to recognize the mirror-image language. The task is really quite simple, just the sort of thing found on IQ tests for children. But even when we say that a child can recognize the mirror-image language, we do not mean that he could render a decision on any string we might give him. To say just what we do mean here would take us pretty deep, but it is clear that we don't suppose that the child could carry out the computation necessary to accept or reject a string of, say, length 10^10. The authors might object here that while we do not suppose that the child could actually carry out the computation, we do mean that he could in principle do it. What does in principle mean here? One construal, often put forward, is that the only limitation on the execution of the task is that imposed by finiteness of life span and memory. But such a claim depends on an implicit distinction between the memory of a device and the rest of its structure. This distinction is much more difficult to draw than is ordinarily supposed. It may seem, for example, that in the case of an informal push down automaton (if I may speak this way for a moment) the distinction is clear enough. The push down tape is the memory, while the set of states (or perhaps the next state function) is the non-memory part. Hence, while the push down tape may have some finite bound, we can nevertheless see quite clearly that no matter how long an input word might be, we could render the automaton capable of making a decision for it simply by extending the push down tape. No alteration in the next state function is necessary. Hence we could say that the automaton can, in principle, accept L2. Unfortunately, a similar argument can be made to show that there is a finite state automaton which can, in principle, accept L2. First of all, the set of words accepted, in the mathematically precise sense, by a push down automaton with a length bound on the push down tape is a finite state language. To establish this we have to make the following assumptions. First let us suppose that M = (K_M, Σ, Γ, δ_M, Z_0, p_0, F_M) is a push down automaton with the following property: for every p in K_M and every x in Γ, δ_M(p, λ, x) = ∅. We can call such a device a λ-free push down automaton. It is known that for every context-free language L, there is a λ-free
push down automaton M such that T(M) = L − {λ}. That is, there is a λ-free automaton which accepts the language which is exactly like L, with the possible exception of the null word. Next we must produce a plausible way of formally defining a push down automaton with a length bound of, say, n, on its push down tape. A likely candidate is a non-deterministic finite state automaton. The details are as follows:

DEFINITION: Let j = |w|, where w is the longest word such that (q, w) is in δ_M(p, x, y) for some p in K_M, x in Σ, and y in Γ. Let Γ(j + n) be the set of all words w in Γ* such that 1 ≤ |w| ≤ j + n. Let A = (K_A, δ_A, S_0, F_A) be a non-deterministic finite state automaton, where

K_A = K_M × Γ(j + n)
S_0 = {(p_0, Z_0)}
F_A = F_M × Γ(j + n)
Let (p, wx) be any member of K_A such that |wx| ≤ n and x is in Γ. Then for any a in Σ,

δ_A((p, wx), a) = {(q_1, wz_1), ..., (q_m, wz_m)},

where (q_i, z_i) is in δ_M(p, a, x) for all 1 ≤ i ≤ m. For all other (p, wx) in K_A, let δ_A((p, wx), a) = ∅.

The idea behind the definition is this: as long as the word on the push down tape of M does not exceed n in length, the moves which the finite state automaton makes will exactly mirror those of the push down automaton. That is, if M moves to state p with w on the push down tape, then A will move to state (p, w). Note that w may exceed the length bound n, but in that case all subsequent moves by the finite state automaton will be blocked. It is this feature that models the cut off point of the push down tape. If this definition correctly formalizes the intuitive idea behind a push down automaton with a finite push down tape, then it follows immediately that the set of words accepted by such an automaton is a finite state language.
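The construction can also be rendered in executable form. The sketch below assumes the push down automaton is given as a table of λ-free moves, with the top of the push down tape written at the right-hand end of the stack word; it builds only the reachable part of A:

    def bounded_pda_to_nfa(pda_delta, p0, Z0, pda_finals, n):
        # pda_delta maps (state, input symbol, top symbol) to a set of
        # (state, replacement word) pairs. States of the finite state
        # automaton are pairs (pda state, stack word); a state whose
        # stack word exceeds the bound n has all its moves blocked.
        start = (p0, Z0)
        nfa_delta, frontier, seen = {}, [start], {start}
        while frontier:
            p, w = frontier.pop()
            if len(w) > n:
                continue              # past the cut off point of the tape
            for (q, a, x), moves in pda_delta.items():
                if q == p and w and w[-1] == x:
                    nexts = {(r, w[:-1] + z) for (r, z) in moves}
                    nfa_delta[(p, w), a] = nexts
                    for state in nexts:
                        if state not in seen:
                            seen.add(state)
                            frontier.append(state)
        finals = {(p, w) for (p, w) in seen if p in pda_finals}
        return nfa_delta, start, finals

Raising the bound is then literally a matter of calling the function again with n + 1: every next state set computed for the old bound reappears unchanged, which is the formal counterpart of splicing one more slot onto the quasi-push down tape.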
A further, and perhaps more interesting, result is that there is a perfectly straight-forward way in which the bound n can be increased without changing the essential structure of the next state function. That is, if we want a formal description of the result of adding, say, one more slot to the push down tape, we do not have to alter the structure of the next state function for any (q, w) in K_A such that |w| ≤ n. We merely have to fill in the next state sets for each (q, w) such that |w| = n + 1, and add to K_A all the ordered pairs in K_M × D(n + 1), where D(n + 1) is the set of all words w in Γ* such that |w| = n + 1 + j. This process is so simple and straight-forward that it seems plausible to describe it in the following way: it is merely a matter of increasing A's ability to remember things on its quasi-push down tape. If we apply this line of reasoning to the push down automaton which accepts L2, we come to the following conclusion. The non-deterministic finite state automaton which formalizes the push down automaton M2 with a length bound on its push down tape can actually accept only a finite subset of L2. But merely by adding to its memory in the above fashion, we can increase the size of the subset as much as we like. Hence we are entitled to say that, while the finite automaton cannot actually accept L2, it can do so in principle. Thus, under one construal of 'in principle', a very common one, finite state automata can do everything that push down automata can do, at least in principle. This seems to me a peculiar conclusion. I think it points up the importance of providing a much more careful specification of what it is for a device or organism to be in principle capable of something. In objecting to the behaviorist's inability to account for behavior associated with recognizing (so-called) the language L2, the authors mention recursion. They say that "behavioral abilities which involve recursion over abstract elements violate the terminal meta-postulate". The puzzle here is what they mean by "involve". If we assume for the moment that L2 is a behavioral repertory (which isn't likely), we can see that one way of describing L2 involves recursion in the following precise sense: the production rules of the grammar for L2 have the variable X appearing on both sides of the arrow.
This is what is usually meant by saying that a set of rules involves recursion. But there is nothing essential about this. There are other perfectly good definitions of L2 which do not involve recursion in any way. The authors themselves provide one when they describe the mirror-image language as the set of "all and only those strings consisting of a sequence of a's and b's followed immediately by the reverse of that sequence". This is a definition of L2 which does not involve recursion, but which is nevertheless adequately precise. And of course, there are numerous other definitions which are satisfactory. Whether or not the authors would find any of these in violation of the terminal meta-postulate, I do not know. It is possible that the authors think that recursion is involved in some more directly psychological way. That is, they may think that a subject who has learned to recognize the mirror-image language has 'internalized' the production rules of G2, and 'uses' these rules in making judgements about presented strings. The analogy here, I suppose, is to some kind of computer program, though I am hard put to work out just how we should take this analogy. Crudely, we might suppose that the subject represents the rules in some other language, or in some system of representation, and has available a program for putting the rules to use in the recognition task. This program might, for example, generate copies of strings from L2, in some exhaustive order, and compare them with the input string. There is no reason, however, to suppose that a human subject who has learned to recognize L2 does anything like this at all. He may just as well have 'internalized' the next state function of the finite state automaton which represents a push down automaton with a finite push down tape. There is no reason to suppose that one is more likely to be 'internalized' than the other since, as a matter of fact, there is no reason, period, to suppose that either is involved in any way in the activity of a human subject performing such a recognition task.
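To see how little recursion has to do with the matter, notice that the authors' description quoted above goes over directly into a test which mentions no auxiliary symbol and calls nothing recursively (a trivial sketch):

    def in_L2(s):
        # A non-empty sequence of a's and b's followed immediately by
        # the reverse of that sequence -- no recursion anywhere.
        half = len(s) // 2
        return (len(s) > 0 and len(s) % 2 == 0
                and set(s) <= {"a", "b"}
                and s[:half] == s[half:][::-1])

Whether a subject who passes the recognition task has 'internalized' anything answering to the production rules, to this test, or to neither, is exactly what the authors' argument leaves untouched.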
Before we move on to discuss the second argument, let us summarize the discussion of the first one. Concerning the terminal meta-postulate, so-called, I have argued that it amounts to an unwarranted restriction on the way in which sets of behaviors are described. There is no reason to suppose that behaviorists cannot describe behavior, and sets of behavior, in any way they like; their only objection is to descriptions of 'mentalistic' entities. Concerning the recognition of the mirror-image language, I argued two things. First, if we take the 'recognition' task literally, there seems to be no connection between the terminal meta-postulate premiss and this one, since the behavioral repertory for the recognition task need have at most two members. And secondly, the recognition task for a human subject has little in common with the formal definition of accept used in automata theory, and there is little reason to believe that a human subject could actually carry out the full task of accepting L2. We considered a possible objection to this argument in which it is held that a human subject could in principle carry out this task. Using the usual rather vague explication of 'in principle', I argued that a finite state automaton could in principle accept L2, and took this result to show that a much more careful explication of 'in principle' must be given. Finally, I considered briefly the irrelevance of the authors' charge that recursion is the source of the behaviorist's alleged difficulties. Let us now consider the second argument, that by Fodor and Garrett. They base their argument, it will be recalled, on two premisses. The first is that the theory of transformational grammar implies that speakers of a natural language must have an infinite behavioral repertory. The second is that behaviorists are committed to explanations of at most a finite behavioral repertory. At least one of these two premisses is surely false. The essential feature of their argument, it will be recalled, is that a speaker of a natural language has available to him a grammar which generates an infinite set of sentences. It does not matter for their argument what the precise nature of the grammar is, so long as it generates an infinite set. It is clear that the authors identify the set of sentences in the language with the behavioral repertory of the speaker. This is what they mean by saying that the rules "a speaker internalizes are, in principle, sufficient to provide him with infinitely many distinct linguistic responses". It is hard to see, however, how such a bold claim could be supported.
While it may make some sense to say that there is no bound on the length of sentences in the language, it does not make much sense to say that there is no bound on the length of an actual utterance. I take it as obvious that no human speaker could complete a sentence containing 10^10 words. Life is too short. It is sometimes alleged that since we cannot draw the line of admissible length, there is no such line. But it does not follow that since we cannot place a least upper bound, we cannot place some upper bound. An utterance which would take four billion years to complete is not one which a human being can say. Obviously, then, with an upper bound on length, the number of sentences a speaker can say is finite, contrary to their assumption.
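The arithmetic behind these remarks is easily checked; the speaking rate below is my assumption, and nothing turns on its exact value:

    SECONDS_PER_YEAR = 60 * 60 * 24 * 365

    # A sentence of 10**10 words, at a generous three words per second:
    years_to_say = 10**10 / (3 * SECONDS_PER_YEAR)
    print(round(years_to_say))      # about 106 -- longer than a lifetime

    # Conversely, an utterance occupying four billion years:
    words = 4 * 10**9 * SECONDS_PER_YEAR * 3
    print(f"{words:.1e}")           # about 3.8e+17 words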
In placing such a bound on the length of producible utterances, we are not making any assumptions about the finitude of the language in question. That is a matter settled by recourse to the grammatical rules of the language. It may be that these rules do indeed generate an infinite set of strings, and that a speaker of the language knows these rules, without it being true that he can produce every string generated. There is more to producing an utterance than merely knowing the rules of the grammar. It is surprising, but it seems that the authors have confused the notion of generate as it is used in grammatical theory with the psychological process of producing an utterance. The two are quite obviously different. There are, at least, limitations imposed by finite life span and finite memory. The authors may object here that they are quite aware of these considerations, and never supposed that a speaker could actually produce these unusually long utterances. Rather, they meant that such utterances are in principle producible by the speaker. They actually do say that the speaker's set of possible linguistic responses is in principle infinite. The criticisms I have made above against this notion of in principle apply equally well here. I have already indicated how difficult it is to form any exact and useful conception of what this phrase indicates. We can, however, adduce a further difficulty in connection with the second premiss of this argument. It is claimed by the authors that behaviorists are committed to finite behavioral repertories. There is good reason, I believe, to accept this as true. However, if we grant the truth of their first premiss, on the grounds that speakers have infinite behavioral repertories in principle, then a good case can be made for saying the same thing for the behaviorist. As an example, consider the maze running experiment discussed in Chapter 5. A rat was taught to run T mazes in which the choice points were marked by black or white cards. A black card indicated that it was to turn left, while a white card indicated that it was to turn right. We can think of this as a case of the rat 'internalizing' two rules. On the basis of these rules, the rat is able to run correctly, on the first trial, many novel mazes which it has never encountered before. (We might call this the creative aspect of maze running.) Now it is doubtless true that some mazes could be so long that the rat wouldn't live to see them through. But, and this is the point, we can claim, just as the authors do for the language user, that the rat could nevertheless in principle run these long mazes. In fact we can claim that it can in principle run any maze which is suitably marked. We therefore have grounds at least as good as the authors' for saying that the rat has an infinite behavioral repertory. But the source of the rat's ability here lies solely in the stimulus discrimination conditioning it received to the black and white cards. Nothing could be more clearly behavioristic than that.
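The two rules, and the unbounded class of mazes they cover, can be written down in a couple of lines (a sketch; the encoding of mazes as sequences of cards is of course mine):

    RULES = {"black": "left", "white": "right"}    # the conditioned rules

    def run_maze(cards):
        # One correct turn per marked choice point, however long the maze.
        return [RULES[card] for card in cards]

    print(run_maze(["black", "white", "white", "black"]))
    # ['left', 'right', 'right', 'left'] -- a maze never seen before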
It seems, then, that the authors' second premiss is false, provided they wish to defend the first one on the basis of in principle considerations. In summary, then, the trouble with this second argument is the following. The first premiss is false, if one makes the usual distinction between linguistic competence and linguistic performance. If the authors wish to defend the first premiss by appeal to in principle considerations, then the second premiss fails. Either behavioral repertories are finite in both cases, or they are infinite in both cases. Hence the authors have provided no grounds for assenting to their conclusion. The third argument which we must consider attempts to show that behaviorist principles are inadequate to account for certain facts in phonetics. We can restate the problem they pose in the following general terms. There is a phonetic alphabet, now almost universally used to represent the utterances of the various spoken languages of the world. We can think of any utterance as segmentable into sequences of these phonemes. From the point of view of speech perception we can consider the phonetic alphabet as an output alphabet. The input alphabet should be sets of physical characteristics, in terms of which the acoustic input associated with any given phoneme can be characterized. The perceptual process could then be explained, at least in its earlier stages, as a mapping from the sequence of acoustic segments onto a sequence of phonemic segments. The behaviorist could then explain this as an example of stimulus discrimination, with the stimuli being characterized by finite sets of properties, and the response being the associated phoneme in the phonetic alphabet. Experiments in which the same acoustic stimulus is physically spliced into different acoustic environments, and observers then asked to identify the sound in terms of the phonetic alphabet, have shown that the same sound is identified as a different phoneme depending on its acoustic neighbor. Thus, the authors conclude that a one-one mapping from physical stimulus to perceptual response is not possible. Let us assume that the authors are right about this, namely that a one-one mapping is not possible. It does not follow from this that behaviorist performance models are excluded. It is clear from the data they cite that the mapping device must be able to remember past inputs and look ahead to coming ones. That is, it must be able to calculate the output for the present input on the basis of the flanking input. This is not so difficult as they seem to think, being well within the capacities of finite state automata. The biggest problem is being able to look ahead on the input tape, but that is easily solved by noting that looking ahead is equivalent to delaying output and looking back. For example, if we want the mapping to take into account one flanking symbol on either side of the current input, we require that the mapping take into account the two symbols preceding the present input. We can approach the physical phoneme problem, now, in a much more sophisticated way.
Let us suppose that we have segmented the physical signal at phoneme junctures as described earlier. What we want to do is partition all these segments into a finite set of classes, each of which will be uniquely characterized by some set of physical properties. Each of these classes will serve as an input symbol to the automaton. We can call this set Σ. We want each triple from this set to produce a separate response, which we think of as the response for the symbol in a flanking context. In accordance with our practice of Chapter 5, we provide one automaton state for each response. Let K_1 = {[w] | w in Σ* and 1 ≤ |w| ≤ 2}. Then K = {p_0} ∪ K_1 is the set of states of the automaton, where p_0 is the start state. The next state function is defined as follows: for any x, y, and z in Σ, δ(p_0, x) = [x]; δ([x], y) = [xy]; and δ([xy], z) = [yz]. The [x] states serve as a sort of null response while the organism waits for enough input to accumulate. The set of final states can be empty, since we are more interested in the response structure of the automaton than in the set of words accepted.
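In executable form the automaton looks roughly as follows (the segment classes and the response function are placeholders; the states and the next state function are the ones just defined):

    P0 = "p0"                              # the start state

    def next_state(state, symbol):
        # delta(p0, x) = [x]; delta([x], y) = [xy]; delta([xy], z) = [yz]
        if state == P0:
            return (symbol,)
        return (state + (symbol,))[-2:]    # remember the last two symbols

    def respond(left, centre, right):
        # placeholder: the phoneme response for centre in this context
        return f"phoneme({left}|{centre}|{right})"

    def transduce(segments):
        # Output is delayed one step: reading z in state [xy] emits the
        # response for y flanked by x and z; the [x] states emit nothing,
        # the null response while input accumulates.
        state, responses = P0, []
        for seg in segments:
            if isinstance(state, tuple) and len(state) == 2:
                responses.append(respond(state[0], state[1], seg))
            state = next_state(state, seg)
        return responses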
Thus the kind of mapping which the authors' data seem to call for can be modeled by a behaviorist performance model without difficulty. There is, therefore, no theoretical obstacle to a behaviorist account of the phonetic data.

7. CONCLUSION
In the preceding chapters I examined, altogether, seven different arguments which made use of results from the mathematical theory of language. In Chapter 3 I discussed two arguments against the weak adequacy of finite state grammars, one by Chomsky, and one by Bar-Hillel and Shamir. In both cases, the weakest part of the argument turned out to be an empirical premiss. That is, both arguments rested on dubious claims about the structure of English. In the latter part of Chapter 3, I examined various arguments which Chomsky has advanced in defense of his empirical premiss, and showed that there is not much reason to accept them. In Chapter 4 I examined in some detail an interesting argument by Postal purporting to show that Mohawk is not a context-free language. Unlike the earlier two arguments, it is the mathematical portion of Postal's argument which fails. I considered several strategies which Postal might employ in repairing his argument, but was unable to find a method which would establish the desired conclusion on the basis of the empirical data he gives. I concluded that the weak adequacy of the theory of context-free grammar is still an open question. In Chapter 6 I discussed several attempts to apply the mathematical theory in arguments in psychology. By considering only those arguments which made some serious attempt to exploit the mathematical theory, the number of arguments we needed to consider was narrowed to three, all of which dealt with the problem of linguistic performance models. The authors, Bever, Fodor, and Garrett, attempted to prove that there could not be an adequate behaviorist performance model. I argued that they were unsuccessful in their attempt, but did not defend the general thesis that such
models are adequate for linguistic behavior. In the remainder of this final chapter I consider briefly some implications of these largely negative results. The failure of all the arguments against the weak adequacy of finite state and context-free grammars greatly increases the importance of the transformational linguist's arguments against strong adequacy. Chomsky, of course, has held all along that phrase structure grammars cannot adequately generate the 'underlying' grammatical descriptions associated with each sentence. As is well known, he has modified his own views on the form of a correct grammatical description. In his more recent writing, for example, he holds that every sentence has both a surface grammatical structure and a deep grammatical structure. While the surface structures of John is easy to please and John is eager to please are identical, Chomsky thinks that their deep structures are quite different. In the surface structure of both sentences John is the subject of the sentence, while in the deep structure John is the logical subject of please in the second sentence, but its logical object in the first. Chomsky has proposed a number of interesting ideas on the way in which a grammar should be organized so that it displays all this alleged grammatical structure. By including such notions as 'logical subject' and 'logical object' in the grammar, Chomsky is clearly employing semantical criteria in his latest grammatical theories. Chomsky hopes, of course, to convert these criteria into syntactical ones by building them into the grammar itself. Hence, for a grammatical theory to be strongly adequate it must meet extremely strong conditions, not all of which are purely syntactical. Chomsky's claims about what counts as the grammatical structure of a sentence become more important, since they are now the sole criteria by which he can exclude the simple phrase structure grammars as inadequate. That is, since no case has yet been successfully made for their weak inadequacy, the only remaining ground for declaring them inadequate is that they don't provide the correct structural descriptions. Hence the case against them now depends crucially on what is meant by 'the correct structural description of a sentence'.
Similar comments apply to the arguments discussed in Chapters 5 and 6, although it is not clear what strong adequacy amounts to in psychology either. My argument in Chapter 5 that the behaviorist can use finite state automata for performance models shows that the descriptive power of behaviorist theory is very broad. It is unlikely that there is any input-output function computed by a physical device which cannot be modeled by a finite state automaton. The transformationalists should not find this very interesting, however, for the following reason. The states of a finite state automaton are useful in modeling the external or total states of a physical device. They do not lend themselves to modeling the internal states. For example, consider the digital memory of an ordinary computer. If we assume a modest memory size of 10^5 words, each 10 binary digits long, a finite state automaton which models this memory will have 2^(10^6), or roughly 10^300,000, states. While this shows that we should not be dismayed by automaton models with astronomically large numbers of states, and that they are perfectly feasible physically, the transformationalist will surely object that the very fact that the model states are external or total states renders the model singularly uninteresting. In terms of the example, we cannot describe the contents of one memory location without describing the total contents of the memory. Models of linguistic behavior will be similarly opaque; they will enable us to compute the output behavior as a function of the input, but they will not enlighten us on how the modeled device itself computed the output. The transformationalists want the model to do more than just mimic the input-output function of the modeled device. They want it to calculate the function in the same way as the modeled device. I think it should be clear that the transformationalist is demanding, in effect, that the model be strongly adequate.
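The contrast between total states and structured memory can be made concrete in a few lines (a toy sketch, with the memory scaled down from the figures above):

    import math

    # 10**5 words of 10 bits is 10**6 bits, hence 2**(10**6) total states:
    bits = 10**5 * 10
    print(round(bits * math.log10(2)))   # 301030: some 10**300,000 states

    # Structured view: a single location can be read or written by itself.
    memory = [0, 0, 0, 0]
    memory[2] = 1                        # touches one cell only

    # Total-state view: the state is the entire contents taken at once;
    # the same write is just a move between two unanalysed states.
    state = (0, 0, 0, 0)
    state = (0, 0, 1, 0)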
Psycholinguists like Bever, Fodor, and Garrett are convinced that the transformational account of grammar is correct. They take seriously the idea that a speaker of a language has 'internalized' a transformational grammar for the language, and that this internally represented grammar is in some crucial way involved in the actual production of sentences. The real trouble with behaviorism, they think, is that it does not make any room for the incorporation of this grammar in a performance model. There has never been much empirical evidence advanced in support of this hypothesis, and it remains at the present time a cloudy subject. There are a number of problems which arise at the theoretical level itself. For example, does the rat which has learned to run marked mazes internalize the following rule: turn left at black cards and right at white ones? Does the rat have an internal representation of this rule? Is someone doing long division using a rule in the same way as someone using a rule of grammar? What is the difference between these three cases? These questions and a host of others come up as soon as we try to take seriously the transformationalists' criteria for an adequate performance model. The foregoing considerations indicate, I think, that there are a number of very serious problems with the notion of strong adequacy, both in the theory of grammar and in the theory of performance models. The failure of all the arguments concerning weak adequacy brings to prominence, in an especially urgent way, the necessity of resolving these problems.
BIBLIOGRAPHY
Arbib, Michael, "Memory Limitations of Stimulus-Response Models", Psychological Review, 76 (1969), 507-510.
Bar-Hillel, Y., and E. Shamir, "Finite State Languages: Formal Representations and Adequacy Problems", The Bulletin of the Research Council of Israel, vol. 8F (1960), 155-166. Reprinted in: Yehoshua Bar-Hillel, ed., Language and Information (Addison-Wesley Publishing Co., Reading, Massachusetts, 1964), 87-98.
Bever, T., J. A. Fodor, and M. Garrett, "A Formal Limitation of Associationism", in: Theodore R. Dixon and David L. Horton, eds., Verbal Behavior and General Behavior Theory (Prentice-Hall, Englewood Cliffs, New Jersey, 1968), 582-585.
Bever, T., J. Fodor, and W. Weksel, "On the Acquisition of Syntax", Psychological Review, vol. 72 (1965), 467-482.
Chomsky, Noam, "Three Models for the Description of Language", I.R.E. Transactions on Information Theory, IT-2/3 (1956), 113-124. Reprinted with corrections in: R. D. Luce, R. R. Bush, and E. Galanter, eds., Readings in Mathematical Psychology, vol. 2 (John Wiley & Sons, Inc., New York, 1965).
—, Syntactic Structures (Mouton & Co., The Hague, 1957).
—, "On Certain Formal Properties of Grammars", Information and Control, vol. 1 (1959), 91-112. Reprinted in: R. D. Luce, R. R. Bush, and E. Galanter, eds., Readings in Mathematical Psychology, vol. 2 (John Wiley & Sons, Inc., New York, 1965).
—, "Formal Properties of Grammars", in: R. D. Luce, R. R. Bush, and E. Galanter, eds., Handbook of Mathematical Psychology, vol. 2 (John Wiley & Sons, Inc., New York, 1963).
—, Aspects of the Theory of Syntax (M.I.T. Press, Cambridge, Massachusetts, 1965).
Fodor, Jerry, Psychological Explanation (Random House, New York, 1968).
Fodor, J., and M. Garrett, "Psychological Theories and Linguistic Constructs", in: Theodore R. Dixon and David L. Horton, eds., Verbal Behavior and General Behavior Theory (Prentice-Hall, Englewood Cliffs, New Jersey, 1968), 451-477.
Ginsburg, Seymour, The Mathematical Theory of Context Free Languages (McGraw-Hill, New York, 1966).
Halle, Morris, "On the Bases of Phonology", in: Jerry Fodor and Jerrold Katz, eds., The Structure of Language (Prentice-Hall, Englewood Cliffs, New Jersey, 1964), 324-333.
Halle, Morris, and Kenneth Stevens, "Speech Recognition: a Model and a Program for Research", in: Jerry Fodor and Jerrold Katz, eds., The Structure of Language (Prentice-Hall, Englewood Cliffs, New Jersey, 1964), 604-612.
McNeill, David, "On Theories of Language Acquisition", in: Theodore R. Dixon and David L. Horton, eds., Verbal Behavior and General Behavior Theory (Prentice-Hall, Englewood Cliffs, New Jersey, 1968), 406-420.
Miller, G. A., and N. Chomsky, "Finitary Models of Language Users", in: R. D. Luce, R. R. Bush, and E. Galanter, eds., Handbook of Mathematical Psychology, vol. 2 (John Wiley & Sons, Inc., New York, 1963), 419-492.
Myhill, John, "Linear Bounded Automata", Wright Air Development Division, Tech. Note 60-165, 1960.
Postal, Paul, "Limitations of Phrase Structure Grammar", in: J. A. Fodor and J. J. Katz, eds., The Structure of Language (Prentice-Hall, Englewood Cliffs, New Jersey, 1964), 137-154.
Skinner, B. F., Behavior of Organisms (Appleton-Century-Crofts, New York, 1938).
—, Verbal Behavior (Appleton-Century-Crofts, New York, 1957).
Suppes, Patrick, "Stimulus-Response Theory of Automata and Tote-Hierarchies: a Reply to Arbib", Psychological Review, 76 (1969), 511-514.
—, "Stimulus-Response Theory of Finite Automata", Journal of Mathematical Psychology, 6 (1969), 327-355.
INDEX
Acceptance 23, 25, 27
acquisition model 73
adequacy, of a grammar 11, 33, 56; of a performance model 87
alphabet 16
atomic behaviors 78, 81
Automaton, finite state 22, 82; non-deterministic finite state 24; push-down 26; linear bounded 30
Bar-Hillel, Y. 13, 44, 46-7, 57, 111
behavior 77
behavioral repertory 83-6, 92, 94, 106; token 77-8, 84; type 77-8, 84
behaviorist performance models 76, 79, 81
Bever, T. 87, 111
Chomsky, Noam 9, 10, 33, 35, 111
conditioning function 81
congruence relation 44
creative aspect of language 52
Fodor, Jerry 83, 87, 91, 95, 111
Garrett, M. 87, 91, 95, 111
Generalized Sequential Machine 31
Grammar, phrase structure 10, 18; generative 10, 72; finite state 11, 34-5, 57; context sensitive 20, 37; context free 11, 21, 37, 56; transformational 12, 91; right linear 22, 33
GSM mapping 32
infinite 52, 92, 106
Language, phrase structure 19
linguistic competence 71
M-dependency 35, 40, 47
mirror-image language 37
molecular behavior 78
Myhill, John 30
Null word 17
operant 79
operant conditioning 79, 81
performance model 14, 71, 73, 76, 79
phoneme 95
phonetics 95, 108
Postal, Paul 13, 56, 61, 111
recursive rules 53, 90, 104
Regular set 23
Shamir, E. 44, 46-7, 57, 111
Skinner, B. F. 76, 79
stimulus 79
stimulus-response table 80
Syntactic Structures 33, 47
Terminal Meta-Postulate 88